PostgreSQL 12: Implementing K-Nearest Neighbor Space Partitioned Generalized Search Tree Indexes

The value of indexing

PostgreSQL provides a simple linear distance operator, <-> (straight-line distance). We will use it to find the points closest to a given location.
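As a minimal sketch of what <-> computes for two points, the straight-line (Euclidean) distance can be reproduced in the shell with awk. The function name point_distance is made up for this illustration and is not part of PostgreSQL. Because the coordinates in this article are latitude and longitude in degrees, the result is a planar distance in degrees, not in meters.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PostgreSQL): planar straight-line distance
# between (x1, y1) and (x2, y2), the quantity that point <-> point returns.
point_distance() {
    awk -v x1="$1" -v y1="$2" -v x2="$3" -v y2="$4" 'BEGIN {
        dx = x2 - x1
        dy = y2 - y1
        printf("%g\n", sqrt(dx * dx + dy * dy))
    }'
}

point_distance 0 0 3 4    # prints 5
# The same check inside the database:
#   psql -qtAc "SELECT point '(0,0)' <-> point '(3,4)';"
```

The 3-4-5 triangle makes the result easy to verify by hand before trusting the operator on real coordinates.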

With the data loaded but no optimization and no index in place, a query using this operator produces the following execution plan.

time psql -qtAc "

EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;

"  # <-- closing quote

                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Limit  (cost=418749.15..418749.73 rows=5 width=38) 
        (actual time=2553.970..2555.673 rows=5 loops=1)
  Buffers: shared hit=100 read=272836
  ->  Gather Merge  (cost=418749.15..1580358.21 rows=9955954 width=38) 
                    (actual time=2553.969..2555.669 rows=5 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=100 read=272836
        ->  Sort  (cost=417749.12..430194.06 rows=4977977 width=38)
                 (actual time=2548.220..2548.221 rows=4 loops=3)
              Sort Key: ((location <-> '(29.9691,-95.6972)'::point))
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              Buffers: shared hit=100 read=272836
              ->  Parallel Seq Scan on geonames  (cost=0.00..335066.71 rows=4977977 width=38) 
                                        (actual time=0.040..1637.884 rows=3982382 loops=3)
                    Buffers: shared hit=6 read=272836
Planning Time: 0.493 ms
Execution Time: 2555.737 ms

real    0m2.595s
user    0m0.011s
sys    0m0.015s

The results are as follows (every query returns the same rows, so they are omitted from here on):

name	location
Cypress	(29.96911, -95.69717)
Cypress Point Baptist Church	(29.9732, -95.6873)
Cypress Post Office	(29.9743, -95.67953)
Hot Wells	(29.95689, -95.68189)
Dry Creek Airport	(29.98571, -95.68597)

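So 418749.73 is the optimizer cost to beat, and the query took two and a half seconds (2555.673 ms) to run. That is actually a very good result for PostgreSQL with no optimization at all on a table of nearly 12 million rows. It is also why we chose a larger data set: below 10 million rows, an index makes much less of a difference. Parallel sequential scans are great, but that is a topic for another article.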
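The plan above reads every row, computes its distance to the query point, and keeps only the five smallest. As a rough illustration of that work outside the database, the sketch below does the same over a tab-separated stream. The helper name nearest_k and the three-column input layout (name, x, y) are assumptions made for this example. PostgreSQL uses a bounded top-N heap rather than a full sort, so this only mirrors the shape of the plan, not its implementation.

```shell
#!/usr/bin/env bash
# Illustration only: brute-force nearest-neighbour search over
# "name<TAB>x<TAB>y" lines read from stdin.
# nearest_k is a hypothetical helper name, not a PostgreSQL feature.
nearest_k() {
    local qx="$1" qy="$2" k="$3"
    # 1. "Seq Scan": compute the distance for every input row.
    # 2. "Sort":     order all rows by that distance.
    # 3. "Limit":    keep the first k rows and print their names.
    awk -F '\t' -v qx="$qx" -v qy="$qy" '{
        dx = $2 - qx
        dy = $3 - qy
        printf("%.10f\t%s\n", sqrt(dx * dx + dy * dy), $1)
    }' | LC_ALL=C sort -n | head -n "$k" | cut -f 2
}

# Four sample points; the two closest to (0, 0) are A and C.
printf 'A\t0\t0\nB\t3\t4\nC\t1\t1\nD\t10\t10\n' | nearest_k 0 0 2
```

Every row has to be touched no matter how small k is, which is the cost an ordered index scan avoids.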

Adding a GiST index

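We start the optimization process by adding a GiST index. Because the example query has a LIMIT clause of 5 items, it is highly selective, which encourages the planner to use an index. So we give it an index that works quite well with geometric data.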

time psql -qtAc "CREATE INDEX idx_gist_geonames_location ON geonames USING gist(location);"

Creating the index costs a little time.

CREATE INDEX
real    3m1.988s
user    0m0.011s
sys     0m0.014s

Now we run the same query again.

time psql -qtAc "

EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;

"

                                      QUERY PLAN
----------------------------------------------------------------------------------
Limit  (cost=0.42..1.16 rows=5 width=38) (actual time=0.797..0.881 rows=5 loops=1)
  Buffers: shared hit=5 read=15
  ->  Index Scan using idx_gist_geonames_location on geonames  
            (cost=0.42..1773715.32 rows=11947145 width=38) 
            (actual time=0.796..0.879 rows=5 loops=1)
        Order By: (location <-> '(29.9691,-95.6972)'::point)
        Buffers: shared hit=5 read=15
Planning Time: 0.768 ms
Execution Time: 0.939 ms

real    0m0.033s
user    0m0.011s
sys     0m0.013s

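In this case we see a fairly dramatic improvement. The estimated cost of the query is only 1.16! Compare that with the original cost of the unoptimized query (418749.73). The actual execution time was 0.939 ms (nine tenths of a millisecond), compared with 2.5 seconds for the original query. Planning took slightly longer (0.768 ms versus 0.493 ms), but the estimate improved enormously and the execution time dropped by about three orders of magnitude.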

Let's see whether we can do better.

Adding an SP-GiST index

time psql -qtAc "CREATE INDEX idx_spgist_geonames_location ON geonames USING spgist(location);"

CREATE INDEX 

real    1m25.205s
user    0m0.010s
sys        0m0.015s

Now we run the same query again.

time psql -qtAc "

EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT name, location
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;

"

                                      QUERY PLAN
-----------------------------------------------------------------------------------
 Limit  (cost=0.42..1.09 rows=5 width=38) (actual time=0.066..0.323 rows=5 loops=1)
   Buffers: shared hit=47
   ->  Index Scan using idx_spgist_geonames_location on geonames  
            (cost=0.42..1598071.32 rows=11947145 width=38) 
            (actual time=0.065..0.320 rows=5 loops=1)
         Order By: (location <-> '(29.9691,-95.6972)'::point)
         Buffers: shared hit=47
 Planning Time: 0.122 ms
 Execution Time: 0.358 ms
(7 rows)

real    0m0.040s
user    0m0.011s
sys        0m0.015s

Wow! Now that the planner is using the SP-GiST index, the query costs only 1.09 and runs in 0.358 ms (about a third of a millisecond).

Let's look a little more closely at the indexes themselves and see how they stack up on disk.

Comparing the indexes

Index name	Creation time	Estimate	Query time	Index size	Planning time
No index	0s	418749.73	2555.673 ms	0	0.493 ms
idx_gist_geonames_location	3m 1s	1.16	0.939 ms	868 MB	0.768 ms
idx_spgist_geonames_location	1m 25s	1.09	0.358 ms	523 MB	0.122 ms
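The table lists index sizes without showing how they were measured. One plausible way, sketched here as an assumption rather than the method actually used, is to ask PostgreSQL with the standard pg_relation_size() and pg_size_pretty() functions. The helper name index_size_sql is invented for this example; it only builds the query text so the snippet can be tried without a database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a query that reports an index's on-disk size.
# pg_relation_size() and pg_size_pretty() are standard PostgreSQL functions.
index_size_sql() {
    printf "SELECT pg_size_pretty(pg_relation_size('%s'));\n" "$1"
}

for idx in idx_gist_geonames_location idx_spgist_geonames_location; do
    # Print the query text for each index.
    index_size_sql "$idx"
    # To run it against the database instead of printing it:
    #   psql -qtAc "$(index_size_sql "$idx")"
done
```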

Conclusion

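So SP-GiST turns out to be more than twice as fast as GiST at execution time, about six times faster at planning, and about 60% of the size on disk. It also (and this is the point of this article) supports KNN index searches as of PostgreSQL 12. For this type of operation, there is a clear winner.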
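The ratios behind that comparison can be recomputed from the figures reported above. The sketch below takes the execution and planning times from the two EXPLAIN outputs and the index sizes from the comparison table; ratio is a throwaway helper written for this check.

```shell
#!/usr/bin/env bash
# ratio A B FORMAT: print A / B using an awk printf format string.
ratio() {
    awk -v a="$1" -v b="$2" -v fmt="$3" 'BEGIN { printf(fmt "\n", a / b) }'
}

ratio 0.939 0.358 "%.1f"    # execution time, GiST / SP-GiST   -> 2.6
ratio 0.768 0.122 "%.1f"    # planning time,  GiST / SP-GiST   -> 6.3
ratio 523   868   "%.2f"    # index size,     SP-GiST / GiST   -> 0.60
```

These are single measurements, so the ratios indicate scale rather than serving as a precise benchmark.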

Appendix: Setting up the data

This article uses data provided by the GeoNames Gazetteer.
This work is licensed under a Creative Commons Attribution 4.0 License.
The data is provided "as is" without warranty or any representation of accuracy, timeliness or completeness.

Creating the structure

We start the process by creating a working directory and doing a little ETL.

# change to our home directory
cd
mkdir spgist
cd spgist
# get the base data.  
# This file is 350MB.  It will unpack to 1.5GB
# It will expand to 2GB in PostgreSQL,
#    and then you will still need some room for indexes
#  All together, you will need about 
#  3GB of space for this exercise
#  for about 12M rows of data.

psql -qtAc "

CREATE TABLE IF NOT EXISTS geonames (
geonameid           integer primary key
,name               text 
,asciiname          text 
,alternatenames     text 
,latitude           numeric(13,5) 
,longitude          numeric(13,5)
,feature_class      text 
,feature_code       text 
,country            text 
,cc2                text 
,admin1             text 
,admin2             bigint 
,admin3             bigint 
,admin4             bigint 
,population         bigint 
,elevation          bigint 
,dem                bigint 
,timezone           text 
,modification date  );

COMMENT ON COLUMN geonames.geonameid          
 IS ' integer id of record in geonames database';
COMMENT ON COLUMN geonames.name               
 IS ' name of geographical point (utf8) varchar(200)';
COMMENT ON COLUMN geonames.asciiname          
 IS ' name of geographical point in plain ascii characters, varchar(200)';
COMMENT ON COLUMN geonames.alternatenames     
 IS ' alternatenames, comma separated, ascii names automatically transliterated, 
    convenience attribute from alternatename table, varchar(10000)';
COMMENT ON COLUMN geonames.latitude           
 IS ' latitude in decimal degrees (wgs84)';
COMMENT ON COLUMN geonames.longitude          
 IS ' longitude in decimal degrees (wgs84)';
COMMENT ON COLUMN geonames.feature_class      
 IS ' http://www.geonames.org/export/codes.html, char(1)';
COMMENT ON COLUMN geonames.feature_code       
 IS ' http://www.geonames.org/export/codes.html, varchar(10)';
COMMENT ON COLUMN geonames.country            
 IS ' ISO-3166 2-letter country code, 2 characters';
COMMENT ON COLUMN geonames.cc2                
 IS ' alternate country codes, comma separated, ISO-3166 2-letter country code, 
    200 characters';
COMMENT ON COLUMN geonames.admin1             
 IS ' fipscode (subject to change to iso code), see exceptions below, 
    see file admin1Codes.txt for display names of this code; varchar(20)';
COMMENT ON COLUMN geonames.admin2             
 IS ' code for the second administrative division, a county in the US, 
    see file admin2Codes.txt; varchar(80) ';
COMMENT ON COLUMN geonames.admin3             
 IS ' code for third level administrative division, varchar(20)';
COMMENT ON COLUMN geonames.admin4             
 IS ' code for fourth level administrative division, varchar(20)';
COMMENT ON COLUMN geonames.population         
 IS ' bigint (8 byte int) ';
COMMENT ON COLUMN geonames.elevation          
 IS ' in meters, integer';
COMMENT ON COLUMN geonames.dem                
 IS ' digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' 
    (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. 
    srtm processed by cgiar/ciat.';
COMMENT ON COLUMN geonames.timezone           
 IS ' the iana timezone id (see file timeZone.txt) varchar(40)';
COMMENT ON COLUMN geonames.modification       
 IS ' date of last modification in yyyy-MM-dd format';

"  #<-- Don't forget the closing quote

ETL

wget http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip

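# do this, and go get a coffee.  This took nearly an hour
#   there will be a few lines that fail, they don't really matter much
# Each line goes to its own COPY so that one bad line does not abort the rest.
: > errors.txt    # start with an empty error log

while IFS= read -r line
do
    # psql reads the single line from the pipe, not from allCountries.txt
    printf '%s\n' "$line" |
        psql -qtAc "COPY geonames FROM STDIN WITH CSV DELIMITER E'\t';" 2>> errors.txt
done < allCountries.txt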

Cleanup and setup

Everything else is done from within psql:

psql

-- This command requires the installation
--  of postgis2 from your OS package manager.
-- For OS/X that was `port install postgresql12-postgis2`
-- it will be something similar on most platforms.
-- (e.g. apt-get install postgresql12-postgis2, 
--  yum -y install postgresql12-postgis2, etc.)
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;

ALTER TABLE geonames ADD COLUMN location point;

-- Go get another cup of coffee, this is going to rewrite the entire table with the new geo column.
UPDATE geonames SET location = ('(' || latitude || ', ' || longitude || ')')::point;

DELETE FROM geonames WHERE latitude IS NULL or longitude IS NULL;
-- DELETE 32   -- In my case, this ETL anomaly was too small
--  to bother fixing the records

-- Bloat removal from the update and delete operations
CLUSTER geonames USING geonames_pkey;