How to index a string array column for pg_trgm `'term' % ANY (array_column)` queries?

Why this does not work

The index type (that is, the operator class) gin_trgm_ops is based on the % operator, which takes two text arguments:

CREATE OPERATOR trgm.%(
  PROCEDURE = trgm.similarity_op,
  LEFTARG = text,
  RIGHTARG = text,
  COMMUTATOR = %,
  RESTRICT = contsel,
  JOIN = contjoinsel);

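You cannot use gin_trgm_ops for arrays. An index defined on an array column will not work with any(array[...]), because the individual elements of the array are not indexed. Indexing the elements of an array requires a different type of index, namely a gin array index.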
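To make the difference concrete, here is a minimal sketch of what a gin array index can and cannot do on the test table used in this answer. The index name test_names_gin_idx is made up for illustration.

```sql
-- A plain GIN index on the array column uses the default array operator class.
-- (The index name is a hypothetical one chosen for this example.)
create index test_names_gin_idx on test using gin (names);

-- It can serve whole-element operators such as @> (contains) and && (overlaps):
select count(*) from test where names @> array['praesentium'];

-- It knows nothing about trigrams, so a similarity match against the elements
-- cannot use it and still has to read every row:
select count(*) from test where 'praesent' % any(names);
```

The first query only finds rows containing exactly the element praesentium, which is why this kind of index does not help with fragment searches.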

Fortunately, gin_trgm_ops has been designed so cleverly that it also works with the like and ilike operators, which can be used as an alternative solution (see the example below).

Test table

The table has two columns, (id serial primary key, names text[]), and contains 100000 Latin sentences split into array elements.
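A table of this shape could be set up roughly as follows. This is only a sketch: the two inserted rows are placeholders, not the 100000-row data set used for the timings in this answer.

```sql
-- pg_trgm provides the % operator and the gin_trgm_ops operator class.
create extension if not exists pg_trgm;

create table test (
    id serial primary key,
    names text[]
);

-- Illustrative rows only; the original data set is not reproduced here.
insert into test (names) values
    (string_to_array('fugiat odio aut quis dolorem', ' ')),
    (string_to_array('praesentium distinctio modi nulla', ' '));
```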

select count(*), sum(cardinality(names))::int words from test;

 count  |  words  
--------+---------
 100000 | 1799389

select * from test limit 1;

 id |                                                     names                                                     
----+---------------------------------------------------------------------------------------------------------------
  1 | {fugiat,odio,aut,quis,dolorem,exercitationem,fugiat,voluptates,facere,error,debitis,ut,nam,et,voluptatem,eum}

Searching for the word fragment praesent returns 7051 rows in 2400 ms:

explain analyse
select count(*)
from test
where 'praesent' % any(names);

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5479.49..5479.50 rows=1 width=0) (actual time=2400.866..2400.866 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..5477.00 rows=996 width=0) (actual time=1.464..2400.271 rows=7051 loops=1)
         Filter: ('praesent'::text % ANY (names))
         Rows Removed by Filter: 92949
 Planning time: 1.038 ms
 Execution time: 2400.916 ms

Materialized view

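One solution is to normalize the model by creating a new table with a single name per row. Such a restructuring can be difficult to implement, and sometimes impossible because of existing queries, views, functions, or other dependencies. A similar effect can be achieved without changing the table structure by using a materialized view.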

create materialized view test_names as
    select id, name, name_id
    from test
    cross join unnest(names) with ordinality u(name, name_id)
    with data;

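With ordinality is not required, but it is useful when you want to aggregate the names back in the same order as in the main table. Without an index, querying test_names gives the same results as the main table, in about the same time.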
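As a small illustration of why name_id is kept, the original arrays can be rebuilt from the view by ordering the aggregate on it. This is a sketch that only uses the columns defined in the view above.

```sql
-- Rebuild each array in its original element order using the ordinality column.
select id, array_agg(name order by name_id) as names
from test_names
group by id
order by id
limit 1;
```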

After creating the index, the execution time drops many times over:

create index on test_names using gin (name gin_trgm_ops);

explain analyse
select count(distinct id)
from test_names
where 'praesent' % name;

                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4888.89..4888.90 rows=1 width=4) (actual time=56.045..56.045 rows=1 loops=1)
   ->  Bitmap Heap Scan on test_names  (cost=141.95..4884.39 rows=1799 width=4) (actual time=10.513..54.987 rows=7230 loops=1)
         Recheck Cond: ('praesent'::text % name)
         Rows Removed by Index Recheck: 7219
         Heap Blocks: exact=8122
         ->  Bitmap Index Scan on test_names_name_idx  (cost=0.00..141.50 rows=1799 width=0) (actual time=9.512..9.512 rows=14449 loops=1)
               Index Cond: ('praesent'::text % name)
 Planning time: 2.990 ms
 Execution time: 56.521 ms

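This solution has a few drawbacks. Because the view is materialized, the data is stored twice in the database. You have to remember to refresh the view after changes to the main table. Queries can also become more complicated, because the view has to be joined to the main table.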
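The refresh and the join back to the main table might look like the sketch below. The unique index name is made up for illustration; the concurrent variant assumes a PostgreSQL version that supports it and needs such a unique index on the view.

```sql
-- Re-populate the view after the main table has changed (blocks readers while it runs).
refresh materialized view test_names;

-- Optional: a unique index allows a refresh that does not block reads.
-- (id, name_id) is unique because name_id numbers the elements within each row.
create unique index test_names_id_name_id_idx on test_names (id, name_id);
refresh materialized view concurrently test_names;

-- Fetch the full rows of the main table whose names match the fragment.
select t.*
from test t
where exists (
    select 1
    from test_names n
    where n.id = t.id
      and 'praesent' % n.name
);
```

Using exists rather than a plain join avoids duplicate rows when several elements of the same array match.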

Using `ilike`

We can use ilike on arrays represented as text. To create an index on the array as a whole, we need an immutable function:

-- wrapper declared immutable so that it can be used in an index expression
create function text(text[])
returns text language sql immutable as
$$ select $1::text $$;

create index on test using gin (text(names) gin_trgm_ops);

Use the function in queries:

explain analyse
select count(*)
from test
where text(names) ilike '%praesent%';

                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=117.06..117.07 rows=1 width=0) (actual time=60.585..60.585 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=76.08..117.03 rows=10 width=0) (actual time=2.560..60.161 rows=7051 loops=1)
         Recheck Cond: (text(names) ~~* '%praesent%'::text)
         Heap Blocks: exact=2899
         ->  Bitmap Index Scan on test_text_idx  (cost=0.00..76.08 rows=10 width=0) (actual time=2.160..2.160 rows=7051 loops=1)
               Index Cond: (text(names) ~~* '%praesent%'::text)
 Planning time: 3.301 ms
 Execution time: 60.876 ms

60 ms versus 2400 ms: quite a good result, without the need to create any additional relation.

This solution is simpler and requires less work, provided that ilike, which is a less precise tool than the trgm % operator, is sufficient.

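Why should we use ilike rather than % when the whole array is treated as text? Similarity depends heavily on the length of the texts, so it is very difficult to choose an appropriate limit for finding a word in long texts of varying length. For example, with limit = 0.3 we get these results: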

with data(txt) as (
values
    ('praesentium,distinctio,modi,nulla,commodi,tempore'),
    ('praesentium,distinctio,modi,nulla,commodi'),
    ('praesentium,distinctio,modi,nulla'),
    ('praesentium,distinctio,modi'),
    ('praesentium,distinctio'),
    ('praesentium')
)
select length(txt), similarity('praesent', txt), 'praesent' % txt "matched?"
from data;

 length | similarity | matched? 
--------+------------+----------
     49 |   0.166667 | f           <--!
     41 |        0.2 | f           <--!
     33 |   0.228571 | f           <--!
     27 |   0.275862 | f           <--!
     22 |   0.333333 | t
     11 |   0.615385 | t
(6 rows)
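The limit itself can be inspected and changed per session with the pg_trgm functions show_limit() and set_limit(). The sketch below uses 0.15 purely as an example value; newer PostgreSQL versions prefer the pg_trgm.similarity_threshold setting over set_limit().

```sql
-- Show the threshold currently used by the % operator (0.3 by default).
select show_limit();

-- Lower it for the current session; 0.15 is an arbitrary example value.
select set_limit(0.15);

-- Re-check one of the longer texts from the table above against the new limit.
select similarity('praesent', 'praesentium,distinctio,modi,nulla,commodi') as similarity,
       'praesent' % 'praesentium,distinctio,modi,nulla,commodi' as "matched?";
```

Lowering the limit makes the longer texts match, but it also lets in many unrelated short texts, which is the tuning problem described above.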
