Libpuzzle: indexing millions of pictures?

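Let's take the example libpuzzle provides and extend it.

Suppose you have a table that stores the information for each image (path, name, description, and so on). That table includes a field for the compressed signature, which is calculated and stored when you first populate the database. Let's define the table like this: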

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

When you first compute the signature, also compute a number of words from it:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}

These words can now go into a table defined as follows:

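CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    -- words are short slices of the signature; MySQL cannot index a TEXT
    -- column without a prefix length, so a bounded type is used here
    sig_word VARBINARY(32) NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    INDEX (image_id, sig_word),
    INDEX (sig_word, image_id) -- serves the join on sig_word
);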

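Next, insert the words into that table, prefixing each one with the position at which it was found in the signature. That way, a word only counts as a match when it occurs at the same position in both signatures.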

// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id,
        '$sig_word')"); // figure a suitably defined db abstraction layer...
}

With the data initialized this way, it is relatively easy to fetch the images that share words with a given image:

// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) as strength FROM images i JOIN img_sig_words
    isw ON i.image_id = isw.image_id JOIN img_sig_words isw_search ON isw.sig_word =
    isw_search.sig_word AND isw.image_id != isw_search.image_id WHERE
    isw_search.image_id = $image_id GROUP BY i.image_id, i.name, i.description,
    i.file_path, i.url_path, i.signature ORDER BY strength DESC");

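You can improve the query by adding a HAVING clause that requires a minimum strength, which reduces the matching set further.

I make no guarantee that this is the most efficient setup, but it should be roughly workable for what you are trying to achieve.

Essentially, splitting and storing the words this way lets you do a rough distance check without running a specialized function on the signatures.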
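As a rough sketch of that HAVING filter, the query can be built with a minimum-strength threshold. The helper name build_match_query and the $minStrength parameter are made up for illustration and are not part of libpuzzle. The right threshold depends on your data. Both arguments are typed as integers so they can be concatenated into the SQL safely.

```php
<?php
// Hypothetical helper: builds the match query with a minimum-strength filter.
// $minStrength is an illustrative threshold; tune it against your own images.
function build_match_query(int $imageId, int $minStrength): string
{
    return "SELECT i.*, COUNT(isw.sig_word) AS strength"
        . " FROM images i"
        . " JOIN img_sig_words isw ON i.image_id = isw.image_id"
        . " JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word"
        . " AND isw.image_id != isw_search.image_id"
        . " WHERE isw_search.image_id = " . $imageId
        . " GROUP BY i.image_id, i.name, i.description,"
        . " i.file_path, i.url_path, i.signature"
        // Drop images that share fewer than $minStrength words with the base image.
        . " HAVING COUNT(isw.sig_word) >= " . $minStrength
        . " ORDER BY strength DESC";
}
```

The resulting string can be passed to the same $dbobj->query() call used above, for example $dbobj->query(build_match_query($image_id, 20)).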
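To make the idea concrete without a database, here is a small sketch that computes the same "strength" figure in plain PHP: the number of position-tagged words two signatures share. The function names signature_words and match_strength are hypothetical, and the inputs are stand-in strings rather than real libpuzzle signatures.

```php
<?php
// Split a signature into position-tagged words, mirroring the indexing loop above.
function signature_words(string $cvec, int $wordlen = 10, int $wordcnt = 100): array
{
    $words = [];
    $limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
    for ($i = 0; $i < $limit; $i++) {
        // The position prefix makes a word match only at the same offset.
        $words[] = $i . '__' . substr($cvec, $i, $wordlen);
    }
    return $words;
}

// Count the position-tagged words that two signatures have in common.
function match_strength(string $cvecA, string $cvecB): int
{
    return count(array_intersect(signature_words($cvecA), signature_words($cvecB)));
}

// Stand-in strings, not real libpuzzle signatures.
$a = str_repeat('abcdefghij', 3);        // 30 characters
$b = substr($a, 0, 20) . 'XXXXXXXXXX';   // differs only in the last 10 characters

echo match_strength($a, $a), "\n"; // prints 21: every word matches
echo match_strength($a, $b), "\n"; // prints 11: only words inside the shared prefix match
```

The more words two signatures share at the same positions, the closer the images are likely to be. The strongest candidates can then be verified with a real distance calculation.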