Display only the fields that match a MongoDB text search

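After thinking about this for a long time, I think it is possible to implement what you want. However, it is not suitable for very large databases, and I have not worked out an incremental approach yet. It lacks stemming, and stop words have to be defined manually.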

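The idea is to use mapReduce to build a collection of search words, each with a reference to the document and the field it came from. The actual autocomplete query is then done with a simple aggregation that utilizes an index and should therefore be rather fast.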

So we will work with the following three documents

{
  "name" : "John F. Kennedy",
  "address" : "Kenson Street 1, 12345 Footown, TX, USA",
  "note" : "loves Kendo and Sushi"
}

and

{
  "name" : "Robert F. Kennedy",
  "address" : "High Street 1, 54321 Bartown, FL, USA",
  "note" : "loves Ethel and cigars"
}

and

{
  "name" : "Robert F. Sushi",
  "address" : "Sushi Street 1, 54321 Bartown, FL, USA",
  "note" : "loves Sushi and more Sushi"
}

in a collection named textsearch.

The map/reduce stage

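Basically, we process every word in each of the three fields, remove stop words and numbers, and save every remaining word, together with the document's _id and the field it occurred in, to an intermediate collection.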

The annotated code:

db.textsearch.mapReduce(
  function() {

    // Keep the current document in a local variable, because `this`
    // is rebound inside the nested callbacks below
    var document = this;

    // You need to expand this according to your needs
    var stopwords = ["the","this","and","or"];

    // This denotes the fields which should be processed
    var fields = ["name","address","note"];

    // For each field...
    fields.forEach(

      function(field){

        // ... we split the field into single words...
        var words = (document[field]).split(" ");

        words.forEach(

          function(word){
            // ...and remove unwanted characters.
            // Please note that this regex may well need to be enhanced
            var cleaned = word.replace(/[;,.]/g,"")

            // Next we check...
            if(
              // ...whether the cleaned word is in the stopwords list
              // (compared in lowercase, without punctuation),...
              (stopwords.indexOf(cleaned.toLowerCase())>-1) ||

              // ...is either a float or an integer... 
              !(isNaN(parseInt(cleaned))) ||
              !(isNaN(parseFloat(cleaned))) ||

              // or is only one character.
              cleaned.length < 2
            )
            {
              // In any of those cases, we do not want to have the current word in our list.
              return
            }
              // Otherwise, we want to have the current word processed.
              // Note that we have to use a multikey id and a static field in order
              // to overcome one of MongoDB's mapReduce limitations:
              // it can not have multiple values assigned to a key.
              emit({'word':cleaned,'doc':document._id,'field':field},1)

          }
        )
      }
    )
  },
  function(key,values) {

    // We sum up each occurrence of each word
    // in each field in every document...
    return Array.sum(values);
  },
    // ..and write the result to a collection
  {out: "searchtst" }
)

Running this creates the collection searchtst. If it already exists, all of its contents are replaced.
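To experiment with the word filter without a database, the map and reduce logic above can be approximated in plain JavaScript. This is only a sketch: tokenize and countWords are hypothetical helper names, the sample document uses a plain number as _id instead of an ObjectId, and the stop word check is done on the cleaned, lowercased word.

```javascript
// Plain JavaScript approximation of the map and reduce steps.
// No MongoDB is involved; all helper names here are hypothetical.
const STOPWORDS = ["the", "this", "and", "or"];
const FIELDS = ["name", "address", "note"];

// Splits one field into words, strips punctuation and drops
// stop words, numbers and single characters (like the map function).
function tokenize(text) {
  return text
    .split(" ")
    .map((word) => word.replace(/[;,.]/g, ""))
    .filter(
      (cleaned) =>
        cleaned.length >= 2 &&
        STOPWORDS.indexOf(cleaned.toLowerCase()) === -1 &&
        isNaN(parseInt(cleaned)) &&
        isNaN(parseFloat(cleaned))
    );
}

// Counts occurrences per (word, doc, field), which is what
// emit(key, 1) followed by summing in reduce produces.
function countWords(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const field of FIELDS) {
      for (const word of tokenize(doc[field] || "")) {
        const key = JSON.stringify({ word, doc: doc._id, field });
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return [...counts].map(([key, value]) => ({ _id: JSON.parse(key), value }));
}

// Assumed sample: the third example document with a placeholder _id.
const sample = [
  {
    _id: 3,
    name: "Robert F. Sushi",
    address: "Sushi Street 1, 54321 Bartown, FL, USA",
    note: "loves Sushi and more Sushi",
  },
];

console.log(countWords(sample));
```

Each returned entry has the same shape as a document in searchtst, so the effect of a changed stop word list or regex can be checked quickly before rerunning the real mapReduce.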

It will look something like this:

{ "_id" : { "word" : "Bartown", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Bartown", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Ethel", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "note" }, "value" : 1 }
{ "_id" : { "word" : "FL", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "FL", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Footown", "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "address" }, "value" : 1 }
[...]
{ "_id" : { "word" : "Sushi", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "name" }, "value" : 1 }
{ "_id" : { "word" : "Sushi", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "note" }, "value" : 2 }
[...]

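There are a few things to note here. First, a word can have multiple entries, as "FL" does. In that case, the entries belong to different documents. On the other hand, a word can also occur several times in a single field of a single document, which shows up as a value greater than 1. We will use this to our advantage later.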

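Second, all the fields, most notably the word field, are part of the index on _id, which should make the upcoming queries pretty fast. However, this also means that the index will be quite large and, as with all indices, it tends to eat up RAM.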

The aggregation stage

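So we have reduced the documents to a list of words. Now we query it for a (sub)string: we find all words that begin with the string the user has typed so far and return the list of words matching that string. To do this and to get the results in a form that suits us, we use an aggregation.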

This aggregation should be pretty fast, since all the fields needed for the query are part of the index.

Here is the annotated aggregation for the case where the user has typed the letter S:

db.searchtst.aggregate(
  // We match case insensitively ("i") so that differences in
  // capitalization do not reduce our search results
  { $match:{"_id.word":/^S/i} },
  { $group:{
      // Here is where the magic happens:
      // we create a list of distinct words...
      _id:"$_id.word",
      occurrences:{
        // ...add each occurrence to an array...
        $push:{
          doc:"$_id.doc",
          field:"$_id.field"
        } 
      },
      // ...and add up all occurrences to a score
      // Note that this is optional and might be skipped
      // to speed up things, as we should have a covered query
      // when not accessing $value, though I am not too sure about that
      score:{$sum:"$value"}
    }
  },
  {
    // Optional. See above
    $sort:{_id:-1,score:1}
  }
)

The result of this query looks something like this and should be pretty self-explanatory:

{
  "_id" : "Sushi",
  "occurrences" : [
    { "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "note" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "name" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "note" }
  ],
  "score" : 5
}
{
  "_id" : "Street",
  "occurrences" : [
    { "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "address" },
    { "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }
  ],
  "score" : 3
}

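The score of 5 for Sushi comes from the fact that the word Sushi occurs twice in the note field of one of the documents, so its four entries add up to 1 + 1 + 1 + 2. This is the intended behavior.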
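The way the score adds up can be traced with a small plain JavaScript imitation of the $match, $group and $sort stages. It is a sketch only: groupByWord is a hypothetical helper, and the short string ids are placeholders standing in for the ObjectIds of the example.

```javascript
// Rows shaped like the documents in searchtst. The doc ids are
// placeholders ("john", "robert", "sushi"), not real ObjectIds.
const rows = [
  { _id: { word: "Sushi", doc: "john", field: "note" }, value: 1 },
  { _id: { word: "Sushi", doc: "sushi", field: "address" }, value: 1 },
  { _id: { word: "Sushi", doc: "sushi", field: "name" }, value: 1 },
  { _id: { word: "Sushi", doc: "sushi", field: "note" }, value: 2 },
  { _id: { word: "Street", doc: "john", field: "address" }, value: 1 },
  { _id: { word: "Street", doc: "robert", field: "address" }, value: 1 },
  { _id: { word: "Street", doc: "sushi", field: "address" }, value: 1 },
  { _id: { word: "Bartown", doc: "robert", field: "address" }, value: 1 },
];

// Hypothetical helper imitating the three pipeline stages.
function groupByWord(entries, pattern) {
  const groups = new Map();
  for (const entry of entries) {
    // $match: keep only words matching the prefix pattern
    if (!pattern.test(entry._id.word)) continue;
    const group = groups.get(entry._id.word) || {
      _id: entry._id.word,
      occurrences: [],
      score: 0,
    };
    // $push: remember where the word occurred
    group.occurrences.push({ doc: entry._id.doc, field: entry._id.field });
    // $sum: add the per-field count to the score
    group.score += entry.value;
    groups.set(entry._id.word, group);
  }
  // $sort: by _id in descending order
  return [...groups.values()].sort((a, b) =>
    a._id < b._id ? 1 : a._id > b._id ? -1 : 0
  );
}

const result = groupByWord(rows, /^S/i);
console.log(JSON.stringify(result, null, 2));
```

The entry with value 2 is the one that lifts the score for Sushi above its number of occurrences.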

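This may be a poor man's solution: it needs to be optimized for the myriad of conceivable use cases, and it would need an incremental mapReduce to be halfway useful in production environments. Still, it works as expected. hth.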

Edit

Of course, you could drop the $match stage and add an $out stage to the aggregation in order to have the results preprocessed:

db.searchtst.aggregate(
  {
    $group:{
      _id:"$_id.word",
      occurrences:{ $push:{doc:"$_id.doc",field:"$_id.field"}},
      score:{$sum:"$value"}
     }
   },{
     $out:"search"
   })

Now you can query the resulting search collection in order to speed things up. Basically, you trade real-time results for speed.
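A lookup against the preprocessed collection could be built along the following lines. This is a sketch under assumptions: buildPrefixFilter is a hypothetical helper that is not part of the original answer, and escaping regex metacharacters in the user's input is an addition so that characters such as "." or "(" are matched literally.

```javascript
// Hypothetical helper: turns what the user has typed so far into a
// filter on the word stored in _id of the preprocessed collection.
function buildPrefixFilter(typed) {
  // Escape regex metacharacters so the input is matched literally.
  const escaped = typed.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Anchor at the start of the word and ignore case, as in the $match stage.
  return { _id: new RegExp("^" + escaped, "i") };
}

// In the mongo shell, the filter could then be passed to find, e.g.:
//   db.search.find(buildPrefixFilter("S"))
const filter = buildPrefixFilter("S");
console.log(filter._id.test("Sushi"), filter._id.test("Bartown"));
```

Because the grouping has already been done, each matching document carries its occurrences and score directly, and no aggregation is needed at query time.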

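Edit 2: If the preprocessing approach is taken, the searchtst collection of the example should be dropped after the aggregation has finished, in order to save both disk space and, more importantly, precious RAM.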