Large data workflows using pandas

I routinely work with tens of gigabytes of data in just this fashion: I have tables on disk that I read via queries, create new data from, and append back to.

It's worth reading the docs, and the later part of this thread, for several suggestions on how to store your data.

Some details will affect how you should store your data, such as the ones listed below. Give as much detail as you can, and I can help you develop a structure.

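1. Size of the data: number of rows, number of columns, and column types. Are you appending rows, or just columns?
2. What will typical operations look like? For example: query on columns to select a set of rows and specific columns, then run an operation (in memory), create new columns, and save them. (A toy example would let us offer more specific recommendations.)
3. After that processing, what do you do? Is step 2 ad hoc, or repeatable?
4. Input flat files: rough total size in GB. How are they organized, e.g. by record? Does each file contain different fields, or does each file hold some records with all of the fields?
5. Do you ever select subsets of rows (records) based on criteria (e.g. rows with field A > 5) and then do something, or do you just select fields A, B, C for all records (and then do something)?
6. Do you "work on" all of your columns (in groups), or is there a good proportion that you only use for reports (e.g. you want to keep the data around, but don't need to pull those columns in explicitly until final results time)?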

Solution

Make sure you have pandas 0.10.1 or later installed.

Read the docs on iterating over files chunk by chunk and on multiple-table queries.

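Since pytables is optimized to operate row-wise (which is what you query on), we will create one table per group of fields. That way it is easy to select a small group of fields. (This would also work with one big table, but it is more efficient this way. I may be able to lift this limitation in the future, and this layout is more intuitive anyway.)
The following is pseudocode: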

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
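
To make the mapping concrete, here is a minimal runnable sketch with the elided entries replaced by made-up ones. The field names field_3 and field_11 and the helper groups_for are hypothetical, introduced only for illustration; your own field list goes in their place.

```python
# Hypothetical field names, chosen only for illustration.
group_map = {
    "A": {"fields": ["field_1", "field_2", "field_3"], "dc": ["field_1"]},
    "B": {"fields": ["field_10", "field_11"], "dc": ["field_10"]},
    "REPORTING_ONLY": {"fields": ["field_1000", "field_1001"], "dc": []},
}

# Invert the map: field name -> name of the group that stores it.
group_map_inverted = {
    field: group
    for group, spec in group_map.items()
    for field in spec["fields"]
}


def groups_for(fields):
    """Return the groups needed to cover `fields`, in first-seen order."""
    seen = []
    for field in fields:
        group = group_map_inverted[field]
        if group not in seen:
            seen.append(group)
    return seen


print(groups_for(["field_2", "field_1000", "field_1"]))
```

The inverted map is what lets you ask for fields by name later without remembering which table each one lives in.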

Read in the files and create the storage (this essentially does what append_to_multiple does):

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])
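
The splitting step in the loop above can be tried without any HDF5 file. This is only a sketch under assumptions: the tab-separated text, the two-group mapping, and the helper split_by_group are all made up for illustration, and pd.read_csv with sep="\t" stands in for pd.read_table.

```python
import io

import pandas as pd

# Hypothetical mapping and input data, for illustration only.
group_map = {
    "A": {"fields": ["field_1", "field_2"], "dc": ["field_1"]},
    "B": {"fields": ["field_10"], "dc": ["field_10"]},
}
raw = "field_1\tfield_2\tfield_10\n1\t2.5\tx\n-3\t4.0\ty\n5\t6.5\tz\n"


def split_by_group(chunk, group_map):
    """Return {group name: sub-frame holding only that group's fields}."""
    return {g: chunk.reindex(columns=v["fields"]) for g, v in group_map.items()}


rows_per_group = {g: 0 for g in group_map}
for chunk in pd.read_csv(io.StringIO(raw), sep="\t", chunksize=2):
    for g, frame in split_by_group(chunk, group_map).items():
        # In the real workflow this is where store.append(g, frame, ...) goes.
        rows_per_group[g] += len(frame)

print(rows_per_group)
```

Every group receives every row, so the tables stay aligned on the same row index, which is what later multi-table selects rely on.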

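Now you have all of the tables in the file. (You could store them in separate files if you wish; you would probably have to add the file name to the group_map, but this is most likely unnecessary.)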

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group. You probably want to
# limit the columns in this new_group to only the NEW ones
# (e.g. so you don't overlap with the other tables).
# Add this info to the group_map.
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)
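
Here is a small end-to-end sketch of that select, compute, append cycle. It assumes PyTables is installed; the file name, the data, the derived column field_ratio, and the group name A_DERIVED are all hypothetical, and the assign call is just a stand-in for cool_function_on_frame.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data standing in for one group ("A") already on disk.
path = os.path.join(tempfile.mkdtemp(), "mystore.h5")
frame_a = pd.DataFrame(
    {"field_1": np.arange(6.0) - 2.0, "field_2": np.arange(6.0) * 10.0}
)

with pd.HDFStore(path, mode="w") as store:
    store.append("A", frame_a, index=False, data_columns=["field_1"])

    # Pull only the rows and columns needed for the calculation.
    frame = store.select("A", where="field_1 > 0", columns=["field_1", "field_2"])

    # Stand-in for cool_function_on_frame: derive one new column.
    new_frame = frame.assign(field_ratio=frame["field_2"] / frame["field_1"])

    # Store only the NEW column in its own group; the row index ties it back to "A".
    new_columns_created = ["field_ratio"]
    store.append(
        "A_DERIVED",
        new_frame[new_columns_created],
        data_columns=new_columns_created,
    )

    derived = store.select("A_DERIVED")

print(derived)
```

Note that the derived table only holds the rows that passed the where clause, so it is not row-aligned with "A" unless you compute the new column for every row.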

When you are ready for post-processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = groups_1)
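
A runnable sketch of a multi-table select, assuming PyTables is installed. The data and file name are hypothetical, and the reporting column is numeric here to keep the example simple. The where clause is evaluated against the selector table, so in this sketch it only refers to a column stored in "A".

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "report_demo.h5")
df = pd.DataFrame(
    {
        "field_1": [-1.0, 2.0, 3.0, 4.0],
        "field_2": [10.0, 20.0, 30.0, 40.0],
        "field_1000": [100.0, 200.0, 300.0, 400.0],
    }
)

with pd.HDFStore(path, mode="w") as store:
    # Split the frame across two tables that share the same row index.
    # None means "all remaining columns"; the selector's columns become data columns.
    store.append_to_multiple(
        {"A": ["field_1", "field_2"], "REPORTING_ONLY": None},
        df,
        selector="A",
    )

    # Filter rows on the selector table, then pull matching rows from both tables.
    report_data = store.select_as_multiple(
        ["A", "REPORTING_ONLY"],
        where="field_1 > 0",
        selector="A",
    )

print(report_data)
```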

About data_columns: you don't strictly need to define ANY data_columns. What they give you is the ability to sub-select rows based on that column's values, e.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

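Data columns are likely to be most useful to you in the final report-generation stage. (A data column is stored separately from the other columns, so defining a lot of them may cost some efficiency.)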
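
As a small illustration of row sub-selection on data columns, assuming PyTables is installed: the data, the file name, and the extra column field_1002 are hypothetical. Conditions passed as a list are combined with AND.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "dc_demo.h5")
df = pd.DataFrame(
    {
        "field_1000": [1.0, -1.0, 2.0, 3.0],
        "field_1001": [5.0, 6.0, -7.0, 8.0],
        "field_1002": [0.1, 0.2, 0.3, 0.4],  # stored, but not declared as a data column
    }
)

with pd.HDFStore(path, mode="w") as store:
    # Only the columns named in data_columns can appear in a where clause.
    store.append("REPORTING_ONLY", df, data_columns=["field_1000", "field_1001"])

    subset = store.select(
        "REPORTING_ONLY",
        where=["field_1000 > 0", "field_1001 > 0"],
    )

print(subset)
```

The non-data column still comes back in the result; it just cannot be used to filter rows on disk.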

You might also want to:

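Create a function that takes a list of fields, looks up their groups in the group_map, selects them, and concatenates the results into a single frame (this is essentially what select_as_multiple does). That way the structure stays fairly transparent to you.
Create indexes on certain data columns (this makes row subsetting much faster).
Enable compression.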
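
The three suggestions above can be sketched together as follows, assuming PyTables is installed. The helper select_fields, the two-group layout, the data, and the choice of zlib at level 9 are all assumptions made for illustration, not a recommendation for your data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical layout: two groups sharing one row index.
group_map = {
    "A": {"fields": ["field_1", "field_2"], "dc": ["field_1"]},
    "B": {"fields": ["field_10"], "dc": ["field_10"]},
}
group_map_inverted = {f: g for g, v in group_map.items() for f in v["fields"]}


def select_fields(store, fields):
    """Select `fields` from whichever groups hold them and join them column-wise."""
    wanted = {}
    for f in fields:
        wanted.setdefault(group_map_inverted[f], []).append(f)
    pieces = [store.select(g, columns=cols) for g, cols in wanted.items()]
    return pd.concat(pieces, axis=1)[list(fields)]


df = pd.DataFrame(
    {"field_1": [1.0, -2.0, 3.0], "field_2": [0.5, 1.5, 2.5], "field_10": [7.0, 8.0, 9.0]}
)
path = os.path.join(tempfile.mkdtemp(), "indexed.h5")

# complevel/complib turn on compression for everything written to this store.
with pd.HDFStore(path, mode="w", complevel=9, complib="zlib") as store:
    for g, v in group_map.items():
        store.append(g, df[v["fields"]], index=False, data_columns=v["dc"])

    # Build the indexes once, after all appends are done.
    for g, v in group_map.items():
        store.create_table_index(g, columns=v["dc"], optlevel=9, kind="full")

    result = select_fields(store, ["field_10", "field_2"])

print(result)
```

Deferring index creation until after the bulk load (index=False on append, then create_table_index) avoids rebuilding the index on every chunk.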

Let me know if you have any questions!