PyTables provides a file-based database format:
In [98]: filename = path + ‘tab.h5’
h5 = tb.open_file(filename, ‘w’)
For our example case, we generate a table with 2,000,000 rows of data: In [99]: rows = 2000000
The table itself has a datetime column, two int columns, and two float columns:
In [100]: row_des = {
‘Date’: tb.StringCol(26, pos=1), ‘No1’: tb.IntCol(pos=2),
‘No2’: tb.IntCol(pos=3), ‘No3’: tb.Float64Col(pos=4), ‘No4’: tb.Float64Col(pos=5) }
When creating the table, we choose no compression. A later example will add compression as well:
In [101]: filters = tb.Filters(complevel=0) # no compression
tab = h5.create_table(‘/’, ‘ints_floats’, row_des, title=‘Integers and Floats’,
expectedrows=rows, filters=filters)
In [102]: tab
Out[102]: /ints_floats (Table(0,)) ‘Integers and Floats’ description := {
“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),
“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’
chunkshape := (2621,)
In [103]: pointer = tab.row Now we generate the sample data:
In [104]: ran_int = np.random.randint(0, 10000, size=(rows, 2)) ran_flo = np.random.standard_normal((rows, 2)).round(5)
In [105]: %%time
for i in range(rows):
pointer[‘Date’] = dt.datetime.now() pointer[‘No1’] = ran_int[i, 0] pointer[‘No2’] = ran_int[i, 1] pointer[‘No3’] = ran_flo[i, 0] pointer[‘No4’] = ran_flo[i, 1] pointer.append()
# this appends the data and
# moves the pointer one row forward
tab.flush()
Out[105]: CPU times: user 15.7 s, sys: 3.53 s, total: 19.2 s Wall time: 19.4 s
Always remember to commit your changes. What the commit method is for the SQLite3
database, the flush method is for PyTables. We can now inspect the data on disk, first
logically via our Table object and second physically via the file information:
In [106]: tab
Out[106]: /ints_floats (Table(2000000,)) ‘Integers and Floats’ description := {
“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),
“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’
chunkshape := (2621,)
In [107]: ll $path*
Out[107]: -rw-r—r— 1 root 100156256 28. Sep 15:18 /flash/data/tab.h5
There is a more performant and Pythonic way to accomplish the same result, by the use of NumPy structured arrays:
In [108]: dty = np.dtype([(‘Date’, ‘S26’), (‘No1’, ‘<i4’), (‘No2’, ‘<i4’), (‘No3’, ‘<f8’), (‘No4’, ‘<f8’)]) sarray = np.zeros(len(ran_int), dtype=dty)
In [109]: sarray
Out[109]: array([(”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0),
…, (”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0)],
dtype=[(‘Date’, ‘S26’), (‘No1’, ‘<i4’), (‘No2’, ‘<i4’), (‘No3’, ‘<f8’), (‘No4’, ‘<f8’)])
In [110]: %%time
sarray[‘Date’] = dt.datetime.now() sarray[‘No1’] = ran_int[:, 0] sarray[‘No2’] = ran_int[:, 1] sarray[‘No3’] = ran_flo[:, 0] sarray[‘No4’] = ran_flo[:, 1]
Out[110]: CPU times: user 113 ms, sys: 18 ms, total: 131 ms Wall time: 131 ms
Equipped with the complete data set now stored in the structured array, the creation of the table boils down to the following line of code. Note that the row description is not needed anymore; PyTables uses the NumPydtype instead:
In [111]: %%time
h5.create_table(‘/’, ‘ints_floats_from_array’, sarray, title=‘Integers and Floats’,
expectedrows=rows, filters=filters) Out[111]: CPU times: user 38 ms, sys: 117 ms, total: 155 ms
Wall time: 154 ms
/ints_floats_from_array (Table(2000000,)) ‘Integers and Floats’ description := {
“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),
“No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’
chunkshape := (2621,)
Being an order of magnitude faster than the previous approach, this approach achieves the same result and also needs less code:
In [112]: h5
Out[112]: File(filename=/flash/data/tab.h5, title=u”, mode=‘w’, root_uep=’/’, filters=Filters(complevel=0, shuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) u”
/ints_floats (Table(2000000,)) ‘Integers and Floats’ description := {
“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),
“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’
chunkshape := (2621,)
/ints_floats_from_array (Table(2000000,)) ‘Integers and Floats’ description := {
“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),
“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’
chunkshape := (2621,)
We can now delete the duplicate table, since it is no longer needed: In [113]: h5.remove_node(‘/’, ‘ints_floats_from_array’)
The Table object behaves like typical Python and NumPy objects when it comes to slicing,
for example:
In [114]: tab[:3]
Out[114]: array([(‘2014-09-28 15:17:57.631234’, 4342, 1672, -0.9293, 0.06343), (‘2014-09-28 15:17:57.631368’, 3839, 1563, -2.02808, 0.3964), (‘2014-09-28 15:17:57.631383’, 5100, 1326, 0.03401, 0.46742)], dtype=[(‘Date’, ‘S26’), (‘No1’, ‘<i4’), (‘No2’, ‘<i4’), (‘No3’, ‘<f8’), (‘No4’, ‘<f8’)])
Similarly, we can select single columns only: In [115]: tab[:4][‘No4’]
Out[115]: array([ 0.06343, 0.3964 , 0.46742, -0.56959])
Even more convenient and important: we can apply NumPy universal functions to tables or
subsets of the table:
In [116]: %time np.sum(tab[:][‘No3’])
Out[116]: CPU times: user 31 ms, sys: 58 ms, total: 89 ms Wall time: 88.3 ms
-115.34513999999896
In [117]: %time np.sum(np.sqrt(tab[:][‘No1’]))
Out[117]: CPU times: user 53 ms, sys: 48 ms, total: 101 ms Wall time: 101 ms
133360523.08794475
When it comes to plotting, the Table object also behaves very similarly to an ndarray
object (cf. Figure 7-5): In [118]: %%time
print len(tab[:][‘No3’]) Out[118]: 2000000
CPU times: user 396 ms, sys: 89 ms, total: 485 ms Wall time: 485 ms
Figure 7-5. Histogram of data
And, of course, we have rather flexible tools to query data via typical SQL-like statements,
as in the following example (the result of which is neatly illustrated in Figure 7-6; compare it with Figure 7-2, based on a pandas query):
In [119]: %%time
res = np.array([(row[‘No3’], row[‘No4’]) for row in
tab.where(‘((No3 < -0.5) | (No3 > 0.5)) \ & ((No4 < -1) | (No4 > 1))’)])[::100] Out[119]: CPU times: user 530 ms, sys: 52 ms, total: 582 ms Wall time: 469 ms
In [120]: plt.plot(res.T[0], res.T[1], ‘ro’) plt.grid(True)
Figure 7-6. Scatter plot of query result
FAST COMPLEX QUERIES
Both pandas and PyTables are able to process complex, SQL-like queries and selections. They are both optimized
for speed when it comes to such operations.
As the following examples show, working with data stored in PyTables as a Table object
makes you feel like you are working with NumPy and in-memory, both from a syntax and a
performance point of view: In [121]: %%time
print “Max %18.3f” % values.max() print “Ave %18.3f” % values.mean() print “Min %18.3f” % values.min() print “Std %18.3f” % values.std() Out[121]: Max 5.152
Ave -0.000 Min -5.537 Std 1.000
CPU times: user 44 ms, sys: 39 ms, total: 83 ms Wall time: 82.6 ms
In [122]: %%time
results = [(row[‘No1’], row[‘No2’]) for row in
tab.where(‘((No1 > 9800) | (No1 < 200)) \ & ((No2 > 4500) & (No2 < 5500))’)] for res in results[:4]:
print res
Out[122]: (9987, 4965) (9934, 5263) (9960, 4729) (130, 5023)
CPU times: user 167 ms, sys: 37 ms, total: 204 ms Wall time: 118 ms
In [123]: %%time
results = [(row[‘No1’], row[‘No2’]) for row in
tab.where(‘(No1 == 1234) & (No2 > 9776)’)] for res in results:
print res
Out[123]: (1234, 9805) (1234, 9785) (1234, 9821)
CPU times: user 93 ms, sys: 40 ms, total: 133 ms Wall time: 90.1 ms