• No results found

PyTables provides a file-based database format:

In [98]: filename = path + ‘tab.h5’

h5 = tb.open_file(filename, ‘w’)

For our example case, we generate a table with 2,000,000 rows of data: In [99]: rows = 2000000

The table itself has a datetime column, two int columns, and two float columns:

In [100]: row_des = {

‘Date’: tb.StringCol(26, pos=1), ‘No1’: tb.IntCol(pos=2),

‘No2’: tb.IntCol(pos=3), ‘No3’: tb.Float64Col(pos=4), ‘No4’: tb.Float64Col(pos=5) }

When creating the table, we choose no compression. A later example will add compression as well:

In [101]: filters = tb.Filters(complevel=0) # no compression

tab = h5.create_table(‘/’, ‘ints_floats’, row_des, title=‘Integers and Floats’,

expectedrows=rows, filters=filters)

In [102]: tab

Out[102]: /ints_floats (Table(0,)) ‘Integers and Floats’ description := {

“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),

“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’

chunkshape := (2621,)

In [103]: pointer = tab.row Now we generate the sample data:

In [104]: ran_int = np.random.randint(0, 10000, size=(rows, 2)) ran_flo = np.random.standard_normal((rows, 2)).round(5)

In [105]: %%time

for i in range(rows):

pointer[‘Date’] = dt.datetime.now() pointer[‘No1’] = ran_int[i, 0] pointer[‘No2’] = ran_int[i, 1] pointer[‘No3’] = ran_flo[i, 0] pointer[‘No4’] = ran_flo[i, 1] pointer.append()

# this appends the data and

# moves the pointer one row forward

tab.flush()

Out[105]: CPU times: user 15.7 s, sys: 3.53 s, total: 19.2 s Wall time: 19.4 s

Always remember to commit your changes. What the commit method is for the SQLite3

database, the flush method is for PyTables. We can now inspect the data on disk, first

logically via our Table object and second physically via the file information:

In [106]: tab

Out[106]: /ints_floats (Table(2000000,)) ‘Integers and Floats’ description := {

“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),

“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’

chunkshape := (2621,)

In [107]: ll $path*

Out[107]: -rw-r—r— 1 root 100156256 28. Sep 15:18 /flash/data/tab.h5

There is a more performant and Pythonic way to accomplish the same result, by the use of NumPy structured arrays:

In [108]: dty = np.dtype([(‘Date’, ‘S26’), (‘No1’, ‘<i4’), (‘No2’, ‘<i4’), (‘No3’, ‘<f8’), (‘No4’, ‘<f8’)]) sarray = np.zeros(len(ran_int), dtype=dty)

In [109]: sarray

Out[109]: array([(”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0),

…, (”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0), (”, 0, 0, 0.0, 0.0)],

dtype=[(‘Date’, ‘S26’), (‘No1’, ‘<i4’), (‘No2’, ‘<i4’), (‘No3’, ‘<f8’), (‘No4’, ‘<f8’)])

In [110]: %%time

sarray[‘Date’] = dt.datetime.now() sarray[‘No1’] = ran_int[:, 0] sarray[‘No2’] = ran_int[:, 1] sarray[‘No3’] = ran_flo[:, 0] sarray[‘No4’] = ran_flo[:, 1]

Out[110]: CPU times: user 113 ms, sys: 18 ms, total: 131 ms Wall time: 131 ms

Equipped with the complete data set now stored in the structured array, the creation of the table boils down to the following line of code. Note that the row description is not needed anymore; PyTables uses the NumPydtype instead:

In [111]: %%time

h5.create_table(‘/’, ‘ints_floats_from_array’, sarray, title=‘Integers and Floats’,

expectedrows=rows, filters=filters) Out[111]: CPU times: user 38 ms, sys: 117 ms, total: 155 ms

Wall time: 154 ms

/ints_floats_from_array (Table(2000000,)) ‘Integers and Floats’ description := {

“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),

“No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’

chunkshape := (2621,)

Being an order of magnitude faster than the previous approach, this approach achieves the same result and also needs less code:

In [112]: h5

Out[112]: File(filename=/flash/data/tab.h5, title=u”, mode=‘w’, root_uep=’/’, filters=Filters(complevel=0, shuffle=False, fletcher32=False, least_significant_digit=None))

/ (RootGroup) u”

/ints_floats (Table(2000000,)) ‘Integers and Floats’ description := {

“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),

“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’

chunkshape := (2621,)

/ints_floats_from_array (Table(2000000,)) ‘Integers and Floats’ description := {

“Date”: StringCol(itemsize=26, shape=(), dflt=”, pos=0), “No1”: Int32Col(shape=(), dflt=0, pos=1),

“No2”: Int32Col(shape=(), dflt=0, pos=2), “No3”: Float64Col(shape=(), dflt=0.0, pos=3), “No4”: Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := ‘little’

chunkshape := (2621,)

We can now delete the duplicate table, since it is no longer needed: In [113]: h5.remove_node(‘/’, ‘ints_floats_from_array’)

The Table object behaves like typical Python and NumPy objects when it comes to slicing,

for example:

In [114]: tab[:3]

Out[114]: array([(‘2014-09-28 15:17:57.631234’, 4342, 1672, -0.9293, 0.06343), (‘2014-09-28 15:17:57.631368’, 3839, 1563, -2.02808, 0.3964), (‘2014-09-28 15:17:57.631383’, 5100, 1326, 0.03401, 0.46742)], dtype=[(‘Date’, ‘S26’), (‘No1’, ‘<i4’), (‘No2’, ‘<i4’), (‘No3’, ‘<f8’), (‘No4’, ‘<f8’)])

Similarly, we can select single columns only: In [115]: tab[:4][‘No4’]

Out[115]: array([ 0.06343, 0.3964 , 0.46742, -0.56959])

Even more convenient and important: we can apply NumPy universal functions to tables or

subsets of the table:

In [116]: %time np.sum(tab[:][‘No3’])

Out[116]: CPU times: user 31 ms, sys: 58 ms, total: 89 ms Wall time: 88.3 ms

-115.34513999999896

In [117]: %time np.sum(np.sqrt(tab[:][‘No1’]))

Out[117]: CPU times: user 53 ms, sys: 48 ms, total: 101 ms Wall time: 101 ms

133360523.08794475

When it comes to plotting, the Table object also behaves very similarly to an ndarray

object (cf. Figure 7-5): In [118]: %%time

print len(tab[:][‘No3’]) Out[118]: 2000000

CPU times: user 396 ms, sys: 89 ms, total: 485 ms Wall time: 485 ms

Figure 7-5. Histogram of data

And, of course, we have rather flexible tools to query data via typical SQL-like statements,

as in the following example (the result of which is neatly illustrated in Figure 7-6; compare it with Figure 7-2, based on a pandas query):

In [119]: %%time

res = np.array([(row[‘No3’], row[‘No4’]) for row in

tab.where(‘((No3 < -0.5) | (No3 > 0.5)) \ & ((No4 < -1) | (No4 > 1))’)])[::100] Out[119]: CPU times: user 530 ms, sys: 52 ms, total: 582 ms Wall time: 469 ms

In [120]: plt.plot(res.T[0], res.T[1], ‘ro’) plt.grid(True)

Figure 7-6. Scatter plot of query result

FAST COMPLEX QUERIES

Both pandas and PyTables are able to process complex, SQL-like queries and selections. They are both optimized

for speed when it comes to such operations.

As the following examples show, working with data stored in PyTables as a Table object

makes you feel like you are working with NumPy and in-memory, both from a syntax and a

performance point of view: In [121]: %%time

print “Max %18.3f” % values.max() print “Ave %18.3f” % values.mean() print “Min %18.3f” % values.min() print “Std %18.3f” % values.std() Out[121]: Max 5.152

Ave -0.000 Min -5.537 Std 1.000

CPU times: user 44 ms, sys: 39 ms, total: 83 ms Wall time: 82.6 ms

In [122]: %%time

results = [(row[‘No1’], row[‘No2’]) for row in

tab.where(‘((No1 > 9800) | (No1 < 200)) \ & ((No2 > 4500) & (No2 < 5500))’)] for res in results[:4]:

print res

Out[122]: (9987, 4965) (9934, 5263) (9960, 4729) (130, 5023)

CPU times: user 167 ms, sys: 37 ms, total: 204 ms Wall time: 118 ms

In [123]: %%time

results = [(row[‘No1’], row[‘No2’]) for row in

tab.where(‘(No1 == 1234) & (No2 > 9776)’)] for res in results:

print res

Out[123]: (1234, 9805) (1234, 9785) (1234, 9821)

CPU times: user 93 ms, sys: 40 ms, total: 133 ms Wall time: 90.1 ms