
Exporting plots is simple using savefig('filename.ext'), where ext determines the type of exported file to produce. ext can be one of png, pdf, ps, eps or svg.

>>> plot(randn(10,2))
>>> savefig('figure.pdf') # PDF export
>>> savefig('figure.png') # PNG export
>>> savefig('figure.svg') # Scalable Vector Graphics export

savefig has a number of useful keyword arguments. In particular, dpi is useful when exporting png files. The default dpi is 100.

>>> plot(randn(10,2))
>>> savefig('figure.png', dpi = 600) # High resolution PNG export
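The effect of dpi can be checked directly from the exported file. The following is a minimal sketch, assuming matplotlib's non-interactive Agg backend, a 6 by 4 inch figure and a temporary output directory, none of which come from the text above. The pixel dimensions of a PNG should equal the figure size in inches multiplied by dpi.

```python
import os
import struct
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots(figsize=(6, 4))  # figure size in inches (assumed)
ax.plot(rng.standard_normal((10, 2)))

out_dir = tempfile.mkdtemp()  # hypothetical output location
png_sizes = {}
for dpi in (100, 600):
    path = os.path.join(out_dir, f"figure_{dpi}.png")
    fig.savefig(path, dpi=dpi)
    # A PNG file stores its width and height as big-endian
    # 4-byte integers in bytes 16 to 23 of the file
    with open(path, "rb") as f:
        header = f.read(24)
    png_sizes[dpi] = struct.unpack(">II", header[16:24])
    print(dpi, png_sizes[dpi])
plt.close(fig)
```

Vector formats such as pdf and svg do not have a fixed pixel grid, which is why dpi matters mostly for png output.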

15.7 Exercises

1. Download data for the past 20 years for the S&P 500 from Yahoo!. Plot the price against dates, and ensure the date display is reasonable.

2. Compute Friday-to-Friday returns using the log difference of closing prices and produce a histogram. Experiment with the number of bins.

3. Produce a pie chart showing the percentage of weekly returns that fall in each of the following ranges:

(a) r ≤ −2% (b) −2% < r ≤ 0% (c) 0% < r ≤ 2% (d) r > 2%

4. Download 20 years of FTSE data, and compute Friday-to-Friday returns. Produce a scatter plot of the FTSE returns against the S&P 500 returns. Be sure to label the axes and provide a title.

5. Repeat exercise 4, but add the fitted line from an OLS regression of the FTSE returns on the S&P 500 returns plus a constant.

6. Compute the EWMA variance for both the S&P 500 and FTSE and plot against dates. An EWMA variance has

$\sigma_t^2 = (1-\lambda) r_{t-1}^2 + \lambda \sigma_{t-1}^2$

where $r_0^2 = \sigma_0^2$ is the full sample variance and $\lambda = 0.97$.
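The recursion in exercise 6 can be sketched on simulated data before it is applied to real returns. This is an illustration only, assuming the usual EWMA form in which the lagged variance carries weight λ and the squared lagged return carries weight 1 − λ. The function name ewma_variance and the simulated series sim_returns are hypothetical and are not part of the exercise.

```python
import numpy as np


def ewma_variance(returns, lam=0.97):
    """Return the EWMA variance path for a 1-d array of returns."""
    r = np.asarray(returns, dtype=float)
    sigma2 = np.empty(r.shape[0])
    # Initial condition: r_0^2 and sigma_0^2 both equal the full-sample
    # variance, so the first value is the full-sample variance itself
    sigma2[0] = r.var()
    for t in range(1, r.shape[0]):
        sigma2[t] = (1 - lam) * r[t - 1] ** 2 + lam * sigma2[t - 1]
    return sigma2


rng = np.random.default_rng(0)
sim_returns = 0.01 * rng.standard_normal(500)  # placeholder data, not market data
sigma2 = ewma_variance(sim_returns)
print(sigma2[:3])
```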

Chapter 16

Structured Arrays

pandas, the topic of Chapter 17, has substantially augmented the structured arrays provided by NumPy. The pandas Series and DataFrame types are the preferred method to handle heterogeneous data and/or data sets which have useful metadata. This chapter has been retained since the NumPy data structures may be encountered when using some functions, or in legacy code produced by others.

The standard, homogeneous NumPy array is a highly optimized data structure where all elements have the same data type (e.g. float) and can be accessed using slicing in many dimensions. These data structures are essential for high-performance numerical computing, especially for linear algebra. Unfortunately, actual data is often heterogeneous (e.g. mixtures of dates, strings and numbers) and it is useful to track series by meaningful names, not just “column 0”. These features are not available in a homogeneous NumPy array. However, NumPy also supports mixed arrays which solve both of these issues and so are useful data structures for managing data prior to statistical analysis. Conceptually, a mixed array with named columns is similar to a spreadsheet where each column can have its own name and data type.

16.1 Mixed Arrays with Column Names

A mixed NumPy array can be initialized using array, zeros or other functions which create arrays and allow the data type to be directly specified. Mixed arrays are in many ways similar to standard NumPy arrays, except that the dtype input to the function is specified either using a list of tuples of the form (name,type), or using a dictionary of the form {'names':names,'formats':formats} where names is a tuple of column names and formats is a tuple of NumPy data types.

>>> x = zeros(4,[('date','int'),('ret','float')])
>>> x = zeros(4,{'names': ('date','ret'), 'formats': ('int', 'float')})
>>> x
array([(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('date', '<i4'), ('ret', '<f8')])

These two commands are equivalent, and illustrate the two methods to create an array which contains a named column “date”, for integer data, and a named column “ret”, for floats. Named columns allow for access using dictionary-type syntax.

>>> x['date']
array([0, 0, 0, 0])
>>> x['ret']
array([0.0, 0.0, 0.0, 0.0])
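Named columns can also be assigned to, and each one behaves like an ordinary homogeneous array. The short sketch below uses made-up dates and returns purely for illustration, and explicit 8-byte types so that the result does not depend on the platform.

```python
import numpy as np

x = np.zeros(4, dtype=[('date', 'i8'), ('ret', 'f8')])

# Assigning to a named column fills that field in every record
x['date'] = [20120201, 20120202, 20120203, 20120206]  # illustrative values
x['ret'] = [0.011, -0.004, 0.007, -0.002]             # illustrative values

# A named column supports the usual array methods
mean_ret = x['ret'].mean()
print(x.dtype.names)
print(mean_ret)

# Selecting a single column returns a view, so in-place changes
# to the column are reflected in the original array
ret = x['ret']
ret *= 100  # convert to percent
print(x['ret'])
```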

Standard multidimensional slice notation is not available since heterogeneous arrays behave like nested lists and not homogeneous NumPy arrays.

>>> x[0] # Data tuple 0
(0, 0.0)
>>> x[:3] # Data tuples 0, 1 and 2
array([(0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('date', '<i4'), ('ret', '<f8')])
>>> x[:,1] # Error
IndexError: too many indices

The first two commands show that the array is composed of tuples and so differs from standard homogeneous NumPy arrays. The error in the third command occurs since columns are accessed using names and not multidimensional slices.
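Since a mixed array is one-dimensional, whole records are selected with a single index, a slice or a boolean mask, and fields are then selected by name. The following sketch, using illustrative values only, shows record selection, sorting on a named column with np.sort and its order argument, and the error raised by a two-dimensional index.

```python
import numpy as np

x = np.zeros(4, dtype=[('date', 'i8'), ('ret', 'f8')])
x['date'] = [20120201, 20120202, 20120203, 20120206]  # illustrative values
x['ret'] = [0.011, -0.004, 0.007, -0.002]             # illustrative values

# A single record, with fields accessed by name
first = x[0]
print(first['date'], first['ret'])

# Records with a positive return, selected using a boolean mask
positive = x[x['ret'] > 0]
print(positive['date'])

# Records sorted by the value in the named column 'ret'
by_ret = np.sort(x, order='ret')
print(by_ret['date'])

# The array has only one dimension, so a column index is an error
try:
    x[:, 1]
    raised = False
except IndexError:
    raised = True
print(x.ndim, raised)
```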

16.1.1 Data Types

A large number of primitive data types are available in NumPy.

Type               Syntax         Description
Boolean            b1             True/False
Integers           i1,i2,i4,i8    1 to 8 byte signed integers ($-2^{B-1}, \ldots, 2^{B-1}-1$)
Unsigned Integers  u1,u2,u4,u8    1 to 8 byte unsigned integers ($0, \ldots, 2^{B}-1$)
Floating Point     f4,f8          Single (4) and double (8) precision float
Complex            c8,c16         Single (8) and double (16) precision complex
Object             On             Generic n-byte object
String             Sn,an          n-letter string
Unicode String     Un             n-letter unicode string

In the integer ranges, B is the number of bits (8 times the number of bytes).

The majority of data types are for numeric data, and are simple to understand. The n in the string data type indicates the maximum length of a string. Attempting to insert a string with more than n characters will truncate the string. The object data type is somewhat abstract, but allows for storing Python objects such as datetimes.
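The ranges and the truncation behavior described above can be verified interactively. This is a small sketch: np.iinfo reports the limits of an integer type, and the strings used are arbitrary examples.

```python
import numpy as np

# Size in bytes and representable range for a few integer codes
int_ranges = {}
for code in ('i1', 'i2', 'u1', 'u2'):
    info = np.iinfo(np.dtype(code))
    int_ranges[code] = (np.dtype(code).itemsize, info.min, info.max)
    print(code, int_ranges[code])

# Byte strings longer than n characters are silently truncated
s = np.zeros(2, dtype='S3')
s[0] = b'abcdef'
s[1] = b'xy'
print(s)

# The same holds for unicode strings
u = np.zeros(1, dtype='U3')
u[0] = 'abcdef'
print(u)
```

Because truncation raises no error, it is worth choosing n from the longest string expected in the data.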

Custom data types can be built using dtype. The constructed data type can then be used in the construction of a mixed array.

>>> t = dtype([('var1','f8'), ('var2','i8'), ('var3','u8')]) # t avoids shadowing the built-in type
>>> t
dtype([('var1', '<f8'), ('var2', '<i8'), ('var3', '<u8')])
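A constructed data type can be inspected before it is used. The sketch below shows a few attributes of a dtype (names, itemsize and fields) and then uses the type to build a mixed array. The variable names rec_type and y are chosen only for this illustration.

```python
import numpy as np

rec_type = np.dtype([('var1', 'f8'), ('var2', 'i8'), ('var3', 'u8')])

print(rec_type.names)           # column names, in order
print(rec_type.itemsize)        # bytes used by one record
print(rec_type['var2'])         # data type of a single column
print(rec_type.fields['var3'])  # (data type, byte offset within a record)

# Use the constructed data type to create a mixed array
y = np.zeros(3, dtype=rec_type)
y[0] = (1.5, -2, 3)
print(y)
```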

Data types can even be nested to create a structured environment where one of the “variables” has multiple values. Consider this example which uses a nested data type to contain the bid and ask price of a stock, along with the time of the transaction.

>>> ba_type = dtype([('bid','f8'), ('ask','f8')])
>>> t = dtype([('date', 'O8'), ('prices', ba_type)])
>>> data = zeros(2,t)
>>> data
array([(0, (0.0, 0.0)), (0, (0.0, 0.0))],
      dtype=[('date', 'O'), ('prices', [('bid', '<f8'), ('ask', '<f8')])])
>>> data['prices']
array([(0.0, 0.0), (0.0, 0.0)],
      dtype=[('bid', '<f8'), ('ask', '<f8')])
>>> data['prices']['bid']
array([ 0.,  0.])

In this example, data is an array where each item has 2 elements, the date and the price. Price is also an array with 2 elements. Names can also be used to access values in nested arrays (e.g. data['prices']['bid'] returns an array containing all bid prices). In practice nested arrays can almost always be expressed as a non-nested array without loss of fidelity.
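The claim that a nested array can usually be flattened can be illustrated by storing the same quotes both ways. This is a sketch with invented quotes, and it assumes an integer time column in place of the object date so that the example stays purely numeric.

```python
import numpy as np

# Nested layout: 'prices' is itself a record with 'bid' and 'ask'
ba_type = np.dtype([('bid', 'f8'), ('ask', 'f8')])
nested = np.zeros(2, dtype=[('time', 'u4'), ('prices', ba_type)])
nested['time'] = [93000, 93001]             # illustrative values
nested['prices']['bid'] = [53.20, 53.21]
nested['prices']['ask'] = [53.22, 53.24]

# Flat layout: the same information with 'bid' and 'ask' as top-level columns
flat = np.zeros(2, dtype=[('time', 'u4'), ('bid', 'f8'), ('ask', 'f8')])
flat['time'] = nested['time']
flat['bid'] = nested['prices']['bid']
flat['ask'] = nested['prices']['ask']

# The bid-ask spread is identical under both layouts
spread_nested = nested['prices']['ask'] - nested['prices']['bid']
spread_flat = flat['ask'] - flat['bid']
print(spread_nested, spread_flat)
print(nested.dtype.itemsize, flat.dtype.itemsize)
```

Both layouts use the same number of bytes per record, so the choice is mainly about which access syntax is more convenient.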

Determining the size of object

NumPy arrays can store objects, which are anything that falls outside of the usual data types. One example of a useful, but abstract, data type is datetime. One method to determine the size of an object is to create a plain array containing the object, which will automatically determine the data type, and then to query the size from the array.

>>> import datetime as dt
>>> x = array([dt.datetime.now()])
>>> x.dtype.itemsize # The size in bytes
>>> x.dtype.descr # The name and description
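It may help to see what the reported size measures. In an object array each element is a reference to a Python object, so the item size is the size of a reference on the current platform, not the memory used by the object itself. The sketch below uses a fixed datetime and compares the item size with the platform pointer size obtained from struct.calcsize.

```python
import datetime as dt
import struct
import sys

import numpy as np

stamp = dt.datetime(2012, 2, 1, 12, 1, 39)
x = np.array([stamp])

print(x.dtype)           # the inferred data type is object
print(x.dtype.itemsize)  # bytes per element stored in the array

# Size of a pointer on this platform (8 on a 64-bit system)
pointer_size = struct.calcsize('P')
print(pointer_size)

# Memory used by the datetime object itself, which lives outside the array
print(sys.getsizeof(stamp))
```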

16.1.2 Example: TAQ Data

TAQ is the NYSE Trade and Quote database which contains all trades and quotes of US listed equities which trade on major US markets (not just the NYSE). A record from a trade contains a number of fields:

• Date - The Date in YYYYMMDD format stored as a 4-byte unsigned integer

• Time - Time in HHMMSS format, stored as a 4-byte unsigned integer

• Size - Number of shares traded, stored as a 4-byte unsigned integer

• G127 rule indicator - Numeric value, stored as a 2-byte unsigned integer

• Correction - Numeric indicator of a correction, stored as a 2-byte unsigned integer

• Condition - Market condition, a 2-character string

• Exchange - The exchange where the trade occurred, a 1-character string

>>> t = dtype([('date', 'u4'), ('time', 'u4'),
...            ('size', 'u4'), ('price', 'f8'),
...            ('g127', 'u2'), ('corr', 'u2'),
...            ('cond', 'S2'), ('ex', 'S1')]) # exchange is a 1-character string
>>> taqData = zeros(10, dtype=t)
>>> taqData[0] = (20120201,120139,1,53.21,0,0,'','N')
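Once trades are stored in a mixed array, named columns make simple queries straightforward. The following sketch uses three invented trades, with field widths that follow the list above (so the exchange is stored as 'S1'), and computes a volume-weighted average price for one exchange. The names taq_type, trades, nyse and vwap are illustrative only.

```python
import numpy as np

taq_type = np.dtype([('date', 'u4'), ('time', 'u4'),
                     ('size', 'u4'), ('price', 'f8'),
                     ('g127', 'u2'), ('corr', 'u2'),
                     ('cond', 'S2'), ('ex', 'S1')])

# Three made-up trades, for illustration only
trades = np.array([
    (20120201, 120139, 100, 53.21, 0, 0, b'', b'N'),
    (20120201, 120140, 300, 53.22, 0, 0, b'', b'T'),
    (20120201, 120141, 200, 53.20, 0, 0, b'', b'N'),
], dtype=taq_type)

print(taq_type.itemsize)  # bytes used by one trade record

# Select the trades from one exchange; string columns hold bytes
nyse = trades[trades['ex'] == b'N']

# Volume-weighted average price of the selected trades
vwap = np.sum(nyse['price'] * nyse['size']) / np.sum(nyse['size'])
print(len(nyse), vwap)
```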

An alternative is to store the date and time as a datetime, which is an 8-byte object.

>>> import datetime as dt
>>> t = dtype([('datetime', 'O8'), ('size', 'u4'), ('price', 'f8'),
...            ('g127', 'u2'), ('corr', 'u2'), ('cond', 'S2'), ('ex', 'S1')])
>>> taqData = zeros(10, dtype=t)
>>> taqData[0] = (dt.datetime(2012,2,1,12,1,39),1,53.21,0,0,'','N')
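Moving from the integer representation to the datetime representation requires splitting the YYYYMMDD and HHMMSS values into their components. The sketch below shows one way to do this. The helper to_datetime and the reduced set of columns are assumptions made for brevity, and the object column is declared as 'O' rather than 'O8' since the width of an object reference depends on the platform.

```python
import datetime as dt

import numpy as np


def to_datetime(date, time):
    """Combine YYYYMMDD and HHMMSS integers into a datetime (hypothetical helper)."""
    date, time = int(date), int(time)
    year, month_day = divmod(date, 10000)
    month, day = divmod(month_day, 100)
    hour, min_sec = divmod(time, 10000)
    minute, second = divmod(min_sec, 100)
    return dt.datetime(year, month, day, hour, minute, second)


# Two made-up records using the integer date and time format
raw = np.array([(20120201, 120139, 53.21), (20120201, 93000, 53.25)],
               dtype=[('date', 'u4'), ('time', 'u4'), ('price', 'f8')])

# Target array with a single object column holding datetimes
converted = np.zeros(raw.shape[0], dtype=[('datetime', 'O'), ('price', 'f8')])
converted['datetime'] = [to_datetime(d, t) for d, t in zip(raw['date'], raw['time'])]
converted['price'] = raw['price']
print(converted)
```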