Regression Analysis - Python for Finance Analyze Big Financial Data {TechieAce} pdf

The previous section introduces the leverage effect as a stylized fact of equity market returns. So far, the support that we provided is based on the inspection of financial data plots only. Using pandas, we can also base such analysis on a more formal, statistical

ground. The simplest approach is to use (linear) ordinary least-squares regression (OLS). In what follows, the analysis uses two different data sets available on the Web:

EURO STOXX 50

Historical daily closing values of the EURO STOXX 50 index, composed of European blue-chip stocks

VSTOXX

Historical daily closing data for the VSTOXX volatility index, calculated on the basis of volatilities implied by options on the EURO STOXX 50 index

It is noteworthy that we now (indirectly) use implied volatilities, which relate to

expectations with regard to the future volatility development, while the previous DAX analysis used historical volatility measures. For details, see the “VSTOXX Advanced Services” tutorial pages provided by Eurex.

We begin with a few imports: In [62]: import pandas as pd

from urllib import urlretrieve

For the analysis, we retrieve files from the Web and save them in a folder called data. If

there is no such folder already, you might want to create one first via mkdir data. We

proceed by retrieving the most current available information with regard to both indices: In [63]: es_url = ‘http://www.stoxx.com/download/historical_values/hbrbcpe.txt’

vs_url = ‘http://www.stoxx.com/download/historical_values/h_vstoxx.txt’

urlretrieve(es_url, ‘./data/es.txt’) urlretrieve(vs_url, ‘./data/vs.txt’) !ls -o ./data/*.txt

# Windows: use dir

Out[63]: -rw––- 1 yhilpisch 0 Sep 28 11:14 ./data/es50.txt -rw––- 1 yhilpisch 641180 Sep 28 11:14 ./data/es.txt -rw––- 1 yhilpisch 330564 Sep 28 11:14 ./data/vs.txt

Reading the EURO STOXX 50 data directly with pandas is not the best route in this case.

A little data cleaning beforehand will give a better data structure for the import. Two issues have to be addressed, relating to the header and the structure:

There are a couple of additional header lines that we do not need for the import. From December 27, 2001 onwards, the data set “suddenly” has an additional semicolon at the end of each data row.

The following code reads the whole data set and removes all blanks:[26] In [64]: lines = open(‘./data/es.txt’, ‘r’).readlines()

lines = [line.replace(‘ ‘, ”) for line in lines]

With regard to the header, we can inspect it easily by printing the first couple of lines of the downloaded data set:

In [65]: lines[:6] Out[65]: [‘PriceIndices-EUROCurrency\n’, ‘Date;Blue-Chip;Blue-Chip;Broad;Broad;ExUK;ExEuroZone;Blue-Chip;Broad\ n’, ‘;Europe;Euro-Zone;Europe;Euro-Zone;;;Nordic;Nordic\n’, ‘;SX5P;SX5E;SXXP;SXXE;SXXF;SXXA;DK5F;DKXF\n’, ‘31.12.1986;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’, ‘01.01.1987;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’]

The above-mentioned format change can be seen between lines 3,883 and 3,990 of the file. From December 27, there suddenly appears an additional semicolon at the end of each data row:

In [66]: for line in lines[3883:3890]: print line[41:], Out[66]: 317.10;267.23;5268.36;363.19 322.55;272.18;5360.52;370.94 322.69;272.95;5360.52;370.94 327.57;277.68;5479.59;378.69; 329.94;278.87;5585.35;386.99; 326.77;272.38;5522.25;380.09; 332.62;277.08;5722.57;396.12;

To make the data set easier to import, we do the following: 1. Generate a new text file.

2. Delete unneeded header lines.

3. Write an appropriate new header line to the new file.

4. Add a helper column, DEL (to catch the trailing semicolons).

5. Write all data rows to the new file.

With these adjustments, the data set can be imported and the helper column deleted after the import. But first, the cleaning code:

In [67]: new_file = open(‘./data/es50.txt’, ‘w’) # opens a new file

new_file.writelines(‘date’ + lines[3][:-1] + ‘;DEL’ + lines[3][-1])

# writes the corrected third line of the original file

# as first line of new file

new_file.writelines(lines[4:])

# writes the remaining lines of the orignial file

new_file.close()

Let us see how the new header looks:

In [68]: new_lines = open(‘./data/es50.txt’, ‘r’).readlines() new_lines[:5] Out[68]: [‘date;SX5P;SX5E;SXXP;SXXE;SXXF;SXXA;DK5F;DKXF;DEL\n’, ‘31.12.1986;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’, ‘01.01.1987;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’, ‘02.01.1987;770.89;891.78;82.57;97.80;97.43;69.37;647.62;65.81\n’, ‘05.01.1987;771.89;898.33;82.82;98.60;98.19;69.16;649.94;65.82\n’]

It looks appropriate for the import with the read_csv function of pandas, so we continue:

In [69]: es = pd.read_csv(‘./data/es50.txt’, index_col=0,

parse_dates=True, sep=‘;’, dayfirst=True)

In [70]: np.round(es.tail())

Out[70]: SX5P SX5E SXXP SXXE SXXF SXXA DK5F DKXF DEL date

2014-09-22 3096 3257 347 326 403 357 9703 565 NaN 2014-09-23 3058 3206 342 321 398 353 9602 558 NaN 2014-09-24 3086 3244 344 323 401 355 9629 560 NaN 2014-09-25 3059 3202 341 320 397 353 9538 556 NaN

The helper column has fulfilled its purpose and can now be deleted: In [71]: del es[‘DEL’]

es.info()

Out[71]: <class ‘pandas.core.frame.DataFrame’>

DatetimeIndex: 7153 entries, 1986-12-31 00:00:00 to 2014-09-26 00:00:00 Data columns (total 8 columns):

SX5P 7153 non-null float64 SX5E 7153 non-null float64 SXXP 7153 non-null float64 SXXE 7153 non-null float64 SXXF 7153 non-null float64 SXXA 7153 non-null float64 DK5F 7153 non-null float64 DKXF 7153 non-null float64 dtypes: float64(8)

Equipped with the knowledge about the structure of the EURO STOXX 50 data set, we can also use the advanced capabilities of the read_csv function to make the import more

compact and efficient:

In [72]: cols = [‘SX5P’, ‘SX5E’, ‘SXXP’, ‘SXXE’, ‘SXXF’, ‘SXXA’, ‘DK5F’, ‘DKXF’]

es = pd.read_csv(es_url, index_col=0, parse_dates=True, sep=‘;’, dayfirst=True, header=None, skiprows=4, names=cols)

In [73]: es.tail()

Out[73]: SX5P SX5E SXXP SXXE SXXF SXXA DK5F DKXF 2014-09-22 3096.02 3257.48 346.69 325.68 403.16 357.08 9703.33 564.81 2014-09-23 3057.89 3205.93 341.89 320.72 397.96 352.56 9602.32 558.35 2014-09-24 3086.12 3244.01 344.35 323.42 400.58 354.72 9628.84 559.83 2014-09-25 3059.01 3202.31 341.44 319.77 396.90 352.58 9537.95 555.51 2014-09-26 3063.71 3219.58 342.30 321.39 398.33 352.71 9558.51 556.57

Fortunately, the VSTOXX data set is already in a form such that it can be imported a bit more easily into a DataFrame object:

In [74]: vs = pd.read_csv(‘./data/vs.txt’, index_col=0, header=2, parse_dates=True, sep=‘,’, dayfirst=True) vs.info()

Out[74]: <class ‘pandas.core.frame.DataFrame’>

DatetimeIndex: 4010 entries, 1999-01-04 00:00:00 to 2014-09-26 00:00:00 Data columns (total 9 columns):

V2TX 4010 non-null float64 V6I1 3591 non-null float64 V6I2 4010 non-null float64 V6I3 3960 non-null float64 V6I4 4010 non-null float64 V6I5 4010 non-null float64 V6I6 3995 non-null float64 V6I7 4010 non-null float64 V6I8 3999 non-null float64 dtypes: float64(9)

Table 6-6 contains the parameters of this important import function. There are a multitude of parameters, the majority of which default to None; object, of course, is nondefault and

has to be specified in any case.

Table 6-6. Parameters of read_csv function

object String File path, URL, or other source

sep String, default “,” Delimiter to use

lineterminator String (one character) String for line breaks

quotechar String Character for quotes

quoting Integer Controls recognition of quotes

escapechar String String for escaping

dtpye dtype/dict dict of dtype(s) for column(s)

compression “gzip”/“bz2” For decompression of data

dialect String/ csv.Dialect CSV dialect, default Excel

header Integer Number of header rows

skiprows Integer Number of rows to skip

index_col Integer Number of index columns (sequence for multi-index)

names Array-like Column names if no header rows

prefix String String to add to column numbers if no header names

na_values List/ dict Additional strings to recognize as NA, NaN

true_values List Values to consider as True

false_values List Values to consider as False

parse_dates

Boolean/list, default FalseWhether to parse dates in index columns or multiple columns

keep_date_col Boolean, default False Keeps original date columns

dayfirst Boolean, default False For European date convention DD/MM

thousands String Thousands operator

comment String Rest of line as comment (not to be parsed)

decimal String String to indicate decimal, e.g., “.” or “,”

nrows Integer Number of rows of file to read

iterator Boolean, default False Return TextFileReader object

chunksize Integer Return TextFileReader object for iteration

skipfooter Integer Number of lines to skip at bottom

converters Dictionary Function to convert/translate column data

verbose Boolean, default False Report number of NA values in nonnumeric columns

delimiter String Alternative to sep, can contain regular expressions

encoding String Encoding to use, e.g., “ UTF-8”

squeeze Boolean, default False Return one-column data sets as Series

na_filter Boolean, default False Detect missing value markers automatically

usecols Array-like Selection of columns to use

tupleize_cols Boolean, default False Leave a list of tuples on columns as is

To implement the regression analysis, we only need one column from each data set. We therefore generate a new DataFrame object within which we combine the two columns of

interest, namely those for the major indexes. Since VSTOXX data is only available from the beginning of January 1999, we only take data from that date on:

In [75]: import datetime as dt

data = pd.DataFrame({‘EUROSTOXX’ :

es[‘SX5E’][es.index > dt.datetime(1999, 1, 1)]}) data = data.join(pd.DataFrame({‘VSTOXX’ :

vs[‘V2TX’][vs.index > dt.datetime(1999, 1, 1)]}))

We also fill missing values with the last available values from the time series. We call the

fillna method, providing ffill (for forward fill) as the method parameter. Another

option would be bfill (for backward fill), which would however lead to a “foresight”

issue:

In [76]: data = data.fillna(method=‘ffill’) data.info()

Out[76]: <class ‘pandas.core.frame.DataFrame’>

DatetimeIndex: 4034 entries, 1999-01-04 00:00:00 to 2014-09-26 00:00:00 Data columns (total 2 columns):

EUROSTOXX 4034 non-null float64 VSTOXX 4034 non-null float64 dtypes: float64(2)

In [77]: data.tail()

Out[77]: EUROSTOXX VSTOXX 2014-09-22 3257.48 15.8303 2014-09-23 3205.93 17.7684 2014-09-24 3244.01 15.9504 2014-09-25 3202.31 17.5658 2014-09-26 3219.58 17.6012

Again, a graphical representation of the new data set might provide some insights. Indeed, as Figure 6-7 shows, there seems to be a negative correlation between the two indexes:

In [78]: data.plot(subplots=True, grid=True, style=‘b’, figsize=(8, 6))

However, to put this on more formal ground, we want to work again with the log returns of the two financial time series. Figure 6-8 shows these graphically:

In [79]: rets = np.log(data / data.shift(1)) rets.head()

Out[79]: EUROSTOXX VSTOXX 1999-01-04 NaN NaN 1999-01-05 0.017228 0.489248 1999-01-06 0.022138 -0.165317 1999-01-07 -0.015723 0.256337 1999-01-08 -0.003120 0.021570

In [80]: rets.plot(subplots=True, grid=True, style=‘b’, figsize=(8, 6))

Figure 6-8. Log returns of EURO STOXX 50 and VSTOXX

We have everything together to implement the regression analysis. In what follows, the EURO STOXX 50 returns are taken as the independent variable while the VSTOXX returns are taken as the dependent variable:

In [81]: xdat = rets[‘EUROSTOXX’] ydat = rets[‘VSTOXX’]

model = pd.ols(y=ydat, x=xdat) model

Out[81]: ––––––––-Summary of Regression Analysis–––––- –––

Formula: Y ~ <x> + <intercept> Number of Observations: 4033 Number of Degrees of Freedom: 2 R-squared: 0.5322

Adj R-squared: 0.5321 Rmse: 0.0389

F-stat (1, 4031): 4586.3942, p-value: 0.0000 Degrees of Freedom: model 1, resid 4031

–––––––—Summary of Estimated Coefficients––––– –––

Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%

–––––––––––––––––––––––— –––

x -2.7529 0.0406 -67.72 0.0000 -2.8326 -2.6732

intercept -0.0001 0.0006 -0.12 0.9043 -0.0013 0.0011

–––––––––––End of Summary–––––––– –––

Obviously, there is indeed a highly negative correlation. We can access the results as follows:

In [82]: model.beta

Out[82]: x -2.752894 intercept -0.000074 dtype: float64

This input, in combination with the raw log return data, is used to generate the plot in Figure 6-9, which provides strong support for the leverage effect:

In [83]: plt.plot(xdat, ydat, ‘r.’)

ax = plt.axis() # grab axis values

x = np.linspace(ax[0], ax[1] + 0.01)

plt.plot(x, model.beta[1] + model.beta[0] * x, ‘b’, lw=2) plt.grid(True)

plt.axis(‘tight’)

plt.xlabel(‘EURO STOXX 50 returns’) plt.ylabel(‘VSTOXX returns’)

Figure 6-9. Scatter plot of log returns and regression line

As a final cross-check, we can calculate the correlation between the two financial time series directly:

In [84]: rets.corr()

Out[84]: EUROSTOXX VSTOXX EUROSTOXX 1.000000 -0.729538 VSTOXX -0.729538 1.000000

Although the correlation is strongly negative on the whole data set, it varies considerably over time, as shown in Figure 6-10. The figure uses correlation on a yearly basis, i.e., for 252 trading days:

In [85]: pd.rolling_corr(rets[‘EUROSTOXX’], rets[‘VSTOXX’], window=252).plot(grid=True, style=‘b’)

In document Python for Finance Analyze Big Financial Data {TechieAce} pdf (Page 152-161)