6.2 Classification Results and Model Adherence
6.2.2 Regression Analysis
To further assess how well the classification results adhere to real-world observations, a statistical regression analysis was performed to evaluate the correlation between the observed behavior of the stock price and the classification results, using the same pairs of variables as in the previous sections.
The objective of this section is to go beyond the simple visual analysis of the plotted data and to provide statistical support for the conclusions inferred.
6.2.2.1 Microsoft Inc.
As in the previous section, the same sets of variables are tested in the regression, using the Minitab 16 software.
The first test concerns the significance of the net positive tweets as the independent explanatory variable for the daily change, in percent, of the stock closing price.
Initially, looking at the Residual Analysis below, it is possible to infer that the dataset is suitable for the regression, with no observations showing large residuals.
[Minitab Diagnostic Report 1, Regression for Price Change (%) vs Net Positive: Residuals vs Fitted Values plot (x-axis: Fitted Value; y-axis: Residual).]
Figure 9 Residual analysis for Price change and Net positive, Microsoft Inc.
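The kind of check shown in the diagnostic report can be sketched outside Minitab. The Python sketch below fits a simple least-squares line and flags observations whose standardized residual exceeds 2 in absolute value. The data, the function names and the threshold of 2 are illustrative assumptions, not values from this study, and the standardization used here (residual divided by the residual standard error) is a simplification that does not necessarily match the rule Minitab applies internally.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def large_residuals(x, y, threshold=2.0):
    """Indices of points whose residual / residual standard error exceeds the threshold."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters).
    s = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return [i for i, r in enumerate(residuals) if abs(r / s) > threshold]

# Hypothetical data: roughly y = 2x + 1, with one deliberately unusual point at index 8.
x_demo = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y_demo = [1.1, 2.9, 5.1, 6.9, 9.1, 10.9, 13.1, 14.9, 30.0, 18.9]
flagged = large_residuals(x_demo, y_demo)
print(flagged)
```

An empty list from `large_residuals` would correspond to the situation described for this dataset, where no observation shows a large residual.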
The regression results, on the other hand, were in line with what one could expect from observing chart 2: the relationship between the net positive value and the stock price variation is not statistically significant at the 5% significance threshold (p = 0,064).
[Minitab Summary Report, Regression for Price Change (%) vs Net Positive: Fitted Line Plot for Cubic Model (X: Net Positive; Y: Price Change (%)).]
Is there a relationship between Y and X? No: P = 0,064. The relationship between Price Change (%) and Net Positive is not statistically significant (p > 0,05).
% of variation accounted for by model: R-sq (adj) = 15,83%.
Fitted equation for the cubic model: Y = 0,01910 - 0,000909 X + 0,000014 X**2 - 0,000000 X**3
Figure 10 Regression for Net positive and Price change, Microsoft Inc.
Nonetheless, Minitab fitted a cubic equation to the data, which is shown in the figure above.
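The report quotes an adjusted R² rather than the plain R². A common definition, assumed here because the text does not state which formula Minitab applies, penalizes R² for the number of fitted predictor terms. This matters when a cubic model (three terms) is compared with a straight line (one term). The sample size and R² values in the sketch are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_terms):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1).

    r2      : plain coefficient of determination, between 0 and 1
    n_obs   : number of observations (n)
    n_terms : number of predictor terms (p), excluding the intercept
    """
    if n_obs - n_terms - 1 <= 0:
        raise ValueError("not enough observations for this many terms")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

# Hypothetical example: the same plain R2 of 0.50 on 11 observations,
# once for a linear model (1 term) and once for a cubic model (3 terms).
linear_adj = adjusted_r2(0.50, 11, 1)
cubic_adj = adjusted_r2(0.50, 11, 3)
print(round(linear_adj, 4), round(cubic_adj, 4))
```

For an identical plain R², the cubic model receives the lower adjusted value, so a higher-order fit is only rewarded when its extra terms explain enough additional variation.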
As for the accumulated net positive value versus the closing price, the results were remarkably different. Although the Residual Analysis showed 2 observations with somewhat large residuals, the points were better distributed along the fitted value axis, which is favorable to the quality of the regression (Ramos, 2010 [80]). The outliers may correspond to unusual price variations caused by singular events in the sample.
[Minitab Diagnostic Report 1, Regression for Closing Price vs Net Positive Accumulated: Residuals vs Fitted Values plot (x-axis: Fitted Value; y-axis: Residual).]
Figure 11 Residual analysis for Closing price and Accumulated net positive, Microsoft Inc.
For this pair of variables, the analysis provided a quadratic equation with excellent statistical significance and R² value. It is therefore possible to infer that the accumulated net positive value correlates well with the observed closing price, and that the fitted equation explains a large portion (over 90%) of the observed variation.
[Minitab Summary Report, Regression for Closing Price vs Net Positive Accumulated: Fitted Line Plot for Quadratic Model (X: Net Positive Accumulated; Y: Closing Price).]
Is there a relationship between Y and X? Yes: P = 0,000. The relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).
% of variation accounted for by model: R-sq (adj) = 90,72%.
Fitted equation for the quadratic model: Y = 28,81 + 0,005878 X - 0,000001 X**2
Figure 12 Regression for Closing price and Accumulated net positive, Microsoft Inc.
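As a rough illustration of how the quadratic equation reported for Microsoft can be used, the sketch below evaluates it for a few accumulated net positive values. The coefficients are copied from the summary report, with decimal commas written as points. The evaluation points are arbitrary choices within the plotted range, and the X**2 coefficient is only printed to six decimals, so the results should be read as approximations.

```python
# Coefficients as printed in the report: Y = 28,81 + 0,005878 X - 0,000001 X**2
INTERCEPT = 28.81
LINEAR = 0.005878
QUADRATIC = -0.000001  # rounded to six decimals in the report

def predicted_close(net_positive_accumulated):
    """Closing price predicted by the reported quadratic model."""
    x = net_positive_accumulated
    return INTERCEPT + LINEAR * x + QUADRATIC * x ** 2

# Arbitrary evaluation points inside the plotted X range (0 to 2000).
for x in (0, 500, 1000, 1500, 2000):
    print(x, round(predicted_close(x), 2))
```

With the printed coefficients, the prediction rises from 28,81 at X = 0 to about 36,57 at X = 2000, which is consistent with the closing price range shown on the plot axis.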
Finally, for the traded volume as a function of the number of tweets on a given day, Minitab provided the following results:
[Minitab Diagnostic Report 1, Regression for Volume vs Total: Residuals vs Fitted Values plot (x-axis: Fitted Value; y-axis: Residual).]
Figure 13 Residual analysis for Traded volume and Total number of tweets, Microsoft Inc.
An observation must be made regarding the outliers identified in the analysis. The point at the bottom, with far less volume than would be expected for the number of tweets, corresponds to 21 May 2013, the date on which Microsoft presented its new video game console, the Xbox One. Even if such an event could affect the traded volume of the stock, it certainly also attracted the attention of people who do not trade stocks at all, creating this unusual observation.
[Minitab Summary Report, Regression for Volume vs Total: Fitted Line Plot for Linear Model (X: Total; Y: Volume).]
Is there a relationship between Y and X? Yes: P = 0,001. The relationship between Volume and Total is statistically significant (p < 0,05).
% of variation accounted for by model: R-sq (adj) = 31,21%.
Correlation between Y and X: the positive correlation (r = 0,58) indicates that when Total increases, Volume also tends to increase.
Fitted equation for the linear model: Y = 29325108 + 107660 X
Figure 14 Regression for Traded volume and Total number of tweets, Microsoft Inc.
Even taking this point into account, the statistical significance of the relationship between the two variables was confirmed, and a linear equation relating them was obtained. The R² value was lower than expected, although it reached a higher value (over 47%) when the 21/05/2013 outlier was removed from the dataset, as can be seen below:
[Minitab Summary Report, Regression for Volume_1 vs Total_1: Fitted Line Plot for Linear Model (X: Total_1; Y: Volume_1).]
Is there a relationship between Y and X? Yes: P = 0,000. The relationship between Volume_1 and Total_1 is statistically significant (p < 0,05).
% of variation accounted for by model: R-sq (adj) = 47,31%.
Correlation between Y and X: the positive correlation (r = 0,70) indicates that when Total_1 increases, Volume_1 also tends to increase.
Fitted equation for the linear model: Y = 20562074 + 146317 X
Figure 15 Regression for Traded volume and Total number of tweets (without an outlier of 21/05/2013), Microsoft Inc.
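The effect of removing the 21/05/2013 observation can be made concrete by comparing the two fitted lines reported above. The sketch below is only an illustration: the coefficients are copied from the two summary reports, while the tweet count of 300 and the helper names are hypothetical.

```python
# (intercept, slope) pairs copied from the two Minitab summary reports.
WITH_OUTLIER = (29325108, 107660)      # Y = 29325108 + 107660 X
WITHOUT_OUTLIER = (20562074, 146317)   # Y = 20562074 + 146317 X

def predicted_volume(model, tweets):
    """Traded volume predicted by a linear model for a given number of tweets."""
    intercept, slope = model
    return intercept + slope * tweets

def crossing_point(model_a, model_b):
    """Number of tweets at which the two lines predict the same volume."""
    (a0, a1), (b0, b1) = model_a, model_b
    return (a0 - b0) / (b1 - a1)

tweets = 300  # hypothetical daily tweet count, inside the plotted range
print(predicted_volume(WITH_OUTLIER, tweets))
print(predicted_volume(WITHOUT_OUTLIER, tweets))
print(round(crossing_point(WITH_OUTLIER, WITHOUT_OUTLIER), 1))
```

Without the outlier the slope is steeper and the intercept lower, so the two lines cross at roughly 227 tweets. Above that count, the model fitted without the outlier predicts the higher volume.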
6.2.2.2 Google Inc.
Following a procedure analogous to the one applied to Microsoft Inc., the first regression analysis carried out for Google used the daily net positive value as the explanatory variable for the daily price change of the Google stock.
[Minitab Diagnostic Report 1, Regression for Price Change (%) vs Net Positive: Residuals vs Fitted Values plot (x-axis: Fitted Value; y-axis: Residual).]
Figure 16 Residual analysis for Price change and Net positive, Google Inc.
The residual plot exhibits 2 outliers but, in general, the regression results, as shown below, were more satisfactory than those of the equivalent Microsoft experiment:
[Minitab Summary Report, Regression for Price Change (%) vs Net Positive: Fitted Line Plot for Linear Model (X: Net Positive; Y: Price Change (%)).]
Is there a relationship between Y and X? Yes: P = 0,001. The relationship between Price Change (%) and Net Positive is statistically significant (p < 0,05).
% of variation accounted for by model: R-sq (adj) = 32,37%.
Correlation between Y and X: the positive correlation (r = 0,59) indicates that when Net Positive increases, Price Change (%) also tends to increase.
Fitted equation for the linear model: Y = - 0,004668 + 0,000081 X
The obtained p-value denotes a significant linear relationship between the two variables, with a positive correlation, as expected. However, the R² value of 32,37% denotes a low explanatory capacity of the model, which was able to explain only about 1/3 of the observed variation in the Y variable (change in price).
As for the accumulated net positive value as an explanatory variable for the closing price, the results were once more remarkably good.
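Two quantities follow directly from the reported figures and help interpret the linear model for Google. In a simple linear regression, the plain R² equals the square of the correlation coefficient r, and the fitted line crosses zero price change at X = -intercept / slope. The sketch below computes both from the values printed in the report (decimal commas written as points). Since those values are rounded, the results are approximate, and the variable names are illustrative.

```python
# Values as printed in the report for Price Change (%) vs Net Positive.
INTERCEPT = -0.004668
SLOPE = 0.000081
R = 0.59

# Plain R-squared implied by the correlation coefficient (simple linear regression).
r_squared = R ** 2

# Net positive value at which the fitted line predicts a zero daily price change.
break_even_net_positive = -INTERCEPT / SLOPE

print(round(r_squared, 4))
print(round(break_even_net_positive, 1))
```

The implied plain R² of about 0,35 sits slightly above the adjusted value of 32,37%, as expected. According to the fitted line, a day needs a net positive value of roughly 58 before a price increase is predicted.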
[Minitab Diagnostic Report 1, Regression for Closing Price vs Net Positive Accumulated: Residuals vs Fitted Values plot (x-axis: Fitted Value; y-axis: Residual).]
Figure 18 Residual analysis for Closing price and Accumulated net positive, Google Inc.
Only one outlier and a well-distributed plot of residuals indicate a good set of observations for performing the regression analysis.
[Minitab Summary Report, Regression for Closing Price vs Net Positive Accumulated: Fitted Line Plot for Cubic Model (X: Net Positive Accumulated; Y: Closing Price).]
Is there a relationship between Y and X? Yes: P = 0,000. The relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).
% of variation accounted for by model: R-sq (adj) = 97,56%.
Fitted equation for the cubic model: Y = 770,7 + 0,06772 X + 0,000025 X**2 - 0,000000 X**3
Figure 19 Regression for Closing price and Accumulated net positive, Google Inc.
The cubic equation relating the Closing Price to the Net Positive Accumulated has a near-perfect adherence to the observations, with a reported p-value of 0,000 and an adjusted R² value of 97,56%.
It is, however, unexpected to see the behavior in the upper-right part of the plotted results, where the accumulated net positive grows while the price moves down. To address this issue, a linear regression analysis was also carried out. It presented a slightly worse residual plot (in the sense that the observations were less spread apart), while retaining a high level of significance and a good R² level of 76%.
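The downturn in the upper-right part of the cubic fit is a general property of the model form: a cubic with a negative X**3 coefficient must eventually decrease as X grows. The sketch below locates the turning points of a cubic from its derivative. The report prints the X**3 coefficient only as -0,000000, so the value of -1e-8 used in the second call is purely a hypothetical stand-in that is consistent with that rounding. It is not the coefficient Minitab estimated.

```python
import math

def turning_points(c1, c2, c3):
    """Real X values where dY/dX = c1 + 2*c2*X + 3*c3*X**2 equals zero.

    Applies to Y = c0 + c1*X + c2*X**2 + c3*X**3 with c3 != 0.
    Returns the roots sorted in ascending order (empty list if none).
    """
    a, b, c = 3.0 * c3, 2.0 * c2, c1
    if a == 0:
        raise ValueError("c3 must be non-zero for a cubic model")
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])

# Simple check: Y = 3X - X**3 has turning points at X = -1 and X = 1.
print(turning_points(3.0, 0.0, -1.0))

# Reported c1 and c2 for Google, combined with a HYPOTHETICAL c3 of -1e-8.
hypothetical = turning_points(0.06772, 0.000025, -1e-8)
print([round(x) for x in hypothetical])
```

Under that assumed coefficient, the fitted curve would peak at an accumulated net positive of roughly 2 550 and fall beyond it. This illustrates why the cubic can bend downward at the right edge of the data, whereas a linear model cannot.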
[Minitab Diagnostic Report 1, Regression for Closing Price vs Net Positive Accumulated (linear model): Residuals vs Fitted Values plot (x-axis: Fitted Value; y-axis: Residual).]
Figure 20 Residual analysis for Closing price and Accumulated net positive, Linear regression, Google Inc.
Is there a relationship between Y and X? Yes: P = 0,000. The relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).