
EU-ASEAN Virtual Capacity Building Workshops

Advanced Statistics (workshop B3)

Wednesday, 9th June 2021

Expert presenters: Selina Patel, Richard Potter

Chair: Paul Foley

1. Overview and objectives

Becoming a digital economy and society requires bringing together evidence to measure aspects of society and the economy, along with the statistical skills needed to implement and analyse these measures.

This workshop is the third in the series on how information, policy and strategy can work together, using statistics to help describe the data in ADIX. It covers ways in which a variety of methods can be used to show aspects such as similarities and differences, and how to present information to help explain these. It examines the role that data can play in this (for example through data modelling) and how ADIX could contribute.

The workshop was structured to:

• Examine ways in which existing data can be used;

• Investigate various forms of presenting data using visualisation methods;

• Understand the process and usefulness of variance testing and statistical analysis;

• Look at factors that will affect future ADIX data collection and analysis.

2. Examining data sets, Richard Potter

The first section of the webinar examined what simple techniques can be used to gain the most value from the type of data which might initially be in ADIX.

Data is examined from two variables which measure the proportions of graduates (one in ICT¹ and one in STEM²). The number of years for which this data has been made available differs between nations. Comparisons can be made which take these differences into account, though the level of uncertainty should be acknowledged.

A second point concerns how data is defined. Data labelled “proportion of employment in knowledge-intensive services” covers various employment roles (classified through the International Standard Classification of Occupations) which can be defined as management roles. The relationship with a digital society is not clear (e.g. the increased use of computers in a company can lead to a reduction in the number of managers).

¹ Information and Communications Technology

² Science, Technology, Engineering and Mathematics

The third aspect examined is how different variables can be brought together into an index. This looked at ranking variables which measure different things and how these ranks can be combined into an overall score.
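The ranking approach can be illustrated with a minimal sketch. It assumes equal weights, no tied values, and made-up figures for three placeholder countries; the variable names and numbers are hypothetical and are not ADIX data or the presenters' own method.

```python
from statistics import mean

# Hypothetical indicator values for three placeholder countries (not ADIX data).
indicators = {
    "ict_graduates_pct": {"A": 4.0, "B": 6.5, "C": 5.0},
    "stem_graduates_pct": {"A": 22.0, "B": 18.0, "C": 25.0},
}


def rank_scores(values):
    """Rank countries on one variable: 1 = lowest value, n = highest value."""
    ordered = sorted(values, key=values.get)
    return {country: position for position, country in enumerate(ordered, start=1)}


def composite_index(indicators):
    """Average each country's rank across all variables (equal weights assumed)."""
    ranks = [rank_scores(values) for values in indicators.values()]
    countries = ranks[0].keys()
    return {country: mean(r[country] for r in ranks) for country in countries}


index = composite_index(indicators)
for country, score in sorted(index.items(), key=lambda item: item[1], reverse=True):
    print(country, score)
```

Because ranks are unit-free, variables measured on different scales can be combined directly. The cost is that the size of the gap between countries is lost, which is worth stating when the index is presented.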

3. Data Presentation and Visualisations

This section looked at various forms of data visualisation techniques to not only present data but to compare and draw meaningful conclusions. Clear and visually pleasing visualisations are crucial to attract attention and make the presented data easy to internalise.

Aims of using visualisation techniques include:

• To be able to easily extract key descriptive statistics

• To identify trends or patterns

• To make comparisons

• To identify outliers.

Suggested software for creating visualisations is either Microsoft Excel or Tableau, for simple and straightforward solutions. Both have built-in features, so the method is simply to input data, select the desired visualisation and personalise it to suit the purpose. Within business, Microsoft Power BI is popular and often used. Again, it is fairly straightforward and presents data in a dashboard. There are also ways of coding visualisations in languages such as Python and R. These require downloadable packages and some training, or a quick tutorial that can easily be found online.

Some types of visually pleasing and informative data visualisations discussed:

• Bar Charts

• Scatter Plots

• Line Graphs

• Box Plots

• Mapping
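As an illustration of what a box plot summarises, the sketch below computes the quartiles and flags outliers using the common 1.5 x IQR rule. The rule, the function name and the sample values are assumptions made for illustration; the workshop did not specify how outliers should be defined.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the statistics a box plot draws, plus points outside the whiskers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # assumed outlier rule: 1.5 x interquartile range
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "outliers": outliers,
    }


# Hypothetical values for one variable across eight countries.
values = [12, 15, 14, 13, 16, 15, 14, 40]
summary = box_plot_summary(values)
print(summary)
```

Plotting tools such as Excel, Tableau or R calculate these figures automatically, but knowing how they are derived helps when explaining why a country appears as an outlier.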


After discussing data visualisation, one of the participants was kind enough to share her first attempt at programming in R to create a graph. Hopefully today’s session helped to improve her skills and/or suggest other data visualisation software that creates graphics automatically.

4. Variance Testing

This section investigates differences between nations’ data and explains why these differences occur using statistical analysis methods. Variance testing, or variance analysis, examines the deviation of actual behaviour from planned or predicted behaviour, identifying variability in order to explain different outcomes.

Key elements of analysis include:

• Cluster analysis: techniques used in cluster analysis endeavour to group together similar objects. The main aim of performing cluster analysis on these data sets is to get an idea of how similar each country is, based on the dependent and independent data.

• Principal Component analysis: as a visualisation tool it is a form of dimensionality reduction, projecting the data onto two principal components plotted on the x and y axes.

• Significance testing: also called hypothesis testing, this tests whether a claim has significant results. In terms of variance testing, the claim will be whether the variance between two groups (countries) is equal or significantly different (not due to randomness). Analysis of significance is done by conducting ANOVA tests and interpreting the results.

• Regression analysis: a regression model is a method of estimating the relationship between a dependent variable and independent variables. The most common forms of regression are linear (in the case of only one independent variable) and multiple linear (in the case of having more than one independent variable).

• Residual analysis: residuals are the difference between the observed and predicted values in a model (residual = observed value - predicted value). By plotting this difference, we can visually see where we would expect countries to be performing based on the model in comparison to how they are actually performing.

• Residual Ranking: a table that presents the results of the residual analysis in order of residual value. The residual analysis can work as an evaluation tool in this sense by using residuals to determine an error score.

• Variance Explained: the result of this technique gives percentages corresponding to each independent variable indicating how much of the variance in the dependent variable can be explained by that specific independent variable.
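The regression, residual and variance-explained steps above can be sketched together for the simplest case of one independent variable. The country labels and values are hypothetical. With a single predictor, the variance explained reduces to the overall R-squared rather than a percentage per variable.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: one independent and one dependent variable per country.
countries = ["A", "B", "C", "D", "E"]
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]

# Residual = observed value - predicted value.
residuals = {c: yi - pi for c, yi, pi in zip(countries, y, predicted)}

# Residual ranking: countries performing furthest above the model come first.
ranking = sorted(residuals, key=residuals.get, reverse=True)

# Variance explained (R-squared): share of the variation in y captured by the model.
ss_res = sum(r ** 2 for r in residuals.values())
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(ranking, round(r_squared, 4))
```

A positive residual means a country is doing better than the model predicts from its independent variable, and a negative residual means it is doing worse.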

5. Future Work

This section focused on potential work in the future of ADIX.

It covered key elements to be considered in future data collection and analysis, and ways to address them:

• Inflation: account for percentage changes in such data by implementing an adjustment factor.


• Units: if changes in data collection cause a change in units, a conversion rate should be implemented to account for this.

• Obsolete Variables: over time collection of certain variables may stop. Hence, a suitable replacement that is as close as possible to the original needs to be found.

Once multiple years of ADIX data have been collected, the idea of forecasting may seem attractive. Forecasting does not always mean predicting into the future. It can also mean predicting based on different circumstances or predicting over a period of time.

Three techniques for three different types of forecasting were discussed:

• Straight Line: the straight line method is the simplest form of forecasting. The negatives to this are the assumptions that growth is constant and that growth over time follows a linear progression. This form of forecasting predicts into the future.

• Moving Average: the moving average takes into account past data. This method is used to smooth out the data and give information about the average over time.

• Regression: regression analysis for forecasting is the most popular method. By creating a regression equation, the independent variables can be toggled to investigate a new dependent variable value. This form of forecasting predicts an outcome based on different circumstances.
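A minimal sketch of the three techniques is given below. The series, the function names and the regression coefficients are hypothetical; in practice the coefficients would come from a model fitted to real data.

```python
def straight_line_forecast(history, periods_ahead):
    """Extend the series assuming the average past change per period continues."""
    step = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + step * k for k in range(1, periods_ahead + 1)]


def moving_average(history, window):
    """Smooth the series by averaging each run of `window` consecutive values."""
    return [
        sum(history[i - window:i]) / window
        for i in range(window, len(history) + 1)
    ]


def regression_scenario(intercept, slope, x_value):
    """Predict the dependent variable for a chosen ('toggled') independent value."""
    return intercept + slope * x_value


# Hypothetical yearly values for one indicator.
series = [10.0, 12.0, 14.0, 16.0]

future = straight_line_forecast(series, periods_ahead=2)
smoothed = moving_average(series, window=2)
# Hypothetical coefficients, standing in for a fitted regression equation.
scenario = regression_scenario(intercept=0.09, slope=1.97, x_value=6)

print(future, smoothed, scenario)
```

The straight line function predicts into the future, the moving average describes the past trend, and the regression scenario answers a "what if" question about different circumstances.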

6. In Conclusion

• Data presentation and visualisation techniques must be attractive and clear. Be aware of the audience and what the visualisation is intended to show.

• Cluster and principal component analysis are not essential but are useful add-ons to better explain and understand the raw data.

• Variance testing is used to compare actual collected data against what is expected or predicted, and to provide reasons for any difference.

• Bear in mind changes between data collections and methods to combat these.

• Forecasting can be a useful innovative activity once a significant amount of data has been collected.

7. Key questions raised in the workshop

Participants raised a large number of excellent questions. Key questions and an overview of responses are provided below. Thanks to all those that asked the questions.

7.1. Question 1 A number of data visualisation gurus dislike pie graphs. Your example showed some of their limitations. Is there ever a place for pie graphs?

Pie graphs often obscure differences in data presented. The image below from Selina’s presentation demonstrates the limitations.


7.2. Question 2 Data visualisation sometimes promotes the idea of causality rather than correlation. Are there ways to avoid this?

Correlation examines links between two or more data sets. Causality shows a causative link between two variables. It is always advisable to stress in presentations the difference between correlation and causality, i.e. just because a relationship is seen, it does not mean that one change causes the other.

7.3. Question 3 Most countries have to provide the UN, World Bank, ILO and other organisations with quite a lot of information. Is there somebody in your Ministry or organisations that co-ordinates this task or is each request considered separately?

Respondents generally stated that their National Statistics Agency dealt with most information requests from international organisations. Telecommunications Ministries and organisations generally responded to telecoms and technology queries directly.

7.4. Question 4 What data visualisation software have you found most useful?

Discussions identified several software tools. Several can be used for free, while others provide a short period (usually a week) for appraisal prior to purchase.

The most mentioned software tools were:

• Static visualisation software - Power BI, Spotfire, Tableau, Raw Graphics. Also Illustrator for infographics, and template sites such as Canva and Venngage

• Interactive software - Datawrapper, Flourish

• Storytelling - Shorthand, Strikingly

• Dynamic charts - Highcharts, D3

7.5. Question 5 One problem often encountered is incomplete data and missing values. What methods do you use to try and find missing values?

Linear interpolation is good, but this method only works if your data has a linear form or a single line of best fit.
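As a minimal sketch of linear interpolation, the function below fills gaps that lie between two known values and leaves leading or trailing gaps untouched. The function name, the use of None for a missing value and the sample series are assumptions made for illustration.

```python
def interpolate_missing(series):
    """Fill None gaps lying between two known values by linear interpolation."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical yearly values with two missing years and a missing final year.
observed = [10.0, None, None, 16.0, None]
completed = interpolate_missing(observed)
print(completed)
```

The final value stays missing because there is no later observation to interpolate towards. Filling it would be extrapolation, which relies on a stronger assumption about the trend.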


Amelia II is generally regarded as the best program for computing missing values. It works with Windows and runs in R, the statistical package. The software can be downloaded at https://gking.harvard.edu/amelia. It is always worth ‘eyeballing’ the predicted missing values because we have found that rogue or strange values can sometimes be created.

Survey responses can be lower for certain groups, e.g. a group is 10% of the population but only gives 5% of the survey replies. In these cases the answers for the groups with lower response rates can be increased (weighted) so the number or proportion of answers fits with the relative size of those groups.
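The weighting step can be worked through with the 10% / 5% example above. The second group's shares and the group means are hypothetical figures added to complete the illustration.

```python
def response_weights(population_share, reply_share):
    """Weight for each group = share of population / share of survey replies."""
    return {group: population_share[group] / reply_share[group]
            for group in population_share}


def weighted_mean(group_means, reply_share, weights):
    """Overall mean after scaling each group's replies by its weight."""
    total = sum(reply_share[g] * weights[g] for g in group_means)
    return sum(group_means[g] * reply_share[g] * weights[g] for g in group_means) / total


population_share = {"group_a": 0.10, "group_b": 0.90}
reply_share = {"group_a": 0.05, "group_b": 0.95}

weights = response_weights(population_share, reply_share)

# Hypothetical share of each group answering "yes" to a survey question.
group_means = {"group_a": 0.8, "group_b": 0.4}
overall = weighted_mean(group_means, reply_share, weights)

print(weights, overall)
```

Here group_a's answers count double (weight 2.0) because the group supplied half as many replies as its population share would suggest.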

7.6. Question 6 What forecasting tools or methods have you found most helpful?

The most common method is the straight line (linear) method, but this assumes a constant rate of growth and a linear progression in all future years.

Moving Average methods are simple to use and explain. Regression provides a useful mathematical way of predicting trends.

For non-linear cases, such as the adoption of technology by citizens and businesses, an s-shaped adoption curve is often more appropriate.
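One common way to model an s-shaped curve is the logistic function, sketched below. The choice of the logistic form and all parameter values are assumptions made for illustration; the workshop did not prescribe a specific curve.

```python
import math


def logistic_adoption(t, saturation, growth_rate, midpoint):
    """Adoption level at time t: slow start, rapid middle, levelling off at saturation."""
    return saturation / (1 + math.exp(-growth_rate * (t - midpoint)))


# Hypothetical parameters: adoption levels off at 100%, with the midpoint in year 5.
curve = [logistic_adoption(t, saturation=100, growth_rate=1.0, midpoint=5)
         for t in range(11)]

for year, level in enumerate(curve):
    print(year, round(level, 1))
```

Unlike a straight line, the curve never exceeds the saturation level, which avoids forecasts such as more than 100% of citizens using a technology.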

8. Workshop overview and conclusions

More than 40 people attended the workshop. Many have been kind enough to say how useful and topical the review of regulatory activities was to them. Key elements from participants’ feedback are provided below.

8.1. Most interesting things learnt

Participants were asked about the most interesting thing they learnt from the workshop. The most popular areas were:

• The variety of data visualisation methods that are available;

• The need to clarify differences between correlation and causality;

• The recommendations for data visualisation software.

8.2. Things learnt that will be most useful for participants in their jobs

• Data visualisation software tools. I look forward to trying them;

• The use of explanations of variance methods to understand why different things happen in different areas or countries;

8.3. Topics for further investigation


• Revising indices with new data (and maybe dropping some of the older data).

Selina Patel, Richard Potter and Paul Foley 9th June 2021