
EU-ASEAN Virtual Capacity Building Workshops

Advanced Statistics (workshop B3)

Wednesday, 9th June 2021

Expert presenters: Selina Patel, Richard Potter

Chair: Paul Foley

1. Overview and objectives

Becoming a digital economy and society requires bringing together evidence that measures aspects of society and the economy, along with the statistical skills needed to apply and analyse it.

This workshop is the third in the series on how information, policy and strategy can work together, using statistics to help describe the data in ADIX. It covered a variety of methods that can be used to show aspects such as similarities and differences, and how to present information to help explain these. It examined the role that data can play in this (for example through data modelling) and how ADIX could contribute.

The workshop was structured to:

• Examine ways in which existing data can be used;

• Investigate various forms of presenting data using visualisation methods;

• Understand the process and usefulness of variance testing and statistical analysis;

• Look at factors that will affect future ADIX data collection and analysis.

2. Examining data sets, Richard Potter

The first section of the webinar examined which simple techniques can be used to gain the most value from the type of data that might initially be in ADIX.

Data were examined from two variables that measure the proportion of graduates (one in ICT¹ and one in STEM²). The number of years for which these data are available differs from nation to nation. Comparisons can be made which take these differences into account, though the level of uncertainty should be acknowledged.

A second point concerned how data are defined. Data labelled “proportion of employment in knowledge-intensive services” cover various employment roles (classified through the International Standard Classification of Occupations) which can be defined as management roles. The relationship with a digital society is not clear (e.g. the increased use of computers in a company can lead to a reduction in the number of managers).

¹ Information and Communications Technology

The third aspect examined was how different variables can be brought together into an index. This looked at ranking variables which measure different things and how these ranks can be combined into an overall score.
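As an illustration of ranking variables and combining them into one score, the sketch below ranks each country on each variable and averages the ranks. It is a minimal example under stated assumptions, not the method used for ADIX: the country labels, variable names and figures are hypothetical, every variable is treated as "higher is better", and all variables are weighted equally.

```python
def rank_scores(values):
    """Rank values so the highest gets rank 1; tied values share the average rank."""
    ordered = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(ordered):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(ordered) and values[ordered[end + 1]] == values[ordered[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[ordered[k]] = average_rank
        pos = end + 1
    return ranks


def composite_index(table):
    """Average the per-variable ranks for each country (lower score = better)."""
    per_variable = [rank_scores(values) for values in table.values()]
    return [sum(column) / len(per_variable) for column in zip(*per_variable)]


# Hypothetical data: one value per country, in the same order as `countries`.
countries = ["A", "B", "C"]
data = {
    "ict_graduates_pct": [4.0, 6.5, 5.0],
    "internet_users_pct": [80.0, 70.0, 90.0],
}

index = composite_index(data)
for country, score in sorted(zip(countries, index), key=lambda pair: pair[1]):
    print(country, score)
```

Averaging ranks avoids mixing units, but it discards the size of the gaps between countries, so an index built this way should be read as an ordering rather than a measurement.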

3. Data Presentation and Visualisations

This section looked at various data visualisation techniques, used not only to present data but also to compare and draw meaningful conclusions. Clear and visually pleasing visualisations are crucial to attract attention and make the presented data easy to internalise.

Aims of using visualisation techniques include:

• To be able to easily extract key descriptive statistics

• To identify trends or patterns

• To make comparisons

• To identify outliers.

Suggested software for creating visualisations is Microsoft Excel or Tableau, for simple and straightforward solutions. Both have built-in features, so the method is simply to input data, select the desired visualisation and personalise it to suit the purpose. Within business, Microsoft Power BI is popular and often used; it is also fairly straightforward and presents data in a dashboard. There are also ways of coding visualisations in languages such as Python and R. These require downloadable packages and some training, or a quick tutorial that can easily be found online.

Some types of visually pleasing and informative data visualisations discussed:

• Bar Charts

• Scatter Plots

• Line Graphs

• Box Plots

• Mapping³
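A box plot, one of the chart types listed above, summarises a variable through its quartiles and flags outliers. The sketch below computes the numbers such a plot would display, using only Python's standard library. The values are hypothetical, and the 1.5 × IQR fence used to flag outliers is a common convention assumed here, not something specified in the workshop.

```python
import statistics


def box_plot_summary(values):
    """Return the quartiles, fences and outliers that a box plot would show."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low_fence": low_fence,
        "high_fence": high_fence,
        "outliers": outliers,
    }


# Hypothetical indicator values for nine countries; one is unusually high.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_plot_summary(values)
print(summary)
```

Here the quartiles are 3, 5 and 7, so the fences sit at -3 and 13 and only the value 100 is flagged as an outlier.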


After discussing data visualisation, one of the participants was kind enough to share her first attempt at programming in R to create a graph. Hopefully today’s session helped to improve her skills and/or suggest other data visualisation software that creates graphics automatically.

4. Variance Testing

This section investigated differences between nations’ data and used statistical analysis methods to explain why these differences occur. Variance testing, or variance analysis, examines the deviation of actual behaviour from planned or predicted behaviour, in order to identify variability and give explanations for different outcomes.

Key elements of analysis include:

• Cluster analysis: the techniques used in cluster analysis endeavour to group similar objects together. The main aim of performing cluster analysis on these data sets is to get an idea of how similar the countries are, based on the dependent and independent data.

• Principal Component analysis: as a visualisation tool it is a form of elastic mapping, as it visually reduces dimensionality to two principal components on the x and y axes.

• Significance testing: also called hypothesis testing, this tests whether a claim is supported by statistically significant results. In terms of variance testing, the claim is whether the variance between two groups (countries) is equal or significantly different (not due to randomness). Significance is analysed by conducting ANOVA tests and interpreting the results.

• Regression analysis: a regression model is a method of estimating the relationship between a dependent variable and independent variables. The most common forms of regression are linear (in the case of only one independent variable) and multiple linear (in the case of having more than one independent variable).

• Residual analysis: residuals are the difference between the observed and predicted values in a model (residual = observed value - predicted value). By plotting this difference, we can see where we would expect countries to be performing based on the model, compared with how they are actually performing.

• Residual Ranking: a table that presents the results of the residual analysis in order of residual value. The residual analysis can work as an evaluation tool in this sense by using residuals to determine an error score.

• Variance Explained: the result of this technique gives percentages corresponding to each independent variable indicating how much of the variance in the dependent variable can be explained by that specific independent variable.
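The regression, residual analysis, residual ranking and variance explained steps listed above can be sketched together. This is a minimal illustration using simple linear regression (one independent variable) fitted by ordinary least squares, written with the standard library only. The country labels and figures are hypothetical. With a single independent variable, the variance explained reduces to the usual R² of the model; splitting it between several variables, as described above, needs a multiple regression and is not shown.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: x is the independent variable, y the dependent variable.
countries = ["A", "B", "C", "D"]
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.5, 5.5, 8.0]

slope, intercept = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]

# Residual = observed value - predicted value.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Residual ranking: countries performing furthest above the model come first.
ranking = sorted(zip(countries, residuals), key=lambda pair: pair[1], reverse=True)

# Variance explained (R squared): share of the variation in y captured by the model.
mean_y = sum(y) / len(y)
ss_residual = sum(r ** 2 for r in residuals)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_residual / ss_total

print(f"y = {intercept:.2f} + {slope:.2f} * x, R squared = {r_squared:.3f}")
for country, residual in ranking:
    print(country, round(residual, 2))
```

A positive residual means a country performs better than the model predicts from its independent variable; a negative one means it performs worse.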

5. Future Work

This section focused on potential future work on ADIX.

It covered key issues to be considered in future data collection and analysis, and ways to address them:

• Inflation: account for percentage changes in such data by implementing an adjustment factor.

• Units: if changes in data collection cause a change in units, a conversion rate should be implemented to account for this.

• Obsolete Variables: over time, collection of certain variables may stop. Hence, a suitable replacement that is as close as possible to the original needs to be found.

Once multiple years of ADIX data have been collected, the idea of forecasting may seem attractive. Forecasting does not always mean predicting into the future. It can also mean predicting based on different circumstances or predicting over a period of time.

Three techniques for three different types of forecasting were discussed:

• Straight Line: the straight line method is the simplest form of forecasting. Its drawbacks are the assumptions that growth is constant and that growth over time follows a linear progression. This form of forecasting predicts into the future.

• Moving Average: the moving average takes past data into account. This method is used to smooth out the data and give information about the average over time.

• Regression: regression analysis is the most popular method of forecasting. Once a regression equation has been created, the independent variables can be varied to see the resulting value of the dependent variable. This form of forecasting predicts an outcome based on different circumstances.
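The first two techniques above can be sketched in a few lines. This is a minimal illustration with a hypothetical yearly series. The straight line forecast here extends the series by its average change per period, which is one simple reading of the method and an assumption of this sketch. The moving average uses a trailing window whose length is chosen by the analyst.

```python
def straight_line_forecast(series, periods):
    """Extend the series assuming the average change per period stays constant."""
    step = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + step * k for k in range(1, periods + 1)]


def moving_average(series, window):
    """Trailing moving average: each point is the mean of the last `window` values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical indicator values for four consecutive years.
history = [10.0, 12.0, 13.0, 16.0]

print(straight_line_forecast(history, 2))  # -> [18.0, 20.0]
print(moving_average(history, 2))          # -> [11.0, 12.5, 14.5]
```

The straight line forecast uses only the first and last observations, so it is sensitive to an unusual start or end year; the moving average smooths the history but does not by itself project forward.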

6. In Conclusion

• Data presentation and visualisation techniques must be attractive and clear. Be aware of the audience and of what the visualisation is meant to show.

• Cluster and principal component analysis are not essential, but they are useful additions that help to explain and understand the raw data.

• Variance testing is used to compare actual collected data against what is expected or predicted, and to provide reasons where a difference exists.

• Bear in mind changes between data collections and methods to address these.

• Forecasting can be a useful innovative activity once a significant amount of data has been collected.

7. Key questions raised in the workshop

Participants raised a large number of excellent questions. Key questions and an overview of responses are provided below. Thanks to all those that asked the questions.

7.1. Question 1. A number of data visualisation gurus dislike pie graphs. Your example showed some of their limitations. Is there ever a place for pie graphs?

Pie graphs often obscure differences in data presented. The image below from Selina’s presentation demonstrates the limitations.


7.2. Question 2. Data visualisation sometimes promotes the idea of causality rather than correlation. Are there ways to avoid this?

Correlation examines links between two or more data sets. Causality shows a causative link between two variables. It is always advisable to stress the difference between correlation and causality in presentations, i.e. the fact that a relationship is seen does not mean that one change causes the other.

7.3. Question 3. Most countries have to provide the UN, World Bank, ILO and other organisations with quite a lot of information. Is there somebody in your Ministry or organisation that co-ordinates this task, or is each request considered separately?

Respondents generally stated that their National Statistics Agency dealt with most information requests from international organisations. Telecommunications Ministries and organisations generally responded to telecoms and technology queries directly.

7.4. Question 4. What data visualisation software have you found most useful?

Discussions identified several software tools. Several can be used for free; others provide a short period (usually a week) for appraisal prior to purchase.

The most mentioned software tools were:

• Static visualisation software: Power BI, Spotfire, Tableau and Raw Graphics; also Illustrator for infographics, and template sites such as Canva and Venngage

• Interactive software: Datawrapper, Flourish

• Storytelling: Shorthand, Strikingly

• Dynamic charts: Highcharts, D3

7.5. Question 5. One problem often encountered is incomplete data and missing values. What methods do you use to try and find missing values?

Linear interpolation works well, but only if the data have a linear form, i.e. can be described by a single line of best fit.
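Linear interpolation, as mentioned in the answer above, fills a gap by drawing a straight line between the nearest known values on either side. The sketch below is a minimal illustration with a hypothetical series in which missing years are marked as None; gaps at the start or end of the series are left unfilled, because there is no value on one side to interpolate from.

```python
def interpolate_missing(values):
    """Fill None gaps lying between two known values by linear interpolation."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        change = filled[right] - filled[left]
        for i in range(left + 1, right):
            # Move along the straight line from the left value to the right value.
            filled[i] = filled[left] + change * (i - left) / span
    return filled


# Hypothetical yearly series with two missing years in the middle and one at the end.
series = [10.0, None, None, 16.0, None]
print(interpolate_missing(series))  # -> [10.0, 12.0, 14.0, 16.0, None]
```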

Amelia II is generally regarded as the best program for computing missing values; it works with Windows and runs in R, the statistical package. The software can be downloaded at https://gking.harvard.edu/amelia. We have found it is always worth ‘eyeballing’ the predicted missing values, because rogue or strange values can sometimes be created.

Survey response rates can be lower for certain groups, e.g. a group makes up 10% of the population but provides only 5% of the survey replies. In these cases the answers from the groups with lower response rates can be scaled up (weighted) so that the number or proportion of answers fits the relative size of those groups.
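The weighting described above can be sketched as follows, using the 10% and 5% figures from the example. Each group's weight is its share of the population divided by its share of the survey replies. The group names, the 20 replies and the function names are hypothetical and chosen only for illustration.

```python
def group_weights(population_share, sample_share):
    """Weight per group = share of the population / share of the survey replies."""
    return {
        group: population_share[group] / sample_share[group]
        for group in population_share
    }


def weighted_mean(responses, weights):
    """Mean of (group, value) responses, with each reply scaled by its group weight."""
    total_weight = sum(weights[group] for group, _ in responses)
    return sum(weights[group] * value for group, value in responses) / total_weight


# A group that is 10% of the population but gives only 5% of the replies.
population_share = {"small_group": 0.10, "rest": 0.90}
sample_share = {"small_group": 0.05, "rest": 0.95}
weights = group_weights(population_share, sample_share)

# Hypothetical survey of 20 replies: value 1.0 means "yes", 0.0 means "no".
responses = [("small_group", 1.0)] + [("rest", 0.0)] * 19

unweighted = sum(value for _, value in responses) / len(responses)
weighted = weighted_mean(responses, weights)
print(round(weights["small_group"], 2), round(unweighted, 2), round(weighted, 2))
```

The under-represented group receives a weight of 2, so its single "yes" counts for 10% of the weighted result instead of 5% of the raw replies.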

7.6. Question 6. What forecasting tools or methods have you found most helpful?

The most common methods are straight line or linear methods, but these assume a constant rate of growth and a linear progression in all future years.

Moving Average methods are simple to use and explain. Regression provides a useful mathematical way of predicting trends.

For non-linear cases, such as the adoption of technology by citizens and businesses, an S-shaped adoption curve is often more appropriate.
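One common way to draw an S-shaped adoption curve is the logistic function, which starts slowly, accelerates and then levels off at a ceiling. The sketch below assumes that form; the workshop did not specify a particular curve, and the ceiling, growth rate and midpoint values are hypothetical.

```python
import math


def logistic_adoption(t, ceiling, growth_rate, midpoint):
    """Adoption level at time t on a logistic (S-shaped) curve."""
    return ceiling / (1 + math.exp(-growth_rate * (t - midpoint)))


# Hypothetical parameters: adoption saturates at 100% and is half-way in year 5.
ceiling, growth_rate, midpoint = 100.0, 1.0, 5.0

for year in range(0, 11, 2):
    level = logistic_adoption(year, ceiling, growth_rate, midpoint)
    print(year, round(level, 1))
```

In practice the three parameters would be estimated from observed adoption data; a straight line fitted to the early years of such a curve would underestimate the middle years and overshoot the ceiling later.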

8. Workshop overview and conclusions

More than 40 people attended the workshop, and many have been kind enough to say how useful and topical it was to them. Key elements from participants’ feedback are provided below.

8.1. Most interesting things learnt

Participants were asked about the most interesting thing they learnt from the workshop. The most popular areas were:

• The variety of data visualisation methods that are available;

• The need to clarify differences between correlation and causality;

• The recommendations for data visualisation software.

8.2. Things learnt that will be most useful for participants in their jobs

• Data visualisation software tools. I look forward to trying them;

• The use of explanation-of-variance methods to understand why different things happen in different areas or countries.

8.3. Topics for further investigation

• Revising indices with new data (and maybe dropping some of the older data).

Selina Patel, Richard Potter and Paul Foley 9th June 2021
