
(1)

Computation of consumer spatial price indexes over time using Natural Language Processing and web scraping techniques

Laureti T.1, Benedetti I.1, Palumbo L.1, and Rose B.2

1 University of Tuscia, 2 Starsift LLC

(2)

Outline

Background and aims

Web-scraped data and price statistics

Current status

Developments for Spatial Price Index computation

Methodological approach and data set

NLP for product categorization

Web-scraped data from US multichannel supermarkets

Adding a time dimension: TiRPD models

Estimation results

Conclusion

New Techniques and Technologies for Statistics 2021

#NTTS2021

(3)

Background and aims

• The share of electronic sales to consumers in total turnover has greatly increased in most EU countries.

• The COVID-19 crisis has accelerated an expansion of e-commerce towards new firms, customers and types of products (OECD, 2020).

• A recent international survey showed that, following the pandemic, more than half of the respondents shop online more frequently than before (UNCTAD, 2020). Some of these changes in purchasing patterns will likely be of a long-term nature.

• Official statistics providers should not ignore this rich data source, especially in the price statistics domain.

(4)

Background and aims

Several EU countries have started to collect data from online retailers and use them for official CPI calculation.

To our knowledge, only a few studies have so far used web-scraped data to compile consumer spatial price indexes at the international level (Cavallo, 2017; Cavallo et al., 2018).

• In this context, web-scraped data may enable countries to construct regional or sub-national spatial price indexes (SPIs).

The aims of this paper are to:

• Explore the potential advantages of using web-scraped data for constructing sub-national SPIs.

• Present first results obtained using the TiRPD model and data from US multichannel supermarkets.


(5)

Web-scraped data in official price statistics

Scraping tools:

• Python API interfaces (Amadeus)
• Web browser automation (iMacros)
• R packages (rvest, RSelenium)
• Python Selenium browser automation

Product categories:

• Air tickets
• Air tickets, train tickets, consumer electronics
• Over 60 categories: clothing, air tickets, hotels, supermarkets, …
• Laptops, train tickets

Other NSIs leveraging web-scraping techniques for price statistics:

Statistics Netherlands, Statistics Austria, Statistics Slovenia, Statistics Luxembourg and Statistics Norway

• Eurostat (2020), Practical guidelines on web scraping for the HICP

(6)


Using Web Scraped Data for computing Spatial Price Indexes

The importance of spatial price comparisons within a country has been acknowledged in the literature over the last decades (Laureti and Rao, 2018).

Advantages of using web-scraped price data:

• Feasible alternative to scanner data (Chessa and Griffioen, 2019)
• Daily frequency
• Collecting additional metadata (e.g. product characteristics)
• Aligned with offline prices (Cavallo, 2017)

Drawbacks:

• Product matching (ID codes may not be available)
• No quantities sold (lack of weights)
• Anti-scraping measures implemented by webmasters

(7)


Methodological Approach and data set

Adding a time dimension

• Time-interaction-Region Product Dummy (TiRPD) model

Reconciling consumer price indexes across space and time (Aizcorbe and Aten, 2004; Dikanov, 2010) while using all the information available

Product classification and matching

• Natural Language Processing

Word and sentence vectorization for product classification and product matching

(8)


NLP for product classification

Inputs:

• Product categories (n > 700)
• Product names (n > 50k)
• ECOICOP L4 and L5

Each input is vectorized with a pre-trained word vector model*. Cosine distances between the resulting vectors are then used for categorization.

* Vector model details at https://spacy.io/models/en-starters#en_vectors_web_lg
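The categorization step can be illustrated with a minimal sketch. It assumes that every product name and every classification label has already been turned into a vector, and it assigns each product to the label at the smallest cosine distance. The three-dimensional vectors, the product names and the two labels below are invented for illustration only; the study relies on a pre-trained word vector model rather than hand-written vectors, and its exact matching rule is not described on the slide.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def categorize(item_vec, class_vecs):
    """Return the label whose vector has the smallest cosine distance."""
    return min(
        class_vecs,
        key=lambda label: 1.0 - cosine_similarity(item_vec, class_vecs[label]),
    )


# Hypothetical toy vectors (assumption): real vectors would come from a
# pre-trained word vector model and have hundreds of dimensions.
class_vecs = {
    "Fruit": [0.9, 0.1, 0.0],
    "Chocolate": [0.1, 0.9, 0.2],
}
product_vecs = {
    "gala apples 3 lb bag": [0.8, 0.2, 0.1],
    "dark chocolate bar 70%": [0.2, 0.8, 0.3],
}

assigned = {name: categorize(vec, class_vecs) for name, vec in product_vecs.items()}
```

Taking the nearest label is the simplest possible rule; a distance threshold for rejecting poor matches would be a natural extension but is not shown here.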

(9)


Methodological Approach and data set

US web-scraping dataset provided by Starsift LLC

• ~120 million data points
• 10.93% CPI-U coverage
• 4 retailer chains
• 11 US cities
• Daily prices from January 2017 to May 2018
• Monthly average prices computed as unweighted arithmetic means

Product groups:

• Food at home
• Alcoholic beverages (at home)
• Housekeeping supplies
• Medicinal drugs (non-prescription)
• Cigarettes & tobacco products
• Personal care products
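The aggregation from daily to monthly prices can be sketched as follows. This is only an illustration of an unweighted arithmetic mean per product, city and month; the record layout, the function name `monthly_average_prices` and the sample prices are assumptions, not the structure of the Starsift dataset.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average_prices(daily):
    """Unweighted mean price per (product, city, 'YYYY-MM').

    `daily` is an iterable of (product, city, day, price) tuples,
    where `day` is a datetime.date.
    """
    buckets = defaultdict(list)
    for product, city, day, price in daily:
        buckets[(product, city, day.strftime("%Y-%m"))].append(price)
    return {key: mean(prices) for key, prices in buckets.items()}


# Hypothetical sample records (assumption), not actual scraped prices.
daily = [
    ("apple_a", "Seattle", date(2017, 1, 30), 1.00),
    ("apple_a", "Seattle", date(2017, 1, 31), 1.50),
    ("apple_a", "Seattle", date(2017, 2, 1), 2.00),
]
monthly = monthly_average_prices(daily)
```

Because web-scraped data carry no quantities sold, every daily observation receives the same weight in this average.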

(10)


Methodology: Time-interaction-Region Product Dummy model

$$\ln p_{nrt} = \sum_{r=1}^{R} \sum_{t=1}^{T} \delta_{rt} D_{nr} T_{nt} + \sum_{n=1}^{N} b_{n} D^{*}_{rnt} + v_{nrt}$$

where, for each basic heading (BH):

• $p_{nrt}$ denotes the price of product n in area r at time t (n = 1, 2, …, N; r = 1, 2, …, R; t = 1, …, T);
• $D^{*}_{rnt}$ are dummies for product n in area r at time t;
• $D_{nr} T_{nt}$ are dummy variables for each combination of area and time period, with n = 1, …, N; r = 1, …, R and t = 1, …, T.

The intra-national SPI for city r at time t is given by

$$\exp\left(\delta_{rt} - \delta_{\text{Orlando},\, t=1}\right)$$
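As a rough illustration of how such a regression might be estimated, the sketch below regresses log prices on one dummy per city-period combination plus one dummy per product, then takes the exponential of the difference between each city-period coefficient and the coefficient of the base city in the base period. It is a simplification under stated assumptions: the product effect is modelled as a single dummy per product, the estimator is plain unweighted least squares via `numpy.linalg.lstsq`, and the function name `tirpd_spi` and the synthetic prices are invented for the example. It is not the authors' code.

```python
import numpy as np


def tirpd_spi(obs, base_city, base_period):
    """Estimate SPIs from (product, city, period, price) observations.

    Returns {(city, period): index}, equal to 1.0 for the base cell.
    """
    products = sorted({o[0] for o in obs})
    cells = sorted({(o[1], o[2]) for o in obs})  # city-period combinations
    p_idx = {p: i for i, p in enumerate(products)}
    c_idx = {c: i for i, c in enumerate(cells)}

    # Design matrix: city-period dummies first, then product dummies.
    X = np.zeros((len(obs), len(cells) + len(products)))
    y = np.empty(len(obs))
    for row, (prod, city, period, price) in enumerate(obs):
        X[row, c_idx[(city, period)]] = 1.0
        X[row, len(cells) + p_idx[prod]] = 1.0
        y[row] = np.log(price)

    # The two dummy sets are collinear, so lstsq returns the minimum-norm
    # solution; differences between city-period coefficients are unaffected.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    delta = coef[: len(cells)]
    base = delta[c_idx[(base_city, base_period)]]
    return {cell: float(np.exp(delta[c_idx[cell]] - base)) for cell in cells}


# Synthetic, noise-free log prices (assumption): level effect + product effect.
true_level = {
    ("Orlando", 1): 0.00, ("Orlando", 2): 0.02,
    ("Seattle", 1): 0.10, ("Seattle", 2): 0.15,
}
product_effect = {"apple_a": 0.5, "apple_b": 1.0, "apple_c": 0.2}
obs = [
    (n, r, t, float(np.exp(a + b)))
    for (r, t), a in true_level.items()
    for n, b in product_effect.items()
]
spi = tirpd_spi(obs, base_city="Orlando", base_period=1)
```

On these synthetic data the estimated indexes reproduce the level effects used to generate the prices. With real scraped data, products are not observed in every city and period, so the result depends on how well the product assortments overlap across cities.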

(11)

RESULTS: Product overlap across cities


[Figure: product overlap across cities; panels: Chocolate, Apples]

(12)

SPI estimation results - Apples


[Figure: SPI for apples by city (Boise, DC, Honolulu, Houston, LA, LV, Phoenix, Portland, Seattle, SF), monthly from Feb-17 to May-18; vertical axis from 70 to 130]

(13)

SPI estimation results - Chocolate


[Figure: SPI for chocolate by city (Boise, DC, Honolulu, Houston, LA, LV, Phoenix, Portland, Seattle, SF), monthly from Feb-17 to May-18; vertical axis from 70 to 120]

(14)

Concluding remarks and ongoing research

• Online data, together with TiRPD methods, could allow the computation of sub-national SPIs.

• Web-scraped data may reduce the data collection burden and supplement information on items.

  • More IT resources are required due to the huge amount of data obtained
  • Not all countries may have access to this type of data
  • Online data may cover a limited number of product categories

Further research is underway on:

• Web-scraping prices for other product aggregates and in other countries
• Testing the use of other multilateral methods


(15)

Thank you for your attention

Prof. Tiziana Laureti
Dr. Ilaria Benedetti
Luigi Palumbo
Brandon Rose

(16)

References

Virgillito, A., & Polidoro, F. (2019). Big Data Techniques for Supporting Official Statistics: The Use of Web Scraping for Collecting Price Data. In Web Services: Concepts, Methodologies, Tools, and Applications (pp. 728-744). IGI Global.

Eurostat (2020). Practical guidelines on web scraping for the HICP.

Cavallo, A., Diewert, W. E., Feenstra, R. C., Inklaar, R., & Timmer, M. P. (2018). Using online prices for measuring real consumption across countries. In AEA Papers and Proceedings (Vol. 108, pp. 483-87).

Cavallo, A. (2017). Are Online and Offline Prices Similar? Evidence from Multi-Channel Retailers. American Economic Review, 107, 283-303.

Gábor, K., Zargayouna, H., Tellier, I., Buscaldi, D., & Charnois, T. (2017). Exploring Vector Spaces for Semantic Relations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 1814-1823). Copenhagen, Denmark, September 7-11.

Laureti, T., & Rao, D. S. P. (2018). Measuring spatial price level differences within a country: Current status and future developments. Studies of Applied Economics, 36(1), 119-148.

Aizcorbe, A., & Aten, B. (2004). An Approach to Pooled Time and Space Comparisons. In SSHRC Conference on Index Number Theory and the Measurement of Prices and Productivity, Vancouver, Canada.


(17)

Methodological Approach and data set

• The US Bureau of Economic Analysis (BEA) publishes Regional Price Parities (RPPs) yearly using several data sources

• Base city for our study: Orlando
