• No results found

Efficient and effective management of big databases in Stata

N/A
N/A
Protected

Academic year: 2021

Share "Efficient and effective management of big databases in Stata"

Copied!
14
0
0

Loading.... (view fulltext now)

Full text

(1)

Storage Importing Combining Reshaping Tips

Efficient and effective management of big

databases in Stata

Giovanni Marin1,2

1CERIS-CNR, Milano, Italy

2SEEDS Sustainability Environmental Economics and Dynamics Studies, Italy

2014 Italian Stata Users Group meeting Milano, November 2014

(2)

Storage Importing Combining Reshaping Tips

Outline of the presentation

Storage

Memory limits in Stata Variable formats

Importing

insheet, infix, import excel

Combining

Relational databases 1.0 Joining and merging databases

Reshaping

Long and wide formats Smart reshaping

Tips

Slides are complemented with practical exercises in

(3)

Storage Importing Combining Reshaping Tips Memory limits in Stata Variable formats

Memory limits in Stata

▸ Statauses the RAMto store data and other information to be used ▸ Up toStata 11the user was required todecide the amount of

memory to be allocated⇒ startingfrom Stata 12memory is

allocateddynamically (the commandset memoryis now obsolete)

▸ Memory limits in Stata

(4)

Storage Importing Combining Reshaping Tips Memory limits in Stata Variable formats

Variable formats

▸ Each variableisstoredin itsown formathelp format ▸ Numericformats for numbers are substantially moreefficientin

terms of memory usage⇒ destring

▸ Potential issue with strings even if only one (or few) observation

is a long string (e.g. 200 characters),all otherobservations are stored as 200 characters information⇒ beware with long string!

(5)

Storage Importing Combining Reshaping Tips insheet, infix, import excel

Importing data stored in specific formats

▸ I here describe some possible commands toimport datastored in

formats different from .dta

▸ The description isnot comprehensive File Import ...

insheet

▸ To import data stored with aseparator (tab, comma, semicolon,

space, etc)

▸ Many formats are allowed (.txt, .csv, .raw, .tsv, etc)

infix

▸ To import data stored infixed format(with dictionary or with

manual specification)

import excel

▸ Starting fromStata 12it is possible to import data fromMS Excel

files (both .xls and .xlsx)

▸ It is possible (and sometimes needed) to indicate thecell rangeand

(6)

Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases

(7)

Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases

The role of identifiers

▸ Tablesarejoinedby means of identifiers

▸ For example, if the dataset is a panelinlong format, theunique

identifier of each observation will be thecombinationof firm id

(e.g. the registry code) andyear(1:1identifiers)

▸ However, it could be the case that two tables should be joined by

attributing many observations for each identifier(m:1)⇒for

example time-invariant information on the firm for a panel of firms

▸ Finally, it could be useful to havem:m joins to generateall

possible pairwise matches between multiple identifiers

(8)

Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases

The command

merge

▸ merge is the basic command tojoin twoStata datasets

▸ It accommodates1:1,1:mandobservation-by-observationjoin ▸ The variable(s) that should be used as identifiersneed to have the

same variable name and need to be in thesame‘broad’ format

(e.g. either string or numeric) in both datasets

▸ Beware that if you have variables (excluding the identifiers) with the

same namein bothdatasets, thedefaultoptionkeeps the information recorded in the ‘master’ (in memory) dataset (also missing information!)

▸ Bydefault,mergekeeps all observationsfrom both the master

and the using dataset

(9)

Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases

The command

joinby

▸ joinbyis verysimilartomergebut:

▸ It accommodatesm:mmerge

(10)

Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases

The command

append

▸ appendsimply addsobservation at thebottomof the dataset in

memory

▸ Variables in the master and using dataset should have thesame

(11)

Storage Importing Combining Reshaping Tips Long and wide formats Smart reshaping

Examples of long and wide formats

Table : Long format id t var 1 1990 33 1 1991 11 2 1990 55 2 1991 77 3 1990 99 3 1991 91

Table : Wide format id var1990 var1991

1 33 11

2 55 77

(12)

Storage Importing Combining Reshaping Tips Long and wide formats Smart reshaping

The use of

reshape wide

and

reshape long

▸ The commandreshape wide transforms thelongformatintothe

wideformat

▸ The commandreshape long transforms thewideformatintothe

longformat

▸ Some tips

▸ If the ‘t’ variable is astring, remember to use theoptionstring ▸ When transforming thewideformatintothelongformat, remember

that all variables with thesame prefixshould be of thesame format and that all variablesnotincluded invarlistshould beconstant within ‘i’

▸ If your dataset has a wideformat but manymissingvalues, it could

be useful toreshapeitinto alongformat anddrop missing

observations

▸ You can always returnto theoriginalsetting by using fillinand

(13)

Storage Importing Combining Reshaping Tips

Some additional useful tips

▸ compresstooptimizevariables format

▸ Encodestrings or leave unused strings aside (but ready to be

re-joined if needed)

▸ Use the cyclesforeachor forvaluesto do repeated commands ▸ Compress andextractzipped datasets directly from Stata

(zipfileandunzipfile)

▸ fillinto rectangularize a dataset

▸ Usepreserveandrestoreto make temporary changesto

(14)

Storage Importing Combining Reshaping Tips

THANK YOU FOR YOUR ATTENTION

For questions, doubts, comments, please contact me at

[email protected]

References

Related documents

With this system as we know, the electric motor of a car becomes a generator when the brake pedal is applied. The kinetic energy of the car is used to generate

This is the predicament in which the two best Victorian women poets, Elizabeth Barrett Browning and Christina Rossetti, found themselves.. Th e Damsel, the Knight, and the

• Author cmdlets, providers and filters in Visual Studio • Create your own custom shell (typically for integration. with

AMVA needs to be parameterized with the service times of jobs at each queue s ir. As a problem of our traces, there was no information about the placement of threads available. We

5 As regards the construction service sector, calls for tender procedures have jointly been developed by those bidding and the public clients: fully electronic procedures make it

For example, reconsider the benchmarks on importing data into memory: Figure 6 shows that (using Stata) you can reduce the time to import one week of MiFID data by 73

 may or may not have column headings (If there are no column headings, Stata assigns variable names v1, v2, v3 , etc. If you have not saved the data in memory, it is

44108 9 BCV-ORG COMMUNITY CONTROL