Storage Importing Combining Reshaping Tips
Efficient and effective management of big
databases in Stata
Giovanni Marin1,2
1CERIS-CNR, Milano, Italy
2SEEDS Sustainability Environmental Economics and Dynamics Studies, Italy
2014 Italian Stata Users Group meeting Milano, November 2014
Storage Importing Combining Reshaping Tips
Outline of the presentation
StorageMemory limits in Stata Variable formats
Importing
insheet, infix, import excel
Combining
Relational databases 1.0 Joining and merging databases
Reshaping
Long and wide formats Smart reshaping
Tips
Slides are complemented with practical exercises in
Storage Importing Combining Reshaping Tips Memory limits in Stata Variable formats
Memory limits in Stata
▸ Statauses the RAMto store data and other information to be used ▸ Up toStata 11the user was required todecide the amount of
memory to be allocated⇒ startingfrom Stata 12memory is
allocateddynamically (the commandset memoryis now obsolete)
▸ Memory limits in Stata
Storage Importing Combining Reshaping Tips Memory limits in Stata Variable formats
Variable formats
▸ Each variableisstoredin itsown format⇒help format ▸ Numericformats for numbers are substantially moreefficientin
terms of memory usage⇒ destring
▸ Potential issue with strings⇒ even if only one (or few) observation
is a long string (e.g. 200 characters),all otherobservations are stored as 200 characters information⇒ beware with long string!
Storage Importing Combining Reshaping Tips insheet, infix, import excel
Importing data stored in specific formats
▸ I here describe some possible commands toimport datastored in
formats different from .dta
▸ The description isnot comprehensive ⇒File → Import →...
insheet
▸ To import data stored with aseparator (tab, comma, semicolon,
space, etc)
▸ Many formats are allowed (.txt, .csv, .raw, .tsv, etc)
infix
▸ To import data stored infixed format(with dictionary or with
manual specification)
import excel
▸ Starting fromStata 12it is possible to import data fromMS Excel
files (both .xls and .xlsx)
▸ It is possible (and sometimes needed) to indicate thecell rangeand
Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases
Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases
The role of identifiers
▸ Tablesarejoinedby means of identifiers
▸ For example, if the dataset is a panelinlong format, theunique
identifier of each observation will be thecombinationof firm id
(e.g. the registry code) andyear(1:1identifiers)
▸ However, it could be the case that two tables should be joined by
attributing many observations for each identifier(m:1)⇒for
example time-invariant information on the firm for a panel of firms
▸ Finally, it could be useful to havem:m joins⇒ to generateall
possible pairwise matches between multiple identifiers
Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases
The command
merge
▸ merge is the basic command tojoin twoStata datasets
▸ It accommodates1:1,1:mandobservation-by-observationjoin ▸ The variable(s) that should be used as identifiersneed to have the
same variable name and need to be in thesame‘broad’ format
(e.g. either string or numeric) in both datasets
▸ Beware that if you have variables (excluding the identifiers) with the
same namein bothdatasets, thedefaultoptionkeeps the information recorded in the ‘master’ (in memory) dataset (also missing information!)
▸ Bydefault,mergekeeps all observationsfrom both the master
and the using dataset
Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases
The command
joinby
▸ joinbyis verysimilartomergebut:
▸ It accommodatesm:mmerge
Storage Importing Combining Reshaping Tips Relational databases 1.0 Joining and merging databases
The command
append
▸ appendsimply addsobservation at thebottomof the dataset in
memory
▸ Variables in the master and using dataset should have thesame
Storage Importing Combining Reshaping Tips Long and wide formats Smart reshaping
Examples of long and wide formats
Table : Long format id t var 1 1990 33 1 1991 11 2 1990 55 2 1991 77 3 1990 99 3 1991 91
Table : Wide format id var1990 var1991
1 33 11
2 55 77
Storage Importing Combining Reshaping Tips Long and wide formats Smart reshaping
The use of
reshape wide
and
reshape long
▸ The commandreshape wide transforms thelongformatintothe
wideformat
▸ The commandreshape long transforms thewideformatintothe
longformat
▸ Some tips
▸ If the ‘t’ variable is astring, remember to use theoptionstring ▸ When transforming thewideformatintothelongformat, remember
that all variables with thesame prefixshould be of thesame format and that all variablesnotincluded invarlistshould beconstant within ‘i’
▸ If your dataset has a wideformat but manymissingvalues, it could
be useful toreshapeitinto alongformat anddrop missing
observations
▸ You can always returnto theoriginalsetting by using fillinand
Storage Importing Combining Reshaping Tips
Some additional useful tips
▸ compresstooptimizevariables format
▸ Encodestrings or leave unused strings aside (but ready to be
re-joined if needed)
▸ Use the cyclesforeachor forvaluesto do repeated commands ▸ Compress andextractzipped datasets directly from Stata
(zipfileandunzipfile)
▸ fillinto rectangularize a dataset
▸ Usepreserveandrestoreto make temporary changesto
Storage Importing Combining Reshaping Tips