Sequential Data Analysis: Issues With Sequential Data
Sequential Data Analysis
Issues With Sequential Data
Gilbert Ritschard
Alexis Gabadinho, Matthias Studer
Institute for Demographic and Life Course Studies, University of Geneva
and NCCR LIVES: Overcoming vulnerability, life course perspectives
http://mephisto.unige.ch/traminer
September - November, 2012
©G. Ritschard (2012), 1/44. Distributed under licence CC BY-NC-ND 3.0
Outline
1 Missing data and sequences of unequal lengths
2 Time alignment and time granularity
3 State codings
4 Weights
5 Data size
6 Conclusion
Missing data and sequences of unequal lengths
Coding the missing states
Missing data in sequences
Missing values in the expanded (STS) form of a sequence occur, for example, when:
Sequences do not start on the same date while using a calendar time axis;
The follow-up time is shorter for some individuals than for others, yielding sequences that do not end at the same position;
The observation at some positions is missing due to nonresponse, yielding internal gaps in the sequences.
How shall we handle missing values?
Handling may be different for each of the listed situations.
In case of different start times, maintain the starting missing values to preserve alignment across sequences, or possibly left-align sequences by switching to a process time axis.
In case of different end times, ending missing terms could just be ignored.
In case of information missing due to nonresponse, add an explicit ‘non-response’ state to the alphabet, or maintain missing values to preserve alignment.
Coding left, gaps and right missing states
To allow such differentiated treatments, TraMineR distinguishes left, in-between and right missing values.
Use the left, gaps and right arguments of seqdef() to specify how each of the missing types should be encoded.
By default, gaps and left-missing states are coded as NA, while all missing values encountered after the last valid (rightmost) state in a sequence are considered void elements (right="DEL"); i.e., the sequence is considered to end after the last valid state.
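The effect of these arguments can be checked on a small example. The following is a minimal sketch, assuming the TraMineR package is installed; the toy data frame raw and the object names s.default and s.keepright are hypothetical and serve only as illustration.

```r
library(TraMineR)

# Hypothetical toy data: three sequences over four positions, alphabet {A, B}.
raw <- data.frame(
  t1 = c(NA,  "A", "A"),   # sequence 1 starts late (left missing)
  t2 = c("A", NA,  "B"),   # sequence 2 has an internal gap
  t3 = c("B", "B", NA),    # sequence 3 ends early (right missing)
  t4 = c("B", "B", NA),
  stringsAsFactors = FALSE
)

# Default handling: left = NA, gaps = NA, right = "DEL".
s.default <- seqdef(raw, alphabet = c("A", "B"))

# Alternative: keep right-missing positions as explicit missing states.
s.keepright <- seqdef(raw, alphabet = c("A", "B"), right = NA)

# seqlength() does not count void elements, so only the default
# handling shortens sequence 3.
len.default   <- as.vector(seqlength(s.default))
len.keepright <- as.vector(seqlength(s.keepright))
print(len.default)
print(len.keepright)
```

Comparing the two length vectors shows which sequences are treated as ending early and which keep their full time span.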
Incomplete sequences
Incomplete sequences (sequences with missing states) are more the rule than the exception.
Unlike event history analysis (survival analysis), which can handle censored data, sequence analysis has no universal, elegant way of handling censored data.
Strategies in the presence of incomplete sequences
What can we do in the presence of incomplete sequences?
Delete all incomplete sequences.
Delete sequences with more than an acceptable number of missing states.
Consider the NA state as an element of the alphabet.
Impute some missing states. Assumptions that are not too restrictive often make it possible to guess the value of some missing states. For example, we can assume that people living with both parents at age 20 have lived with them since birth.
...
A mix of the previous solutions.
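The deletion and imputation strategies listed above can be sketched in base R. This is only an illustration under stated assumptions: the toy matrix states, the threshold max.na and the helper impute.left() are hypothetical, and the imputation rule generalizes the example above to "whoever lives with both parents at the first observed age did so before".

```r
# Hypothetical toy data: living arrangement at ages 16 to 20
# ("P" = with both parents, "A" = alone); NA marks a missing state.
states <- rbind(
  c(NA,  NA,  "P", "P", "P"),
  c("P", "P", "A", NA,  "A"),
  c(NA,  NA,  NA,  NA,  "A"),
  c("P", "P", "P", "A", "A")
)
colnames(states) <- paste0("age", 16:20)

n.missing <- rowSums(is.na(states))

# Delete all incomplete sequences.
complete <- states[n.missing == 0, , drop = FALSE]

# Delete sequences with more than an acceptable number of missing states
# (the threshold of 2 is an arbitrary choice for this sketch).
max.na <- 2
tolerated <- states[n.missing <= max.na, , drop = FALSE]

# Impute leading missing states when the first observed state is "P".
impute.left <- function(x) {
  first <- which(!is.na(x))[1]
  if (!is.na(first) && first > 1 && x[first] == "P") {
    x[seq_len(first - 1)] <- "P"
  }
  x
}
imputed <- t(apply(states, 1, impute.left))
print(imputed)
```

Only the first sequence is changed by the imputation; the internal gap of the second sequence and the long left-missing spell of the third remain, and would need one of the other strategies.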
Reliability of analysis with incomplete sequences
When states are missing at random, the global picture given by the sequences remains satisfactory whatever the handling strategy for the missing states.
Illustration: randomly turning states into NA in mvad
To illustrate, we randomly insert missing states into the mvad data:
1 Randomly select a proportion p of sequences to be modified.
2 In each selected sequence,
insert a random proportion < pG of gaps,
set as missing a random proportion < pL of states from the left,
set as missing a random proportion < pR of states from the right.
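The two steps above can be sketched in base R as follows. This is not the seqgen.missing() implementation from TraMineRextras, only a simplified illustration of the procedure; the toy matrix states and all object names are hypothetical, and gaps are drawn at any position, so some may coincide with the left or right missing spells.

```r
set.seed(1)  # for reproducibility of the sketch

# Hypothetical toy data: 10 sequences of length 12 over the alphabet {A, B, C}.
n <- 10
len <- 12
states <- matrix(sample(c("A", "B", "C"), n * len, replace = TRUE), nrow = n)

p  <- 0.6   # proportion of sequences to modify
pG <- 0.2   # upper bound on the proportion of gaps
pL <- 0.4   # upper bound on the proportion of left-missing states
pR <- 0.5   # upper bound on the proportion of right-missing states

with.na <- states
selected <- sample(n, size = round(p * n))   # step 1: pick the sequences
for (i in selected) {                        # step 2: insert missing states
  n.gaps  <- floor(runif(1, 0, pG) * len)
  n.left  <- floor(runif(1, 0, pL) * len)
  n.right <- floor(runif(1, 0, pR) * len)
  with.na[i, sample(len, n.gaps)] <- NA                  # random positions
  with.na[i, seq_len(n.left)] <- NA                      # from the left
  with.na[i, rev(seq_len(len))[seq_len(n.right)]] <- NA  # from the right
}
print(rowSums(is.na(with.na)))
```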
Randomly turning states into NA in mvad
For the next examples, we used p = .6, pG = .2, pL = .4, pR = .5.
Missing states were introduced with seqgen.missing(), from TraMineRextras:
R> mvadm.seq <- seqgen.missing(mvad.seq, p.cases = 0.6, p.left = 0.4,
p.gaps = 0.2, p.right = 0.5, mt.gaps = "nr", mt.right = "nr")
Rendering with and without missing states
[Figure: I-plot]
[Figure: d-plot, with.missing=TRUE]
[Figure: d-plot, with.missing=FALSE]
Time alignment and time granularity
Time alignment
A crucial point when analyzing state sequences is to choose a relevant time alignment:
Calendar date: same start date for each sequence.
Process time, i.e., time since an event of interest:
birth date (position defined by age),
date when starting to live with a partner, first childbirth, ...
start of first job, first unemployment month, immigration date, ...
Loading the srh data
We illustrate with sequences of self-reported health from the SHP (30% sample data in srh30.Rdata).
R> library(TraMineR)
R> library(TraMineRextras)
R> library(RColorBrewer)  # provides brewer.pal()
R> source(paste(scriptdir, "extractSeqFromW.R", sep = ""))
R> load(paste(datadir, "srh30.Rdata", sep = ""))
R> srh <- srh30
R> srh.shortlab <- c("B2", "B1", "M", "G1", "G2")
R> srh.longlab <- c("not well at all", "not very well", "so, so",
"well", "very well")
R> srh.alph <- c("not well at all", "not very well", "so, so (average)",
"well", "very well")
R> var <- getColumnIndex(srh, "P$$C01")
R> xtlab <- 1999:(1999 + length(var) - 1)
R> mycol5 <- brewer.pal(5, "RdYlGn")
R> srh.seq <- seqdef(srh[, var], right = NA, alphabet = srh.alph,
states = srh.shortlab, labels = srh.longlab, cnames = xtlab,
cpal = mycol5)
R> x <- apply(is.na(srh[, var]), 1, sum)
R> sel <- (x < seqlength(srh.seq) - 1)
R> srh <- srh[sel, ]
R> srh.seq <- srh.seq[sel, ]
Illustration: Self-reported health, SHP 1999/2010
Sequences aligned on calendar year
Changing alignment
Change the alignment with seqstart() from TraMineRextras:
R> startyear <- 1999
R> birthyear <- srh$BIRTHY
R> agesrh <- seqstart(srh[, var], data.start = startyear,
new.start = birthyear)
R> colnames(agesrh) <- 1:ncol(agesrh)
R> agesrh <- agesrh[, 10:90]
R> agesrh.seq <- seqdef(agesrh, alphabet = srh.alph,
states = srh.shortlab, labels = srh.longlab,
cpal = mycol5, right = NA, xtstep = 10)
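To make the re-alignment concrete, the following base R sketch moves calendar-aligned observations onto an age axis. It is not the seqstart() implementation; the toy data, the object names and the convention that column k holds the state at age k are assumptions made for this illustration.

```r
# Hypothetical toy data: states observed yearly from 1999 to 2002.
startyear <- 1999
obs <- rbind(
  c("G1", "G1", "M",  "G2"),
  c("M",  "M",  "G1", "G1")
)
birthyear <- c(1990, 1988)

# Age of each individual in the first observed year.
age.first <- startyear - birthyear
# Oldest age reached by anyone during the observation window.
max.age <- max(age.first) + ncol(obs) - 1

# Age-aligned matrix: column k holds the state at age k (assumed convention);
# ages outside the observation window stay NA.
by.age <- matrix(NA_character_, nrow = nrow(obs), ncol = max.age,
                 dimnames = list(NULL, as.character(seq_len(max.age))))
for (i in seq_len(nrow(obs))) {
  by.age[i, age.first[i] + seq_len(ncol(obs)) - 1] <- obs[i, ]
}
print(by.age[, 9:max.age])
```

After such a shift, most positions are missing on the left or on the right, which is why the left and right arguments of seqdef() matter for age-aligned sequences.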
Illustration: Self-reported health, SHP 1999/2010
Sequences aligned on age
Illustration: Self-reported health, SHP 1999/2010
Sequences aligned on age, with ignored right missing positions, right="DEL"
Illustration: Self-reported health, SHP 1999/2010
Focus on people born between 1930 and 1934
Time granularity
Time granularity: density of state positions within a given time length.
Defined by the duration of the unit of time used.
Examples: year, quarter, month, week, day, hour, ...
We can switch from a fine granularity to a coarser one, but cannot switch to a finer granularity than available in the data.
Change granularity with seqgranularity() from TraMineRextras:
R> mvadg.seq <- seqgranularity(mvad.seq, tspan = 12)
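What a coarser granularity does to the data can be sketched in base R. The example below is an assumption-laden illustration, not the seqgranularity() implementation: it keeps the first state of each span of tspan positions, which is only one possible aggregation rule, and the toy matrix monthly is hypothetical.

```r
# Hypothetical toy data: two monthly sequences covering 24 months.
monthly <- rbind(
  rep(c("school", "job"), times = c(14, 10)),
  rep(c("job", "unemployed", "job"), times = c(6, 6, 12))
)

tspan <- 12  # number of fine-grained positions merged into one

# Keep the state found at the first position of each span.
first.pos <- seq(1, ncol(monthly), by = tspan)
yearly <- monthly[, first.pos, drop = FALSE]
print(yearly)
```

In the second sequence the six-month unemployment spell disappears from the yearly version, which illustrates the information lost when moving to a coarser time unit.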
Changing time granularity of the mvad data
Monthly vs yearly states
State codings
State codings: What is the optimal alphabet size?
The larger the alphabet, the less clear the results.
As with time aggregation, we can also merge elements of the alphabet.
Merging is useful when different states reflect similar situations. For example, in mvad, the distinction between ‘further education’ (FE) and ‘school’ (SC) is not so clear. Merging those categories improves readability of the outcomes.
Avoid merging dissimilar states. Do not hide useful distinctions such as ‘Full time’ and ‘Part time’.
Merging two states
Merging ‘Further education’ with ‘School’ in mvad
R> mvadr.seq <- seqrecode(mvad.seq, recodes = list(FS = c("FE", "SC")))
R> seqdplot(mvadr.seq, group = mvad$gcse5eq, border = NA)
Weights
Weights serve to improve sample representativeness.
Weights are also useful for reducing the sequence data size by retaining only unique sequences; each weight then reflects the number of cases sharing the same unique sequence.
In any case, when weights are present, they should be accounted for.
In TraMineR, declare them with the weights= argument of seqdef().
When assigned to the state sequence object, weights are automatically accounted for in produced plots, distributions, statistics, ...
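Reducing the data to unique sequences with frequency weights can be sketched in base R. The toy data frame states and the object names below are hypothetical; the sketch only shows how the weights are obtained.

```r
# Hypothetical toy data: six cases, some sharing the same sequence.
states <- data.frame(
  t1 = c("A", "A", "B", "A", "B", "A"),
  t2 = c("A", "B", "B", "A", "B", "A"),
  t3 = c("B", "B", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Collapse each sequence into one string and count identical ones.
key <- do.call(paste, c(states, sep = "-"))
counts <- table(key)

# Keep one row per distinct sequence, with its frequency as weight.
first.occurrence <- !duplicated(key)
unique.states <- states[first.occurrence, ]
unique.weights <- as.vector(counts[key[first.occurrence]])
print(cbind(unique.states, weight = unique.weights))
```

The reduced data and its weights could then be passed to seqdef() through the weights= argument, so that plots and statistics still reflect all six cases.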
Results may be quite different
R> layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), heights = c(2,
1.3))
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, weighted = FALSE,
title = "Non-weighted")
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, title = "Weighted")
R> seqlegend(mvad.seq, ncol = 2, position = "top")
Which weights to use with panel data?
Each wave of a panel survey usually includes two weights:
a cross-sectional (transversal) weight (representativeness of the current population),
a longitudinal weight (representativeness of the initial population), which applies to full trajectories.
Which weights should be used for incomplete trajectories? For sequences over a subinterval of time?
There is no evident solution.
Weights lose their meaning when cases are filtered out!
In SHP there are weights for cases for Sample I (1999) and for Sample I+II (2004). See
http://www.swisspanel.ch/IMG/pdf/User_guide_E_short.pdf
Data size
Data size, scalability
Three types of size limitations:
Number of sequences: no problem up to about 10 000. The main problem is the matrix of pairwise dissimilarities!
Sequence length: no problem up to a few hundred (∼300). In some functions, the default limit of 100 should be increased.
Size of alphabet: not too big a problem for computation, but rendering becomes difficult with more than, say, 20 elements. Default colors are available only for |A| ≤ 12.
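The weight of the pairwise dissimilarity matrix can be estimated with simple arithmetic. The sketch below assumes that each dissimilarity is stored as an 8-byte double; actual memory use during computation may be higher.

```r
# Storage needed for pairwise dissimilarities between n sequences,
# assuming 8 bytes per value (double precision).
n <- 10000
n.pairs <- n * (n - 1) / 2     # number of distinct pairs
full.mb <- n^2 * 8 / 1e6       # full n x n matrix, in megabytes
half.mb <- n.pairs * 8 / 1e6   # one triangle only, in megabytes
print(c(pairs = n.pairs, full.mb = full.mb, half.mb = half.mb))
```

Since the number of pairs grows with the square of n, doubling the number of sequences roughly quadruples the storage.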
Size limitations: What can we do?
For number of sequences: work on a representative sample of the sequences.
For sequence length: change time granularity, or split the position (time) scale and work on subintervals.
For size of alphabet: merge elements of the alphabet.
Conclusion
Many issues in sequence analysis.
Solutions necessitate trade-offs:
Losing sequences (cases) vs allowing for missing states
Losing sequences (cases) vs restricting time coverage
...
Holistic view provided by sequence analysis.
Cost: cannot account for the most recent cohorts with yearly data.
For example, studying the life course until age 45 with the SHP biographical survey of 2002 means, if we want only complete trajectories, that the youngest people included were born in 1957.
The finer the granularity, the less constrained we are.