Sequential Data Analysis: Issues With Sequential Data

Sequential Data Analysis

Issues With Sequential Data

Gilbert Ritschard

Alexis Gabadinho, Matthias Studer

Institute for Demographic and Life Course Studies, University of Geneva

and NCCR LIVES: Overcoming vulnerability, life course perspectives

http://mephisto.unige.ch/traminer

September - November, 2012

©G. Ritschard (2012), 1/44. Distributed under licence CC BY-NC-ND 3.0

Outline

1 Missing data and sequences of unequal lengths

2 Time alignment and time granularity

3 State codings

4 Weights

5 Data size

6 Conclusion

Missing data and sequences of unequal lengths

Coding the missing states

Missing data in sequences

Missing values in the expanded (STS) form of a sequence occur, for example, when:

  Sequences do not start on the same date while using a calendar time axis;
  The follow-up time is shorter for some individuals than for others, yielding sequences that do not end at the same position;
  The observation at some positions is missing due to nonresponse, yielding internal gaps in the sequences.
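
The three situations listed above can be made concrete with a small base R sketch. The toy data and the helper classify_missing() are hypothetical and are not part of TraMineR; treating a fully missing sequence as "left" missing is simply a convention chosen here.

```r
# Toy STS data: one row per sequence, one column per time position.
sts <- rbind(
  s1 = c(NA,  NA,  "A", "A", "B", "B"),   # starts late: left missing
  s2 = c("A", "A", "B", NA,  NA,  NA),    # shorter follow-up: right missing
  s3 = c("A", NA,  "A", "B", NA,  "B")    # nonresponse: internal gaps
)

# Hypothetical helper: label every position of one sequence as
# "valid", "left", "gap" or "right".
classify_missing <- function(x) {
  valid <- which(!is.na(x))
  out <- rep("valid", length(x))
  if (length(valid) == 0) {   # nothing observed: called "left" by convention
    out[] <- "left"
    return(out)
  }
  pos <- seq_along(x)
  out[is.na(x) & pos < min(valid)] <- "left"
  out[is.na(x) & pos > max(valid)] <- "right"
  out[is.na(x) & pos > min(valid) & pos < max(valid)] <- "gap"
  out
}

# One row of labels per sequence.
missing_type <- t(apply(sts, 1, classify_missing))
print(missing_type)
```

Each row of missing_type mirrors the corresponding row of sts, which makes it easy to count how many missing values of each kind a data set contains before deciding how to treat them.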

How shall we handle missing values?

Handling may be different for each of the listed situations.

In case of different start times,
  maintain the starting missing values to preserve alignment across sequences,
  or possibly left-align sequences by switching to a process time axis.

In case of different end times,
  ending missing terms could just be ignored.

In case of information missing due to nonresponse,
  add an explicit ‘non-response’ state to the alphabet;
  or maintain missing values to preserve alignment.

Coding left, gaps and right missing states

To allow such differentiated treatments, TraMineR distinguishes left, in-between and right missing values.

Use the left, gaps and right arguments of seqdef() to specify how each of the missing types should be encoded.

By default, gaps and left-missing states are coded as NA, while all missing values encountered after the last valid (rightmost) state in a sequence are considered void elements (right="DEL"); i.e., the sequence is considered to end after the last valid state.
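
To see what such differentiated encodings do to a single sequence, here is a minimal base R sketch. The function encode_missing() is hypothetical: it only mimics the idea of separate left, gaps and right rules (with "DEL" meaning "drop these positions") and is not TraMineR's implementation of seqdef().

```r
# Hypothetical helper (not TraMineR code): recode one sequence with
# separate rules for left, gap and right missing values.
# A rule is either a replacement code or "DEL" to drop those positions.
encode_missing <- function(x, left = NA, gaps = NA, right = "DEL") {
  valid <- which(!is.na(x))
  if (length(valid) == 0) return(x)   # nothing observed: leave untouched
  pos <- seq_along(x)
  type <- rep("valid", length(x))
  type[is.na(x) & pos < min(valid)] <- "left"
  type[is.na(x) & pos > max(valid)] <- "right"
  type[is.na(x) & type == "valid"] <- "gaps"   # remaining NAs are internal
  rules <- list(left = left, gaps = gaps, right = right)
  keep <- rep(TRUE, length(x))
  for (k in names(rules)) {
    idx <- type == k
    if (identical(rules[[k]], "DEL")) {
      keep[idx] <- FALSE        # void positions are removed
    } else {
      x[idx] <- rules[[k]]      # otherwise recode them
    }
  }
  x[keep]
}

x <- c(NA, "A", NA, "B", NA, NA)
default_like <- encode_missing(x)                         # right missing dropped
all_kept     <- encode_missing(x, gaps = "*", right = NA) # gaps as explicit state
left_dropped <- encode_missing(x, left = "DEL", gaps = "*")
print(list(default_like, all_kept, left_dropped))
```

With the defaults chosen here, the sequence ends after its last valid state, which corresponds to the right="DEL" behaviour described above; the other two calls show the "explicit state" and "left-align" options.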

Incomplete sequences

Incomplete sequences (sequences with missing states) are more the rule than the exception.

Unlike event history analysis (survival analysis), which can handle censored data, sequence analysis has no universal, elegant way of handling censored data.

Strategies in the presence of incomplete sequences

What can we do in the presence of incomplete sequences?

Delete all incomplete sequences.
Delete sequences with more than an acceptable number of missing states.
Consider the NA state as an element of the alphabet.
Impute some missing states.
  Assumptions that are not too restrictive often make it possible to guess the value of some missing states.
  For example, we can assume that people living with both parents at age 20 have lived with them since birth.
  ...
Use a mix of the previous solutions.
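
The strategies above can be sketched in base R on a toy data set. The state codes ("P" for living with parents, "L" for having left), the threshold max_na and the helper impute_left() are all illustrative assumptions, not values or functions from the course material.

```r
# Toy STS data; "P" = living with parents, "L" = left the parental home.
sts <- rbind(
  s1 = c("P", "P", "P", "L", "L"),
  s2 = c(NA,  NA,  "P", "P", "L"),
  s3 = c("P", NA,  NA,  NA,  NA),
  s4 = c("P", NA,  "P", "L", NA)
)
n_missing <- rowSums(is.na(sts))

# Strategy 1: delete all incomplete sequences.
complete_only <- sts[n_missing == 0, , drop = FALSE]

# Strategy 2: delete sequences with more than max_na missing states.
max_na <- 2   # illustrative threshold
tolerant <- sts[n_missing <= max_na, , drop = FALSE]

# Strategy 3: treat "missing" as one more element of the alphabet.
with_na_state <- sts
with_na_state[is.na(with_na_state)] <- "*"

# Strategy 4: impute under an explicit assumption. Here: a person first
# observed in `state` is assumed to have been in it at all earlier positions.
impute_left <- function(x, state = "P") {
  valid <- which(!is.na(x))
  if (length(valid) > 0 && valid[1] > 1 && x[valid[1]] == state) {
    x[seq_len(valid[1] - 1)] <- state
  }
  x
}
imputed <- t(apply(sts, 1, impute_left))
print(imputed)
```

Comparing nrow(complete_only) with nrow(tolerant) shows directly how many cases each deletion rule costs, which is the trade-off to weigh against the assumptions needed for imputation.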

Reliability of analysis with incomplete sequences

When states are missing at random, the global picture given by the sequences remains satisfactory whatever the handling strategy for the missing states.

Illustration: randomly turning states into NA in mvad

To illustrate, we randomly insert missing states into the mvad data:

1 Randomly select a proportion p of sequences to be modified.
2 In each selected sequence,
    insert a random proportion < pG of gaps,
    set as missing a random proportion < pL of states from the left,
    set as missing a random proportion < pR of states from the right.

Randomly turning states into NA in mvad

For the next examples, we used p = .6, pG = .2, pL = .4, pR = .5.

Missing states were introduced with seqgen.missing(), from TraMineRextras:

R> mvadm.seq <- seqgen.missing(mvad.seq, p.cases = 0.6, p.left = 0.4,
       p.gaps = 0.2, p.right = 0.5, mt.gaps = "nr", mt.right = "nr")
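
For readers who want to follow the two-step procedure without TraMineRextras, here is a rough base R sketch on simulated data. It is not the code of seqgen.missing(): the function insert_missing(), the toy alphabet and the way gap positions are drawn (strictly inside the remaining observed span) are assumptions made for illustration.

```r
set.seed(1)   # reproducible draws

# Toy data: 10 sequences of length 12 over the alphabet {A, B, C}.
sts <- matrix(sample(c("A", "B", "C"), 120, replace = TRUE), nrow = 10)

# Hypothetical re-implementation of the procedure described in the slides.
insert_missing <- function(sts, p = 0.6, pG = 0.2, pL = 0.4, pR = 0.5) {
  n <- nrow(sts)
  len <- ncol(sts)
  # Step 1: select a proportion p of the sequences.
  selected <- sample.int(n, size = round(p * n))
  for (i in selected) {
    # Step 2: draw random proportions below each bound.
    n_left  <- floor(runif(1, 0, pL) * len)
    n_right <- floor(runif(1, 0, pR) * len)
    n_gaps  <- floor(runif(1, 0, pG) * len)
    sts[i, seq_len(n_left)] <- NA              # missing from the left
    sts[i, len + 1 - seq_len(n_right)] <- NA   # missing from the right
    # Gaps: positions strictly between the first and last observed state.
    observed <- which(!is.na(sts[i, ]))
    inner <- observed[-c(1, length(observed))]
    k <- min(n_gaps, length(inner))
    if (k > 0) sts[i, inner[sample.int(length(inner), k)]] <- NA
  }
  list(data = sts, selected = sort(selected))
}

res <- insert_missing(sts)
print(rowSums(is.na(res$data)))   # missing states per sequence
```

With the bounds used here (pL + pR < 1), every modified sequence keeps at least one valid state, so no sequence becomes entirely missing.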

Rendering with and without missing states

[Figure: I-plot]
[Figure: d-plot, with.missing=TRUE]
[Figure: d-plot, with.missing=FALSE]

Time alignment and time granularity

Time alignment

A crucial point when analyzing state sequences is to choose a relevant time alignment:

Calendar date
  Same start date for each sequence.
Process time, i.e., time since an event of interest
  birth date (position defined by age)
  date when starting to live with a partner, first childbirth, ...
  start of first job, first unemployment month, immigration date, ...

Loading the srh data

We illustrate with sequences of self-reported health from the SHP (30% sample data in srh30.Rdata).

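R> library(TraMineR)      ## seqdef(), seqlength()
R> library(RColorBrewer)  ## brewer.pal()
R> ## scriptdir and datadir must point to the folders holding the script and the data
R> source(paste(scriptdir, "extractSeqFromW.R", sep = ""))
R> load(paste(datadir, "srh30.Rdata", sep = ""))
R> srh <- srh30
R> srh.shortlab <- c("B2", "B1", "M", "G1", "G2")
R> srh.longlab <- c("not well at all", "not very well", "so, so",
       "well", "very well")
R> srh.alph <- c("not well at all", "not very well", "so, so (average)",
       "well", "very well")
R> ## columns of the yearly health variable (getColumnIndex() is not a TraMineR
R> ## function; it is presumably defined in the sourced script)
R> var <- getColumnIndex(srh, "P$$C01")
R> xtlab <- 1999:(1999 + length(var) - 1)
R> mycol5 <- brewer.pal(5, "RdYlGn")
R> srh.seq <- seqdef(srh[, var], right = NA, alphabet = srh.alph,
       states = srh.shortlab, labels = srh.longlab, cnames = xtlab,
       cpal = mycol5)
R> ## keep cases whose number of missing years is below sequence length minus 1
R> x <- apply(is.na(srh[, var]), 1, sum)
R> sel <- (x < seqlength(srh.seq) - 1)
R> srh <- srh[sel, ]
R> srh.seq <- srh.seq[sel, ]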

Illustration: Self-reported health, SHP 1999/2010

[Figure: sequences aligned on calendar year]

Changing alignment

Changing alignment with seqstart() from TraMineRextras.

R> startyear <- 1999
R> birthyear <- srh$BIRTHY
R> agesrh <- seqstart(srh[, var], data.start = startyear,
       new.start = birthyear)
R> colnames(agesrh) <- 1:ncol(agesrh)
R> agesrh <- agesrh[, 10:90]
R> agesrh.seq <- seqdef(agesrh, alphabet = srh.alph,
       states = srh.shortlab, labels = srh.longlab,
       cpal = mycol5, right = NA, xtstep = 10)
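
The idea behind switching from a calendar axis to an age axis can be sketched in base R on a tiny example. This is not the code of seqstart(): the toy data, the birth years and the convention "age = calendar year minus birth year" are assumptions made for illustration.

```r
# Toy calendar-aligned data: columns are survey years 1999..2002.
startyear <- 1999
cal <- rbind(
  a = c("G1", "G1", "M",  "G2"),
  b = c("M",  "G1", "G1", "G1"),
  c = c("B1", "M",  NA,   "G1")
)
colnames(cal) <- startyear + 0:3
birthyear <- c(a = 1980, b = 1982, c = 1979)   # hypothetical birth years

# Age in the first survey year, and the range of ages covered overall.
age_first <- startyear - birthyear
ages <- min(age_first):(max(age_first) + ncol(cal) - 1)

# Age-aligned table: each row is shifted so that columns index age.
aligned <- matrix(NA_character_, nrow(cal), length(ages),
                  dimnames = list(rownames(cal), as.character(ages)))
for (i in seq_len(nrow(cal))) {
  cols <- as.character(age_first[i] + seq_len(ncol(cal)) - 1)
  aligned[i, cols] <- cal[i, ]
}
print(aligned)
```

The shift creates many left and right missing values, since each person is observed only over a short age window; this is why the treatment of missing states matters so much after realignment.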

Illustration: Self-reported health, SHP 1999/2010

[Figure: sequences aligned on age]
[Figure: sequences aligned on age, with ignored right missing positions, right="DEL"]
[Figure: focus on people born between 1930 and 1934]

Time granularity

Time granularity: density of state positions within a given time length.
  Defined by the duration of the unit of time used.
  Examples: year, quarter, month, week, day, hour, ...

We can switch from a fine granularity to a coarser one,
but we cannot switch to a finer granularity than is available in the data.

Change granularity with seqgranularity() from TraMineRextras:

R> mvadg.seq <- seqgranularity(mvad.seq, tspan = 12)
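
What coarsening means can be illustrated with a base R sketch that aggregates 24 monthly states into 2 yearly states. The helper coarsen() and its three selection rules are assumptions of this sketch; they are not claimed to reproduce the defaults of seqgranularity().

```r
# One toy sequence of 24 monthly states.
monthly <- rep(c("SC", "FE", "EM"), times = c(10, 9, 5))

# Hypothetical helper: keep one state per block of `tspan` positions.
# Which state represents a block is a modelling choice.
coarsen <- function(x, tspan, pick = c("first", "last", "mostfreq")) {
  pick <- match.arg(pick)
  blocks <- split(x, ceiling(seq_along(x) / tspan))
  vapply(blocks, function(b) {
    switch(pick,
           first = b[1],
           last = b[length(b)],
           mostfreq = names(which.max(table(b))))
  }, character(1), USE.NAMES = FALSE)
}

yearly_first <- coarsen(monthly, tspan = 12, pick = "first")
yearly_last  <- coarsen(monthly, tspan = 12, pick = "last")
yearly_mode  <- coarsen(monthly, tspan = 12, pick = "mostfreq")
print(rbind(yearly_first, yearly_last, yearly_mode))
```

The three rules give different yearly sequences for the same monthly data, which shows that information is lost when coarsening and that the loss depends on the aggregation rule.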

Changing time granularity of the mvad data

[Figure: monthly vs yearly states]

State codings

State codings: What is the optimal alphabet size?

The larger the alphabet, the less clear the results.

Similarly to time aggregation, we can also merge together elements of the alphabet.

Useful when different states reflect similar situations.
  For example, in mvad the distinction between ‘further education’ (FE) and ‘school’ (SC) is not so clear.
  Merging those categories improves readability of the outcomes.

Avoid merging dissimilar states.
  Do not hide useful distinctions such as ‘Full time’ and ‘Part time’.

Merging two states

Merging ‘Further education’ with ‘School’ in mvad

R> mvadr.seq <- seqrecode(mvad.seq, recodes = list(FS = c("FE", "SC")))

R> seqdplot(mvadr.seq, group = mvad$gcse5eq, border = NA)
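
The effect of such a recode on the alphabet can be checked with a small base R sketch. The helper merge_states() and the toy sequence are hypothetical; the recodes list simply reuses the same shape as in the seqrecode() call above.

```r
# Toy sequence using mvad-like state codes.
x <- c("SC", "SC", "FE", "FE", "EM", "JL", "EM")

# Same structure as the `recodes` argument: new state = old states to merge.
recodes <- list(FS = c("FE", "SC"))

# Hypothetical helper: replace every old state by its new merged code.
merge_states <- function(x, recodes) {
  for (new in names(recodes)) {
    x[x %in% recodes[[new]]] <- new
  }
  x
}

merged <- merge_states(x, recodes)
alphabet_before <- sort(unique(x))
alphabet_after  <- sort(unique(merged))
print(merged)
```

Here the alphabet shrinks from four states to three, and every former FE or SC position now carries the merged code FS.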

[Figure: mvad after merging ‘Further education’ with ‘School’]

Weights

Weights serve to improve sample representativeness.

Weights are also useful for reducing the sequence data size by retaining only unique sequences.
  Each weight then reflects the number of cases sharing the same unique sequence.

In any case, when weights are present, they should be accounted for.
  In TraMineR, pass them with the weights= argument of seqdef().
  When assigned to the state sequence object, weights are automatically accounted for in produced plots, distributions, statistics, ...
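
Both uses of weights can be sketched in base R. The toy sequences and the weight vector w are invented for illustration; the sketch only shows the arithmetic and does not go through seqdef().

```r
# Toy STS data: 5 cases, 3 positions.
sts <- rbind(
  c("A", "A", "B"),
  c("A", "A", "B"),
  c("A", "B", "B"),
  c("A", "A", "B"),
  c("B", "B", "B")
)
w <- c(1, 1, 4, 1, 3)   # illustrative survey weights

# Use 1: state distribution at position 2, without and with weights.
unweighted <- prop.table(table(sts[, 2]))
weighted   <- prop.table(tapply(w, sts[, 2], sum))
print(rbind(unweighted, weighted))

# Use 2: keep unique sequences only; the weight of each unique sequence
# is the number of cases that share it.
key <- apply(sts, 1, paste, collapse = "-")
counts <- table(key)
unique_sts <- sts[!duplicated(key), , drop = FALSE]
print(counts)
```

In this toy example the share of state A at position 2 drops from 0.6 to 0.3 once the weights are applied, which illustrates why ignoring available weights can change the picture.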

Results may be quite different

R> ## two plots side by side on top, shared legend below
R> layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), heights = c(2, 1.3))
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, weighted = FALSE,
       title = "Non weighted")
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, title = "Weighted")
R> seqlegend(mvad.seq, ncol = 2, position = "top")

Which weights to use with panel data?

Each wave of a panel survey usually includes 2 weights:
  a transversal (cross-sectional) weight (representativeness of the current population)
  a longitudinal weight (representativeness of the initial population), which applies to full trajectories.

Which weights should be used for incomplete trajectories? For sequences over a subinterval of time?
  No obvious solution.
  Weights lose their meaning when cases are filtered out!

In SHP there are weights for cases for Sample I (1999) and for Sample I+II (2004). See
http://www.swisspanel.ch/IMG/pdf/User_guide_E_short.pdf

Data size

Data size, scalability

Three types of size limitations:

Number of sequences: no problem up to about 10 000.
  The main problem is the matrix of pairwise dissimilarities!
Sequence length: no problem up to a few hundred (∼300).
  In some functions, the default limit of 100 should be increased.
Size of alphabet: not too big a problem for computation, but rendering becomes difficult with more than, say, 20 elements.
  Default colors only for |A| ≤ 12.
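
Why the pairwise dissimilarity matrix is the bottleneck follows from simple arithmetic: n sequences give n(n - 1)/2 distinct pairs. The sketch below assumes each dissimilarity is stored as an 8-byte double (R's numeric type) and ignores any other overhead; the helper dist_memory_MB() is hypothetical.

```r
# Hypothetical helper: memory needed to store pairwise dissimilarities,
# assuming 8 bytes per value (double precision) and no other overhead.
dist_memory_MB <- function(n, bytes_per_value = 8) {
  pairs <- n * (n - 1) / 2                     # distinct pairs of sequences
  c(pairs = pairs,
    lower_triangle_MB = pairs * bytes_per_value / 1e6,
    full_matrix_MB = n^2 * bytes_per_value / 1e6)
}

size_10k  <- dist_memory_MB(10000)
size_100k <- dist_memory_MB(100000)
print(rbind(size_10k, size_100k))
```

Under these assumptions, 10 000 sequences need roughly 400 MB for the lower triangle (800 MB for a full square matrix), while 100 000 sequences would need about 40 GB: the cost grows quadratically with the number of sequences.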

Size limitations: What can we do?

For number of sequences:
  Work on a representative sample of the sequences.
For sequence length:
  Change time granularity.
  Split position (time) scale and work on subintervals.
For size of alphabet:
  Merge elements of the alphabet.

Conclusion

Many issues in sequence analysis

Solutions necessitate trade-offs
  Losing sequences (cases) vs allowing for missing states
  Losing sequences (cases) vs restricting time coverage
  ...

Holistic view provided by sequence analysis
  Cost: cannot account for most recent cohorts with yearly data.
  For example, studying the life course until age 45 with the SHP biographical survey of 2002 means, if we want only complete trajectories, that the youngest people retained were born in 1957 (2002 - 45).
  The finer the granularity, the less constrained we are.

Thank you!

Questions?

See you next week.