Sequential Data Analysis: Issues With Sequential Data

Gilbert Ritschard

Alexis Gabadinho, Matthias Studer

Institute for Demographic and Life Course Studies, University of Geneva

and NCCR LIVES: Overcoming vulnerability, life course perspectives

http://mephisto.unige.ch/traminer

September - November, 2012

©G. Ritschard (2012), 1/44. Distributed under licence CC BY-NC-ND 3.0

Outline

1. Missing data and sequences of unequal lengths
2. Time alignment and time granularity
3. State codings
4. Weights
5. Data size
6. Conclusion

Missing data and sequences of unequal lengths

Coding the missing states

Missing data in sequences

Missing values in the expanded (STS) form of a sequence occur, for example, when:

- Sequences do not start on the same date while using a calendar time axis;
- The follow-up time is shorter for some individuals than for others, yielding sequences that do not end at the same position;
- The observation at some positions is missing due to nonresponse, yielding internal gaps in the sequences.

How shall we handle missing values?

Handling may differ for each of the situations listed above.

- In case of different start times,
  - keep the starting missing values to preserve alignment across sequences,
  - or possibly left-align sequences by switching to a process time axis.
- In case of different end times,
  - ending missing terms can simply be ignored.
- In case of information missing due to nonresponse,
  - add an explicit ‘non-response’ state to the alphabet,
  - or keep the missing values to preserve alignment.

Coding left, gaps and right missing states

To allow such differentiated treatments, TraMineR distinguishes left, in-between and right missing values.

- Use the left, gaps and right arguments of seqdef() to specify how each type of missing value should be encoded.
- By default, gaps and left-missing states are coded as NA, while all missing values encountered after the last valid (rightmost) state in a sequence are considered void elements (right="DEL"); i.e., the sequence is considered to end after the last valid state.
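
To make the three types concrete, the following base-R sketch labels each missing position of a single sequence as left, gap or right. It is an illustration only and not TraMineR's implementation; the function name classify_missing and the treatment of a sequence with no valid state are assumptions.

```r
# Label each missing position of a sequence as "left", "gap" or "right".
# Base-R illustration only; this is not TraMineR's internal code.
classify_missing <- function(x) {
  n <- length(x)
  kind <- rep(NA_character_, n)      # stays NA at positions holding a valid state
  valid <- which(!is.na(x))
  if (length(valid) == 0) {
    # No valid state at all: label everything "right" (an assumption).
    kind[] <- "right"
    return(kind)
  }
  first <- min(valid)                # position of the first valid state
  last <- max(valid)                 # position of the last valid state
  miss <- is.na(x)
  pos <- seq_len(n)
  kind[miss & pos < first] <- "left"
  kind[miss & pos > first & pos < last] <- "gap"
  kind[miss & pos > last] <- "right"
  kind
}

s <- c(NA, "A", NA, "B", "B", NA, NA)
classify_missing(s)
```

With the defaults described above, the first two kinds would be kept as NA and the right-missing positions would be treated as void.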

Incomplete sequences

- Incomplete sequences (sequences with missing states) are more the rule than the exception.
- Unlike event history analysis (survival analysis), which can handle censored data, sequence analysis has no universal, elegant way of handling censored data.

Strategies in the presence of incomplete sequences

What can we do in the presence of incomplete sequences?

- Delete all incomplete sequences.
- Delete sequences with more than an acceptable number of missing states.
- Consider the NA state as an element of the alphabet.
- Impute some missing states.
  - Assumptions that are not too restrictive often make it possible to guess the value of some missing states.
  - For example, we can assume that people living with both their parents at age 20 have lived with them since birth.
- ...
- A mix of the previous solutions.
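
The first two strategies can be sketched in base R on a toy matrix in STS form (one row per sequence, one column per position). The data and the threshold max_missing are hypothetical and chosen only for illustration.

```r
# Toy data in STS form: one row per sequence, one column per position.
seqs <- rbind(
  c("A", "A", "B", "B"),
  c(NA,  "A", NA,  "B"),
  c("A", NA,  NA,  NA),
  c("B", "B", "B", NA)
)

# Number of missing states in each sequence.
n_missing <- rowSums(is.na(seqs))

# Strategy 1: keep only the complete sequences.
complete_only <- seqs[n_missing == 0, , drop = FALSE]

# Strategy 2: keep sequences with at most max_missing missing states.
max_missing <- 2   # hypothetical threshold
tolerant <- seqs[n_missing <= max_missing, , drop = FALSE]
```

Here the first strategy retains a single sequence while the second retains three, which shows the trade-off between losing cases and tolerating missing states.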

Reliability of analysis with incomplete sequences

When states are missing at random, the global picture given by the sequences remains satisfactory whatever the handling strategy for the missing states.

Illustration: randomly turning states into NA in mvad

To illustrate, we randomly insert missing states into the mvad data:

1. Randomly select a proportion p of sequences to be modified.
2. In each selected sequence,
   - insert a random proportion < pG of gaps,
   - set as missing a random proportion < pL of states from the left,
   - set as missing a random proportion < pR of states from the right.
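
The two steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the code of any TraMineR function: the name insert_missing is hypothetical, proportions are drawn uniformly, and gap positions are sampled over the whole sequence, so they may overlap the left or right missing stretches.

```r
set.seed(1)  # reproducible illustration

insert_missing <- function(m, p, pG, pL, pR) {
  n <- ncol(m)
  # Step 1: pick the sequences (rows) to modify.
  chosen <- sample(nrow(m), size = round(p * nrow(m)))
  for (i in chosen) {
    # Step 2: draw how many positions to blank out for each type.
    n_left  <- floor(runif(1, 0, pL) * n)
    n_right <- floor(runif(1, 0, pR) * n)
    n_gaps  <- floor(runif(1, 0, pG) * n)
    if (n_left > 0)  m[i, seq_len(n_left)] <- NA
    if (n_right > 0) m[i, (n - n_right + 1):n] <- NA
    if (n_gaps > 0)  m[i, sample(n, n_gaps)] <- NA
  }
  m
}

# Toy data: 10 identical sequences of length 20.
toy <- matrix("A", nrow = 10, ncol = 20)
toy_na <- insert_missing(toy, p = 0.6, pG = 0.2, pL = 0.4, pR = 0.5)
```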

Randomly turning states into NA in mvad

For the next examples, we used p = .6, pG = .2, pL = .4, pR = .5.

Missing states were introduced with seqgen.missing(), from TraMineRextras:

R> mvadm.seq <- seqgen.missing(mvad.seq, p.cases = 0.6, p.left = 0.4,
       p.gaps = 0.2, p.right = 0.5, mt.gaps = "nr", mt.right = "nr")

Rendering with and without missing states

[Figure: I-plot]

[Figure: d-plot, with.missing=TRUE]

[Figure: d-plot, with.missing=FALSE]

Time alignment and time granularity

Time alignment

A crucial point when analyzing state sequences is to choose a relevant time alignment.

- Calendar date
  - Same start date for each sequence.
- Process time, i.e., time since an event of interest
  - birth date (position defined by age)
  - date when starting to live with a partner, first childbirth, ...
  - start of first job, first unemployment month, immigration date, ...

Loading the srh data

We illustrate with sequences of self-reported health from the SHP (30% sample data in srh30.Rdata).

R> library(TraMineR)       ## seqdef(), seqlength()
R> library(RColorBrewer)   ## brewer.pal()
R> source(paste(scriptdir, "extractSeqFromW.R", sep = ""))
R> load(paste(datadir, "srh30.Rdata", sep = ""))
R> srh <- srh30
R> srh.shortlab <- c("B2", "B1", "M", "G1", "G2")
R> srh.longlab <- c("not well at all", "not very well", "so, so",
       "well", "very well")
R> srh.alph <- c("not well at all", "not very well", "so, so (average)",
       "well", "very well")
R> ## getColumnIndex() is presumably defined in the sourced script
R> var <- getColumnIndex(srh, "P$$C01")
R> xtlab <- 1999:(1999 + length(var) - 1)
R> mycol5 <- brewer.pal(5, "RdYlGn")
R> ## right = NA keeps right-missing states instead of treating them as void
R> srh.seq <- seqdef(srh[, var], right = NA, alphabet = srh.alph,
       states = srh.shortlab, labels = srh.longlab, cnames = xtlab,
       cpal = mycol5)
R> ## keep only cases with more than one non-missing state
R> x <- apply(is.na(srh[, var]), 1, sum)
R> sel <- (x < seqlength(srh.seq) - 1)
R> srh <- srh[sel, ]
R> srh.seq <- srh.seq[sel, ]

Illustration: Self-reported health, SHP 1999/2010

[Figure: sequences aligned on calendar year]

Changing alignment

Alignment can be changed with seqstart() from TraMineRextras.

R> startyear <- 1999
R> birthyear <- srh$BIRTHY
R> agesrh <- seqstart(srh[, var], data.start = startyear,
       new.start = birthyear)
R> colnames(agesrh) <- 1:ncol(agesrh)
R> agesrh <- agesrh[, 10:90]
R> agesrh.seq <- seqdef(agesrh, alphabet = srh.alph,
       states = srh.shortlab, labels = srh.longlab,
       cpal = mycol5, right = NA, xtstep = 10)
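
The idea behind switching from calendar time to age can be sketched in base R: each observation is moved to the column that corresponds to the person's age in that year. The function realign_by_age and the toy data are hypothetical, and age is indexed from 0 here, which need not match the indexing used by seqstart().

```r
# Re-index calendar-year observations by age (process time = time since birth).
realign_by_age <- function(m, data_start, birth_year, max_age) {
  years <- data_start + seq_len(ncol(m)) - 1       # calendar year of each column
  out <- matrix(NA_character_, nrow = nrow(m), ncol = max_age + 1,
                dimnames = list(NULL, as.character(0:max_age)))
  for (i in seq_len(nrow(m))) {
    age <- years - birth_year[i]                   # age at each observation
    ok <- age >= 0 & age <= max_age                # keep ages inside the new axis
    out[i, age[ok] + 1] <- m[i, ok]
  }
  out
}

# Two toy respondents observed in 1999, 2000 and 2001.
obs <- rbind(c("G1", "G1", "M"),
             c("M",  "B1", "B1"))
aligned <- realign_by_age(obs, data_start = 1999,
                          birth_year = c(1997, 1999), max_age = 5)
```

After realignment the two respondents no longer share the same observed positions, which is why age-aligned sequences typically contain many left and right missing states.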

Illustration: Self-reported health, SHP 1999/2010

[Figure: sequences aligned on age]

[Figure: sequences aligned on age, with ignored right missing positions, right="DEL"]

[Figure: focus on people born between 1930 and 1934]

Time granularity

- Time granularity: density of state positions within a given time length.
  - defined by the duration of the unit of time used
  - examples: year, quarter, month, week, day, hour, ...
- We can switch from a fine granularity to a coarser one.
- But we cannot switch to a finer granularity than that available in the data.
- Change granularity with seqgranularity() from TraMineRextras:

R> mvadg.seq <- seqgranularity(mvad.seq, tspan = 12)
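
As a rough illustration of what coarsening involves, the base-R sketch below collapses blocks of tspan consecutive positions into one. Summarizing each block by its last state is an assumption made for this example (the first or the most frequent state would be other possible rules), and the function coarsen is hypothetical.

```r
# Collapse blocks of `tspan` consecutive positions into a single position,
# keeping the last state of each block (one possible aggregation rule).
coarsen <- function(x, tspan) {
  n_blocks <- length(x) %/% tspan     # an incomplete trailing block is dropped
  ends <- seq_len(n_blocks) * tspan   # index of the last position of each block
  x[ends]
}

# 24 monthly states: 10 months SC, then 8 months FE, then 6 months EM.
monthly <- rep(c("SC", "FE", "EM"), times = c(10, 8, 6))
yearly <- coarsen(monthly, tspan = 12)
```

Note that the initial school spell disappears from the yearly sequence, which illustrates the information lost when moving to a coarser granularity.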

Changing time granularity of the mvad data

[Figure: monthly vs yearly states]

State codings

State codings: What is the optimal alphabet size?

- The larger the alphabet, the less clear the results.
- Similarly to time aggregation, we can also merge together elements of the alphabet.
- Merging is useful when different states reflect similar situations.
  - For example, in mvad, the distinction between ‘further education’ (FE) and ‘school’ (SC) is not so clear.
  - Merging those categories improves the readability of the outcomes.
- Avoid merging dissimilar states.
  - Do not hide useful distinctions such as ‘Full time’ and ‘Part time’.

Merging two states

Merging ‘Further education’ with ‘School’ in mvad

R> mvadr.seq <- seqrecode(mvad.seq, recodes = list(FS = c("FE", "SC")))
R> seqdplot(mvadr.seq, group = mvad$gcse5eq, border = NA)

[Figure: merging ‘Further education’ with ‘School’ in mvad]

Weights

- Weights serve to improve sample representativeness.
- Weights are also useful for reducing the sequence data size by retaining only unique sequences.
  - The weight then reflects the number of cases sharing the same unique sequence.
- In any case, when weights are present, they should be accounted for.
  - In TraMineR, use the weights= argument of seqdef().
  - When assigned to the state sequence object, weights are automatically accounted for in the plots, distributions, statistics, ... produced.
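
The reduction to unique sequences can be sketched in base R as follows. The toy data are hypothetical, and the "-" separator used to build the keys assumes that no state label contains that character.

```r
# Toy data in STS form: five cases, two distinct sequences.
seqs <- rbind(
  c("A", "A", "B"),
  c("A", "B", "B"),
  c("A", "A", "B"),
  c("A", "A", "B"),
  c("A", "B", "B")
)

key <- apply(seqs, 1, paste, collapse = "-")   # one string per sequence
counts <- table(key)                           # number of cases per distinct sequence
first <- !duplicated(key)                      # first occurrence of each sequence
unique_seqs <- seqs[first, , drop = FALSE]
weights <- as.vector(counts[key[first]])       # weights in the order of unique_seqs
```

The reduced data could then be passed to seqdef() together with the weights= argument mentioned above, so that results still reflect all five cases.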

Results may be quite different

R> layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), heights = c(2, 1.3))
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, weighted = FALSE,
       title = "Non Weighted")
R> seqdplot(mvad.seq, border = NA, withlegend = FALSE, title = "Weighted")
R> seqlegend(mvad.seq, ncol = 2, position = "top")

Which weights to use with panel data?

- Each wave of a panel survey usually includes two weights:
  - a transversal (cross-sectional) weight (representativeness of the current population),
  - a longitudinal weight (representativeness of the initial population), which applies to full trajectories.
- Which weights should be used for incomplete trajectories? For sequences over a subinterval of time?
  - There is no evident solution.
  - Weights lose their meaning when cases are filtered out!
- In the SHP there are weights for cases for Sample I (1999) and for Sample I+II (2004). See http://www.swisspanel.ch/IMG/pdf/User_guide_E_short.pdf

Data size

Data size, scalability

Three types of size limitations:

- Number of sequences: no problem up to about 10 000.
  - The main problem is the matrix of pairwise dissimilarities!
- Sequence length: no problem up to a few hundred (300).
  - In some functions, the default limit of 100 should be increased.
- Size of the alphabet: not too big a problem for computation, but rendering becomes difficult with more than, say, 20 elements.
  - Default colors are available only for |A| ≤ 12.
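
A quick calculation shows why the pairwise dissimilarity matrix is the bottleneck for the number of sequences. The sketch assumes distances stored as 8-byte doubles, which is how R stores numeric values; actual memory use during computation may be higher.

```r
n <- 10000                     # number of sequences
n_pairs <- n * (n - 1) / 2     # distinct pairs of sequences
bytes_full <- n^2 * 8          # full n x n matrix of doubles
bytes_half <- n_pairs * 8      # one triangle only, as in a "dist" object

c(pairs = n_pairs, full_MB = bytes_full / 1e6, half_MB = bytes_half / 1e6)
```

Because the number of pairs grows quadratically, doubling the number of sequences roughly quadruples the storage needed.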

Size limitations: What can we do?

- For the number of sequences:
  - Work on a representative sample of the sequences.
- For sequence length:
  - Change the time granularity.
  - Split the position (time) scale and work on subintervals.
- For the size of the alphabet:
  - Merge elements of the alphabet.

Conclusion

- Many issues in sequence analysis
- Solutions necessitate trade-offs
  - Losing sequences (cases) vs allowing for missing states
  - Losing sequences (cases) vs restricting time coverage
  - ...
- Holistic view provided by sequence analysis
  - Cost: cannot account for the most recent cohorts with yearly data.
  - For example, studying the life course up to age 45 with the SHP biographical survey of 2002 means, if we want only complete trajectories, that the youngest people included were born in 1957.
  - The finer the granularity, the less constrained we are.

Thank you!

Questions?

See you next week.
