Testing the stability of regression parameters when some additional data sets are available

(1)

TESTING

THE

STABILITY OF REGRESSION PARAMETERS

WHEN

SOME ADDITIONAL

DATA

SETS

ARE AVAILABLE

R. RADHAKRISHNAN DONR. ROBINSON

Department

of

Manasement

andQuantitativeMethods

IllinomState University Normal,Illinois61761

(Received June 2, 1992 and in revised form September 8, 1992)

ABSTRACT. We consider the problem of testing the stability of regression parameters in

regressionlines ofdifferentpopulationswhen some additional, but unidentified, datasetsfrom those populationsare available. Thestandardtest

(To)

discardstheadditionaldataand teststhe stability,oftheregressionparametersusing only the data setsfrom identified Iopulations. We proposetwo testprocedures

(T1

and

T2)

utilizingalltheavailabledata,becausethe additional datamaycontain informationabout theparameters oftheregressionlines whicharetested for

stability. Apower comparisonamongthetestsisalsopresented. Itis shownthat

T1

alwayshas larger power than

To.

Incertain situations

T2

hasthelargestpower.

KEY WORDS AND PHRASES: leastsquares estimates,regression parameters,power of the

test.

1992AMSSUBJECT CLASSIFICATION CODES: 62J05, 62J99

1. INTRODUCTION. Consider theregression model

Yij ai +

#i(xij

)

+ij, i=1,2 k, j= 1,2,...,ni,

(1.1)

wherethe_{Yij are observations on the}response_{variable, the xij are}observationsonthepredictor

variable, a _{and #i are} the regression parameters, and the _ij are the error terms, which are

unobserved random variables. Itisassumed that the errorsareindependent,normallydistributed random variableswithmean0and commonunknown varianceo

2.

Forthemodel,_ai +

#i(xij"

)

isthe regressionlineofthe variableyon the predictorvariable xfor the th group,a isthe y-interceptwhenx

x,

and_#i is the slope. Suppose we have m

(m

_<

k)

additional data sets

correspondingtomregressionlineswhose modelisgiven by

Yij =ai

+i(xij

)+ij,

i=k+l k+m, j=1,2 hi.

(1.2)

Weassumethat the error termsil in model

(1.2)

are independent,normallydistributedrandom

variables with mean 0 and common unknown varianceo

2.

It is further assumed that the m

(2)

However,we cannotidentifythe mregression lines associated withthe additionaldata sets

(Yii,

xij, k+ k+m, 1,2 ni). Weare interestedintestingthe nullhypothesis

Ho: ,

a2

--...

Ok;

1

2=... #k against

Ha:

either_’i

*

_ai’ or_i

*

_i’ forat least onepair(i,i’), wherei, i’, i,i’=1,2 k, utilizingall the available data. The nullhypothesis impliesthat all thek regression

lines in model (1.1)arecoincidentwhereas the alternative hypothesisis that at leasttwoof the

regressionlinesare different. The standardtest

(To)

ofH0againstHausingthe k datasets

(Yij,

xij, i=1,2 k, j= 1,2 ni) is well-known in the literature and has diverse applications. A biostatistician maybe interested intestingthe equivalence of regression linesfor predictingthe

systolicblood pressure using ageas thepredictorvariable for four social groups. A testfor the

stability oftheregression parameters thatgenerated the datasets is_Ho"

’

₂ ₃ 4;fl #2 #3 #4. IfH₀ is true, we use a single regression line based upon the four data setsfor

predicting systolicbloodpressure using ageasthepredictor variable,Klienbaumand

Kupper

[1]. An economist mightbe interested in testingthe equivalence ofmultiple regression models for predictingthegrossdomesticproduct usinglabor andcapitalaspredictorvariables fordifferent timeperiods,Maddala[2].

Inthispaperweconsidertwotests

(T1

and

T2)

utilizingall the available data and make a

power comparisonbetween thesetwo testsand thestandardtestwhich is basedsolelyon the k datasetsrelatingtothe regressionlineswhose parametersaretestedfor stability. InSection2we determineleast squaresestimatesof the regression parameterstoobtaintheteststatisticsfor the problem. The noncentrality parameter of thetests is derived in Section 3. InSection 4 we derive ourproposedtests,

T

andT2. Weillustrate andcomparethepowerof all threetestsin Section

5.

2. LEASTSQUARESESTIMATES. Consider the sum

k+m n

0

i=l

E

j=IE

(Yij

"i’’i(xij

.))2,

(2.1)

where

i

and/i

are theestimatesof regression parametersa

and/i

(i=1,2

k+m).

The least squaresestimates of the regression parametersare obtainedby differentiating

₀

partiallywith

respectto

i

and/i

andthensolving the resulting normal equations for

i

and

B’i.

Itcanbe shown

that the leastsquaresestimatesofa

and/i

aregivenby

ni

ai^

j=lX:

Yij/hi

(2.2)

and

ni ni

/0’

I

lYiJ

(xij

)/1

(xij

)2.

j= j=l

(2.3)

Then

R

2,

the unconditional error sumof squares,is obtainedbysubstituting

i

and/’i

givenby

(2.2)

and

(2.3)

into0. Itcanbe shown that

k+m n k+m

I2 2

(Yij

.-)2

/i

2

Si

2

(2.4)

R2

i=l j=l

]=1

(3)

ni

Si

2 ]2

(Xij

)2,

_1,2 k+m.

j=l

(2.5)

The conditional errorsumofsquares underH0isobtainedbyminimizing

k n k+rn n

1

i=lr

j=lr"

(Yij--/(xij

.]))2

+i=k+r,

1

j=lI

(Yij-ti-/’i(xij

.))2

(2.6)

withrespectto

,

’, ti, and/i

(i k+ k+ m).

The second sum on the right-hand side of

(2.6)

is minimized with respect to

i

and

i

(i=k+1

k+m)

where

and/i

aredefined, respectively, asin equations

(2.2)

and

(2.3).

The

leastsquaresestimates oftheregression parametersa and # aregiven by

k n

ly

(2.7)

a r, ₁

ij/n=y

i=lj=

k n k

/ r, r,

Yij

(xij

)/i

r,

Si

2

(2.8)

i=l j=l "=1

where

k

n= Iz n_i.

(2.9)

i=l

The conditional error sum ofsquaresunderH0is

k n k k+rn n k+rn

R

r, r,

(Yij "y--)2./2

_I

Si

2 ₊ v,

(Yij

"’t)

2

V,/i

2

Si

2

i=1 j=l i=1 i=k+lj=l

i’=k+m

(2.10)

Thesumofsquares for testingthe nullhypothesisH₀is

SSH₀

R12

-I

k k k

I ni]-

y--)2

+

lI/’i

2

Si

2

-/2

Si

2

i=l i=l i=l,

k k

1

ni- y--)2

+ I

Si

2

i’)2,

i=l i=l

where

k k

/

Si

2/i/

Si2.

i=l i=l

Itiswell-knowninthe literature that

1

2/0

2_{is distributed}_as_chi-square_with

k+m

n’--n+ r _ni-2(k+m) i=k+l

(2.11)

(2.12)

(4)

degrees of freedom and

SSHo/o

2 _is _{distributed as}_noncentral _chi-square_with _2(k-l) _degrees _of

freedom. WhenH0is true,

SSH0/o

2isdistributedaschi-squarewith2(k-1)degrees offreedom.

Further,

l

2andSSH₀areindependent;forexampleseeKshirsagar[3].

3. NONCENTRALITY PARAMETER. Herewederivetheexpectedvalueof SSH0under the

non-nullcase. Itcanbe shownthat

k k k k

r.

ni(-y-)2

r

ni(ai--)2 ₊ _r. _ni(7--)2 ₊ ₂ _r.

ni(ai--)(7--),

i=1 i=1 i=1 i=1

(3.1)

where

k

-=

r.

niai/n

i=l

(3.2)

k

r

,ij/ni,

i=l

(3.3)

k

-=

E

ni/n.

i=l

(3.4)

Now

and

where

Taking expectations ofboth sides of

(3.1)

we obtain

k k

E(.r,

niC-y-)

2)

r, _ni(ai--)2 _{+ (k-1)a}

2.

1=1 i=1

k k

E( x:

i

2

Si

2

Si

2

(o2/Si

2

+/i2

i=l i=l

k ko2+ 1 _#i2

Si

2

i=l

k k k

E(/2

p.

Si

2)

Si

2(a2/

Si

2

+2),

i=l i=l i=l

k

=02

+-2

Si

2,

i=l

k k

=

Si23i/

Si2.

i=1 i=l

(3.5)

(3.6)

(3.7)

(3.8)

Using equations

(3.5), (3.6), (3.7),

and

(2.11)

weget

k k

E(SSH0)

2(k-1)o

2 ₊ r, _ni(ai--)2 ₊ r,

Si2(Bi--)

2

(5)

Since

SSHo/o

2 _is _{distributed as} _{noncentral chi-square} _with _2(k-l) _degrees _of_freedom, _it

follows from(3.9)that thenoncentralityparameterisgiven by

k k

,x

(1/02)[

12 ni(a

.-)2

+ r,

Si

2_(fli

.-)2

_].

(3.10)

i=1 i=1

4. TEST PROCEDURES. The standardprocedurefor testingHoagaimtHais an

F-test

based upontheteststatistic

F0

(SSHo/2(k-1))/(R0

2/(n-2k)),

(4.1)

where

k n k

I2 I2

(Yij

.)2

I2

$i2Si

2

(4.2)

RO2

i=lj=l i=l

See, for example, Kshirsagar [3]. The above test rejects the null hypothesis H0 if F0 >

F,,2(k.),,.

and acceptsH₀otherwise,where

F,f

_,f

2istheupper100a percentile point ofthe

F-distribution with

fl

numerator degrees of freedom

(ndf)

and

f2

denominator degrees of freedom

(ddf).

Wenote that the standardtestis based upon the k datasets

(Yi],

_xi], i=1,2 k,

1,2 ni)anddiscardstheadditionaldata

(Yi],

_xi], k+1 k+ m, 1,2 ni).

Considerthefollowingtestprocedure

(T1).

RejctH0if

F1

(SSHo/2(k-1))/(R0/n

’)

>

Wa,2(k.1),n,

(4.3)

and acceptH₀otherwise. Acomparison betweenTOandT1shows thatbothhavethe same ndf

but thatthe latterhaslarger ddf thanT_0. Wefurthernotethat

T1

isbaseduponall theavailable data. Under the non-null case,both test statistics havenoncentralF-distributions with the same

noncentrality parameter, asin

(3.10).

Therefore

F1

willhavelargerpowerthanF0,Graybill[4].

When the mregressionlines in

(1.2)

are anunidentifiedsubset ofthe kregressionlines in the model (1.1),testing Ho againstHais equivalenttotesting

I-

thek+m regression lines are

identicalagaimt

I-1:

atleasttwoof them are different.

Following the procedureoutlined in Section2,itcanbe shown that thesum ofsquares for testing

I-

is

where

k+m k+m

SSI-I

I

hi(8

.&,,)2

+ I

Si ./,)2,

(4.4)

i=l i=l

k+m k+m

a

.

ni

r, ni,

(4.5)

i=1 =1

k+m k+m

/’

i=

1Si2/i/i’=ll

Si

2

(4.6)

(6)

where

and

k+m k+m

,X’=

(1/o2)[

E ni(ai-a)2 + :E

Si2(fli-3)2l,

(4.7)

i=l i=l

k+m k+m

a I2

niai/

Z ni,

(4.8)

i=l i=l

k+m k+m

/ E:

Si2/i/i

I2

Si

2

(4.9)

i=l =1

We notethat

SSI-I

can beobtainedfrom SSH0by replacingk with k+m. When

I-

is true,the

samplingdistributionof

SSHo

2_is_chi-square_with_{2(k + m-l) degrees}_{of freedom. Further,}

SSI-I

and

R

2_are_independent.

Weuse anF-test

(T2)

totest

I-

against basedupontheteststatistic

F2

(SSIf’l)/(l2/n’),

(4.10)

where 2(k + m-l). Wereject

H

ifF2 >

F,,fl,

n, and accept

I

otherwise. When

I

is true, thesamplingdistributionofF2isnoncentralFwith ndfandn’ddfandnoncentrality parameter

a’. Weusenoncentral F-distribution tables tocomputethepowerof thetests. The next section illustrates andcomparesthepowerof thesethreetests.

5. POWER COMPARISONSOF

THE

TESTS. WhenH₀and

I-

are nottruetheteststatistics

(4.1),

(4.3), and

(4.10)

follow noncentralF-distributions. The non-null distributions ofthe test statistics

F0

and F₁ have the same noncentrality parameter,

a,

defined in

(3.10).

The noncentrality parameter for the non-nulldistributionof

F2

isx’ asdefined in

(4.7).

Thendffor bothT_O and

T1

is

f

2(k-l). For T2thendf is 2(k+ m-l). To hasddf

f2

n-2k, while the

ddf for

Tx

andT₂isn’as defined in

(2.13).

Tables 1, 2, and 3illustratethepowers ofT_Oandourproposedtests,

T1

andT_2.

We

chose

,

0.05 and situations involving k 4 regressionlines. Thenumberof datasets consideredfrom unidentified populationsism, where 1_< m_<k. Forsimplicityweuseequal samplesizes(ni

10)

for thek identifiedpopulations and equal samplesizes

(n)

for the munidentified populations. Fromourearlier notation

n

_nk+ (i 1

m).

Tables 1, 2,and3 differ in themagnitudeof_ni.

Inthe tableswe denote the noncentrality parameter for the poweroftest T as_i. The power ofeachtestisafunction

ofai

andthe relevantdegrees offreedom. Asindicatedabove,

a0

"

1. Form k,each,x is aspecific value. Form< k,,x0and,x are

(the same)

specific values,

but,x₂ _varies _depending_uponwhichunidentified populationsproducethem datasets. Forthis

reason wecalculatethetests’powers for selectedsetsof kregressionlines andvalues of

Si

2 The

differences between the parameters of these lines together with

Si

2 anda 2 affect ,x_i. The parameters of the k lines,

Si

2 and

2

were chosentoproducethe three valuesindicatedfor,x_0,_so

that thepower ofT_O isabout .25, .5, and.75. IfT_O hasverysmallpower, then additionaldata provide verylittleimprovement. Conversely,when the powerofT_O isvery large, thereis little

need for improvementwithadditional data.

Examinationsof the tablesproducethe followingobservations. The powers of

T

O arethe

(7)

the powerofT isalwaysgreater than the powerofTO consistent with Graybill’s conclusion [4]

thatforagivenndf the powerofthe test increases as the ddf increases. Also, ineach table the

power of

T1

increases asmincreases, becausethe ndfremainsat2(k-l)while theddfincreasesby

n-2.

Likewise, for each value ofm the power ofT increases from Table through Table 3 becausethe ddf increases asaresult of the

n

increasingfrom5to7andfinallyto10.

The powerofT2 isheavily influencedbythe choice ofregression lines for the additional

data whenm < k. Ineachtable thepowerofT2 doesnotconsistently increase as m increases.

The increasesin,x₂andddfare sometimesoffset bythe increase in of2m. Foreach valueofm thepowerofT2generallyincreases from Table through Table3because of the same increase in

ddfasfor

T,

Butas Table 1indicates,forsmall

n

relative to n thepower of

T2

may belower thanthe power ofT0,andisseldom much betterthanthepowerofT1. InTable 2 when the

n

approaches ni in size, improvements in the power ofT2 overthe power of

T

are noticeable.

Table 3 indicates that whenthe

n

equalni,

T2

issuperiortothepower of

T1

except occasionally

forsmall m.

6. APPLICATIONS. Usingadditionaldatafromunidentifiedpopulations improvesthepowerof thetestforstabilityof theparametersinkregressionlines. Theonly requirementisthat theerror termsof the regressionlines from allpopulations have a common variance. Thepower ofour

proposedtest,T1,isalwaysgreaterthan thepower of the standardtest,T0. Ifm, the number of

datasetsfromunidentified populations, isclosetokandifthe

n

are near theni,then T2 can

producealargerincrease inthepowerthanT1. Ifmissmall or if

n

issmallrelative to ni,then

T1

may beabetter choice thanT2.

noncentrality parameters Powerofthetests

m A0,A1 )2

TO

T1

T2

4 4.454 6.681 .2502 .2622 .2409

4 9.085 13.627 .5016 .5252 .5069

4 14.866 22.299 .7500 .7750 .7744

3 4.454 5.707to 6.440 .2502 .2598 .2233to.2518

3 9.085 12.012to12.770 .5016 .5205 .4803to.5105

3 14.866 19.191to21.353 .7500 .7701 .7299to.7858

2 4.454 4.895to 6.240 .2502 .2570 .2115to.2684

2 9.085 10.648to12.055 .5016 .5151 .4645to.5243

2 14.866 16.601to20.564 .7500 .7645 .6944to.8050

1 4.454 4.650to 5.248 .2502 .2539 .2256to.2536

1 9.085 9.784 to10.405 .5016 .5089 .4738to.5029

1 14.866 15.637to17.399 .7500 .7579 .7139to.7687

Table1

(8)

m

noncentralityparameters Powerof thetests

A0,

A1

"2

TO

T1

T2

4 4.454 7.572 .2502 .2674 .2836

4 9.085 15.444 .5016 .5353 .5914

4 14.866 25.273 .7500 .7851 .8522

3 4.454 6.178to 7.228 .2502 .2643 .2481to.2913

3 9.085 13.127to14.217 .5016 .5294 .5388to.5810

3 14.866 20.826to23.919 .7500 .7792 .7874to.8529 2 4.454 5.071to 6.954 .2502 .2606 .2228 to.3057 2 9.085 11.287to 13.243 .5016 .5221 .5014to.5832

2 14.866 17.295to22.843 .7500 .7718 .7271to.8616 4.454 4.717to 5.519 .2502 .2560 .2310to.2692 1 9.085 10.022to10.854 .5016 .5131 .4900to .5287 1 14.866 15.900to 18.261 .7500 .7624 .7283to.7977

Table2

Powerof theF-testsfora =.05 k=4 n 10

n

=7

m

noncentralityparameters Powerof thetests

)0,’

"2

’ro

T1

T2

4 4.454 8.908 .2502 .2730 .3498

4 9.085 18.170 .5016 .5459 .7024

4 14.866 29.732 .7500 .7955 .9270

3 4.454 6.867to 8.404 .2502 .2695 .2851 to .3522 3 9.085 14.776to16.373 .5016 .5393 .6189to.6756

3 14.866 23.220to27.749 .7500 .7891 .8539to.9207

2 4.454 5.336to 8.026 .2502 .2650 .2395 to .3632 2 9.085 12.230to15.025 .5016 .5306 .5540to.6643 2 14.866 18.336to26.262 .7500 .7805 .7703to.9210

1 4.454 4.807to 5.883 .2502 .2589 .2383to.2909

1 9.085 10.343 to 11.461 .5016 .5188 .5119to.5633

1 14.866 16.254to19.425 .7500 .7683 .7470to.8330

[1]

Table3

Powerof theF-testsfor --.05 k=4 n 10

n

10

REFERENCES

KLIENBAUM,

D. G. and L. L.

KUPPER,

Applied Regression and Other Multivariable

Methods,Chapter13,PWS-Kent,Boston,MA, 1978.

[2]

MADDAI.A,

G.S.,IntroductiontoEconometrics,Chapter 4,MacMillian, NewYork,NY,

1988.

[3] KSHIRSAGAR,A.M., ACoursein LinearModels, Chapter 3,MarcelDekker,NewYork,

NY,

1983.