Modeling Count Data from Hawk Migrations
M.S. Plan B Project Report
January 12, 2011
Fengying Miao
M.S. Applied and Computational Mathematics Candidate
Dr. Ronald Regal
Advisor
University of Minnesota Duluth
Table of Contents
i. ACKNOWLEDGEMENTS
ii. ABSTRACT
1 INTRODUCTION
2 HAWK EXAMPLE
3 THE GENERAL AND GENERALIZED LINEAR MODELS
  3.1 General Linear Model
  3.2 Exponential Family of Distributions
  3.3 Generalized Linear Models
4 DO NOT LOG-TRANSFORM COUNT DATA
5 COMPARISON OF ESTIMATION METHODS USING THE DELTA METHOD
  5.1 Expected Values and Variances of Nonlinear Functions of Random Variables
  5.2 Single Mean
    5.2.1 Poisson for single mean
    5.2.2 Log-normal for Single Mean
  5.3 Two Means
    5.3.1 Poisson for two means
    5.3.2 Log-normal for two means
  5.4 Alternative Nonlinear Models
6 FURTHER COMPARISONS OF MODELS
  6.1 Single Mean
    6.1.1 Exact calculation for a single mean
    6.1.2 Specific example of comparing exact calculation and ...
  6.2 Two Means
    6.2.1 Exact calculation for difference between means
    6.2.2 Comparisons of Models for Two Means by Doing Simulation
    6.2.3 Regression
7 FITTING MODELS TO HAWK DATA
  7.1 Simple introduction to some potential variables
  7.2 Fitting Models
    7.2.1 Fit Mixed Model to Data
  7.3 Summary of Findings
8 CONCLUSION
9 REFERENCES
10 APPENDICES
  10.1 SAS Code
  10.2 R code
i. ACKNOWLEDGEMENTS
I would like to take this opportunity to give my sincere thanks to my advisor, Dr. Ronald Regal,
for his great support and guidance, which made it possible for me to finish the project. Dr.
Ronald Regal is the best advisor I have ever had. I will never forget his great help in my study
and life, or his advice that finding the limits of our knowledge and understanding is
always important.
I also want to give my thanks to Dr. Richard Green and Dr. Gerald Niemi for being on my
degree committee, reviewing my report and providing useful suggestions.
I also thank Heidi Seeland for providing the datasets in this project.
Thanks to Dr. Zhuangyi Liu for accepting me into this good program and letting me have the
ii. ABSTRACT
The General Linear Model (LM), with its assumptions of independence, linearity and equal variance,
underlies most statistical analyses. Because of its generality, many kinds of data are transformed
to satisfy its assumptions. Count data are often log-transformed using Ln(Y+1) to more
nearly match the assumptions. However, adding a value of one to counts might generate biases,
so we need to choose a proper model for count data. In addition, E(Ln(X)) is not the same as
Ln(E(X)), so even if the relationship is linear for Ln(E(X)), the same will not be exactly true for
E(Ln(X)). To avoid or reduce the bias from transforming data, the Generalized Linear Model
(GLM) and the nonlinear mixed (NLMIXED) model can be considered instead. This report
investigates how LM regression models, Generalized Linear Models based on Poisson and
negative binomial distributions, and approximate nonlinear models fit with NLMIXED
compare when estimating the slope of a linear trend in count data. The comparisons
are implemented in the statistical software SAS with the procedures PROC
REG, PROC GENMOD, PROC MIXED and PROC NLMIXED. A real data set from a hawk
migration is analyzed and fitted with a mixed model. The NLMIXED model is used to analyze
the variances and means of .
1 INTRODUCTION
A statistical model is used to predict the probabilistic future behavior of a system from data. The
main purpose of model building is to obtain estimates with small bias and little variability.
The traditional linear model (LM) has been widely used, since many data can be modeled this way and
a large body of theory is available. Transformations such as the square-root and log
transformations are often applied to the data, usually to the response variable,
to meet the assumptions of the LM. These methods might work well for continuous response
variables and for certain discrete variables, such as count data with few zero observations. Zero
observations, however, rule out a direct log transformation. For example, in a study where
migrating hawks are counted hourly, the numbers counted are often zero.
More and more methods and models have been explored to get beyond the limits of these
assumptions. The Generalized Linear Model (GLM), an extension of the LM, allows the analyst to
specify the distribution of the data, which addresses the problem of transforming data to be normally
distributed. The NLMIXED model, in which both fixed and random effects are allowed to have
nonlinear relationships with the response variable, has become increasingly popular and allows
flexible nonlinear functions as well as user-specified likelihood functions. These newer
models can be applied to a wider range of real problems. Statistical
software has kept pace with these numerical methods and made them more widely applicable.
To get the best estimates of the response variable for a particular system, it is important to fit
a model that describes the data well. In this report I describe my investigations into finding
appropriate models to analyze count data such as hawk migration counts. Model selection from
negative binomial distributions. Relative bias and relative RMSE are used to evaluate how well
the models work. The relationship between the variance and the mean of the real data is modeled using
the NLMIXED model, which cannot handle the complicated random effects in the real data we are going
to study. A mixed model fit with SAS PROC MIXED is used for the real hawk data to account for
2 HAWK EXAMPLE
A data set from monitoring of migrating hawks is used to illustrate the issues and conclusions in
this report. The data were collected in the fall of 2008 by Heidi Seeland and Anna Peterson, graduate
students at UMD with Dr. Gerald Niemi as their advisor. In this section, the structure of the hawk
data is described. Further details on fitting models to the data are discussed in later
chapters. The data set contains counts of hawks and eagles at three distances from the shore of
Lake Superior, over seven hours on certain days between August 29, 2008, and November 11,
2008. The sampling plan had eight sets of three sampling locations spread out along the north
shore of Lake Superior. The eight sets of three sampling sites were called transects and were
numbered from 1 to 8 up the shore starting from Duluth.
One general category of hawks is accipiters, which fly lower and closer to tree cover. To make
the data set more understandable, let's first introduce buteos, larger hawks with broad wings that
soar higher on wind currents. Figure 2 shows the average number of buteos per 7-hour
day plotted against date.
Fig. 2 Average Buteos Counts per day VS Dates in Original Scale
In this original scale, the daily buteo counts are widely scattered over time. The huge
variation makes the form of the time trend unclear. As shown in Figure 3, log transformation of
Fig. 3 Average Buteos Counts per day VS Dates in Log-Scale
The relationship between buteo counts and date follows a generally linear trend, except for the last two
points. For simplicity in applying the models in this project, which focus on estimating the slope of a
linear trend, we will leave out points after November 1 when demonstrating the fitting of some models
to these data.
3 THE GENERAL AND GENERALIZED LINEAR MODELS
Recent papers including Ohara and Kotze (2010) have advocated generalized linear models with
Poisson or negative binomial distributions rather than using normal linear models in the log scale.
Before comparing these models in a wider range of situations than considered by Ohara and
3.1 General Linear Model
Consider a situation where we are interested, for example, in describing the number of
tickets people get annually for violating traffic regulations as a function of their age. The average
number of violation tickets is predicted by the following equation
$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (3.1)$$
where $y$ is the response variable, Violation Tickets, $x$ is the explanatory variable, Age, and $\varepsilon$
measures the deviation of the measured $y$ from its expected value. It may now be asked whether,
after allowing for the effect of age, a person's sex has any influence on the frequency of violation
tickets. In that case, the appropriate model might be described as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \qquad (3.2)$$
where $x_1$ and $x_2$ represent Age and Sex, respectively.
Each time a new variable is introduced into the model, an additional parameter is
added. This process is an approach by which we find a mathematical description of the
structure in the values of the response variable. The two models discussed above involve a linear
combination of the parameters $\beta_0$, $\beta_1$, $\beta_2$ and are consequently known as linear models.
For example, a polynomial regression model belongs to this category despite the fact
that $y$ is a non-linear function of the explanatory variable $x$. The general form of linear models is
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i \qquad (3.3)$$
where $i = 1, \ldots, n$ and $\varepsilon_i$ represents the error that the explanatory variables cannot explain. In matrix notation,
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \qquad (3.4)$$
or in the following compact form
$$y = X\beta + \varepsilon \qquad (3.5)$$
Besides linearity, the usual general linear model also assumes normality, independence and equal
variance of observations, which can be written as
$$y = X\beta + \varepsilon, \quad \text{where } \varepsilon \sim N(0, \sigma^2 I). \qquad (3.6)$$
3.2 Exponential Family of Distributions
Linear models are postulated more often than non-linear ones because they are mathematically
easier to manipulate and usually easier to interpret. They appear to provide an adequate
description of many data sets. A wider class including normal distribution is called the
exponential family of distributions. Consider a single random variable Y whose probability
distribution depends on a single parameter θ. The distribution belongs to the exponential family
if it can be written in the form
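$$f(y;\theta) = \exp[a(y)\,b(\theta) + c(\theta) + d(y)] \qquad (3.7)$$
If $a(y) = y$, the distribution is said to be in canonical form, and $b(\theta)$ is sometimes called the
natural parameter of the distribution. The exponential family includes such useful distributions
as the binomial, Poisson, negative binomial,
and gamma distributions, in addition to the normal distribution.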
3.3 Generalized Linear Models
There are many types of data which might not be normally distributed in original scale. To
address this problem, a transformation may be used to normalize the data. Often, people deal
with the log-transformation first, before evaluating other transformation techniques. But discrete
response variables, such as bird count data, often contain many zero observations and are
unlikely to have a normally distributed error structure. Maindonald & Braun (2007) argued that
generalized linear models (GLMs) have largely removed the need for transforming count data,
and GLMs are now commonly used.
A GLM is an extension of the well-known linear model to response variables
that follow any probability distribution in the exponential family. The key idea is
that the relationship between the mean $\mu_i = E(Y_i)$ and a linear predictor $x_i^T\beta$ is specified by a link function:
$$g(\mu_i) = x_i^T\beta \qquad (3.8)$$
where $\mu_i = E(Y_i)$ and $g$ is a link function that links the random component, $E(Y_i)$, to the
systematic component $x_i^T\beta$. Equation (3.8) can be written as
$$\mu_i = g^{-1}(x_i^T\beta) \qquad (3.9)$$
For example, count data could be appropriately analyzed as a Poisson random variable within the
context of the Generalized Linear Model. So, for the observed bird count $Y_i$, we have
$Y_i \sim \text{Poisson}(\mu_i)$. The probability function for $Y_i$ is
$$P(Y_i = y) = \frac{\mu_i^{y} e^{-\mu_i}}{y!}, \quad y = 0, 1, 2, \ldots \qquad (3.10)$$
If we had a covariate $x$ for the predictor days, then
$$\log(\mu_i) = \beta_0 + \beta_1 x_i \qquad (3.11)$$
For the Poisson distribution, the mean and variance are equal. Real data do not always follow
this, and the variance ($\sigma^2$) is often much larger than the mean $\mu$. This so-called overdispersion
can be incorporated into a model in several ways. These all estimate the amount of extra
variation but make different assumptions about how this extra variation scales with the mean.
The negative binomial distribution, for example, assumes $\sigma^2 = \mu + \mu^2/k$ with an overdispersion
parameter $k$ and the mean $\mu$. The negative binomial distribution approaches the Poisson
distribution when $k$ is much bigger than $\mu$, i.e. as $k$ approaches infinity. To introduce the
negative binomial distribution in a simple way, we only use one variable here. Suppose
where . Then we can describe the probability function of negative
binomial distribution as follows:
The negative binomial is also a Gamma-Poisson Mixture. Suppose and .Then we can have the following procedures:
(3.13)
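To make the mean-variance relationship of the Gamma-Poisson mixture concrete, the following sketch works it out with the laws of total expectation and total variance. The symbols $\lambda$ for the Poisson rate and $k$ for the gamma shape are assumptions, since the original notation for (3.12) and (3.13) did not survive; the parameterization is one common choice and not necessarily the one used in this report.

```latex
% Assumed parameterization (illustrative):
%   Y | \lambda ~ Poisson(\lambda),   \lambda ~ Gamma(shape = k, scale = \mu / k)
\begin{align*}
E(\lambda) &= k \cdot \frac{\mu}{k} = \mu, \qquad
\operatorname{Var}(\lambda) = k \cdot \frac{\mu^2}{k^2} = \frac{\mu^2}{k} \\
E(Y) &= E[E(Y \mid \lambda)] = E(\lambda) = \mu \\
\operatorname{Var}(Y) &= E[\operatorname{Var}(Y \mid \lambda)] + \operatorname{Var}[E(Y \mid \lambda)]
  = \mu + \frac{\mu^2}{k}
\end{align*}
```

As $k \to \infty$ the extra term $\mu^2/k$ vanishes and the Poisson variance is recovered.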
4 DO NOT LOG-TRANSFORM COUNT DATA
Ohara and Kotze (2010) provide a detailed discussion in their paper Do Not Log-transform Count Data.
In that paper, they point out that log-transformation of counts has the additional quandary of
how to deal with zero observations. With just one zero observation (if this
observation represents a sampling unit), the whole data set is usually adjusted by adding a value
(usually 1, the lowest possible nonzero count) before transformation. They therefore advocate GLMs
to deal with count data.
They simulated data sets from a negative binomial distribution with different values of
that negative binomial distribution can be viewed as a gamma mixture of Poisson distributions. A low
dispersion parameter shrinks the graph of the gamma probability function of the Poisson rate, which
pulls the rate values to a smaller domain, thus
generating more clumped data. For each simulation, n=100 data points were simulated at each
of 20 mean values, µ = 1, 2, ..., 20. Five hundred replicate simulations were carried out for each
value of the dispersion parameter. Then they compared the outcome of fitting models that were transformed in various
ways (log, square root) with results from fitting models using overdispersed, quasi-Poisson
models and negative binomial models to untransformed count data. The simulations were
compared by calculating the mean bias and root mean-squared error in estimating log (µ).
In their results, the quasi-Poisson and negative binomial models behave similarly, having
negligible bias, whereas the models based on a normal distribution are all biased, particularly at
low means and high variances. The square-root transformation has a lower bias than any of the
log-transformations, unless the mean is low. Thus, they recommend that count data not be
transformed to be used in parametric tests. For such data, GLMs and their derivatives are more
appropriate.
However, their simulations were from negative binomial distributions. Poisson models
with extra-Poisson variation still model the variance as proportional to the mean, whereas
negative binomial models include a term in the variance proportional to the mean squared. In
many data sets, Ln(Y+1) is fairly normal. In all of the discussion from here on, Log and Ln are
interchangeable notations; in statistics, Log generally means Ln. For example, in SAS,
Log(y) means Ln(y). Fitting a linear relationship to Ln(Y+1) of the daily counts of buteos
This normal plot is reasonably straight, at least close enough for normal methods to work well
enough. In later sections where we fit models to hourly counts, the normal plot is even straighter.
The results from Ohara and Kotze are limited to 1) negative binomial data, 2) estimating a single
mean and 3) very large replication, n=100. The generalized linear models worked in their
simulations, but how will they work in estimating slopes of trends if the data are normal with
variances not like Poisson or negative binomial data?
5 COMPARISON OF ESTIMATION METHODS USING THE DELTA METHOD
5.1 Expected Values and Variances of Nonlinear Functions of Random Variables
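In the discussions below, I will use Taylor series approximations for approximating expected values
and variances of nonlinear functions. First, I describe these methods, commonly known as
Suppose we have a random variable $X$, and we know $E(X)$ and $\mathrm{Var}(X)$, but we are
interested in the mean and variance of $Y = g(X)$ for some function $g$. For example, we might be
able to measure $X$ and determine its mean and variance, but we are really interested in $Y$, which
is related to $X$ in a known way. If $g$ is linear, then this is pretty straightforward:
$$g(X) = a + bX \qquad (5.1.1)$$
$$E[g(X)] = a + b\,E(X) \qquad (5.1.2)$$
$$\mathrm{Var}[g(X)] = b^2\,\mathrm{Var}(X) \qquad (5.1.3)$$
However, in many cases $g$ is not linear. In many areas of mathematics we find approximations by
linearizing a nonlinear problem we cannot solve exactly. In probability and statistics, this
method is called propagation of error or the delta method.
Denote by $\mu_X$ the mean of $X$. We use a first-order Taylor series approximation around $\mu_X$:
$$g(X) \approx g(\mu_X) + g'(\mu_X)(X - \mu_X) \qquad (5.1.4)$$
Since $E(X - \mu_X) = 0$,
$$E[g(X)] \approx g(\mu_X) \qquad (5.1.5)$$
$$\mathrm{Var}[g(X)] \approx [g'(\mu_X)]^2\,\sigma_X^2 \qquad (5.1.6)$$
We have $E[g(X)] \approx g(\mu_X)$, but we know that in general $E[g(X)] \neq g(E[X])$ from Jensen's
Inequality. Thus, we can carry out the Taylor series expansion to the second order to get an
improved approximation of $E[g(X)]$.
$$g(X) \approx g(\mu_X) + g'(\mu_X)(X - \mu_X) + \tfrac{1}{2} g''(\mu_X)(X - \mu_X)^2 \qquad (5.1.7)$$
Taking the expectation of the right-hand side, we have
$$E[g(X)] \approx g(\mu_X) + \tfrac{1}{2} g''(\mu_X)\,E[(X - \mu_X)^2] \qquad (5.1.8)$$
$$= g(\mu_X) + \tfrac{1}{2} g''(\mu_X)\,\sigma_X^2 \qquad (5.1.9)$$
How good such approximations are depends on how nonlinear $g$ is in the neighborhood of $\mu_X$
defined by the size of $\sigma_X$, where $\sigma_X$ is the standard deviation of $X$.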
To examine the disadvantages of using Log(Y+1), we should study the case where Y comes from a Poisson
distribution, since log-normal estimation is then at a great disadvantage. Using the
delta method, we can start by comparing Poisson and log-normal estimation for the simple case
of a single mean, the case considered by Ohara and Kotze, and then compare two means. For the
observation from the hawk counts , we have . To make the notation
consistent through the discussions of one mean, two means, and regression, throughout I will use
or .
5.2 Single Mean
Most of this report focuses on estimating changes or slopes across time, for example estimating
how bird populations are changing across several years of monitoring. But first I will
briefly discuss the case of estimating a single mean. In the one mean case, we consider the average
number of hawks at where is considered as predictor day. Let .
5.2.1 Poisson for single mean
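From (3.10), the Poisson maximum likelihood estimate of $\mu$ is the sample mean $\bar{Y}$ of the $n$ counts,
so $\log\mu$ is estimated by $\log\bar{Y}$. Using the
delta method, the expectation and variance of $\log\bar{Y}$ can be obtained as follows:
$$E[\log\bar{Y}] \approx \log\mu - \frac{1}{2n\mu} \qquad (5.2.1)$$
$$\mathrm{Var}[\log\bar{Y}] \approx \frac{1}{n\mu} \qquad (5.2.2)$$
Note that the degree of bias depends on the number of replicates, $n$. Ohara and Kotze used
n=100, which results in little bias if Poisson data are modeled.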
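5.2.2 Log-normal for Single Mean
If we use a normal distribution as an approximation to the distribution of $\log(Y+1)$, then
the estimate is the average of the $\log(Y_i+1)$ values. For the Poisson model above we use the log of the average, whereas in the
normal model we use the average of log values.
$$E\left[\frac{1}{n}\sum_{i=1}^{n}\log(Y_i+1)\right] \approx \log(\mu+1) - \frac{\mu}{2(\mu+1)^2} \qquad (5.2.3)$$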
Note that the Poisson model has smaller bias, expected value closer to log (µ). The smaller bias
is more pronounced for larger n such as n=100 for Ohara and Kotze. Unlike Poisson estimation,
the bias in using the mean of log values does not disappear with increasing n.
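The single-mean approximations can be traced step by step through the first- and second-order Taylor results of section 5.1. The sketch below assumes $Y_1, \ldots, Y_n$ are independent Poisson($\mu$) counts, so that $\bar{Y}$ has mean $\mu$ and variance $\mu/n$; it is a worked check under that assumption rather than the report's own derivation, and the function names $g$ and $h$ are introduced here for illustration.

```latex
% Poisson estimate: g(x) = \log x applied to the sample mean \bar{Y}
\begin{align*}
g'(\mu) &= \frac{1}{\mu}, \qquad g''(\mu) = -\frac{1}{\mu^2} \\
E[\log\bar{Y}] &\approx \log\mu + \frac{1}{2}\left(-\frac{1}{\mu^2}\right)\frac{\mu}{n}
  = \log\mu - \frac{1}{2n\mu} \\
\operatorname{Var}[\log\bar{Y}] &\approx \frac{1}{\mu^2}\cdot\frac{\mu}{n} = \frac{1}{n\mu}
\end{align*}
% Normal model: h(y) = \log(y+1) applied to each Y_i, then averaged
\begin{align*}
h'(\mu) &= \frac{1}{\mu+1}, \qquad h''(\mu) = -\frac{1}{(\mu+1)^2} \\
E[\log(Y_i+1)] &\approx \log(\mu+1) - \frac{\mu}{2(\mu+1)^2} \\
\operatorname{Var}\Big[\tfrac{1}{n}\textstyle\sum_i \log(Y_i+1)\Big] &\approx \frac{\mu}{n(\mu+1)^2}
\end{align*}
```

With $\mu = 10$ and $n = 10$ these give $-0.005$ and $0.01$ for the Poisson bias and variance, and $\log 11 - 10/242 \approx 2.3566$ for the log-normal expected value. The last expression contains no $n$, which is why that bias does not shrink with more replicates.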
5.3 Two Means
Suppose that and correspond to the average number of hawks at and ,
respectively. In this case a regression of Y on X will give a slope that is the same as the
difference between the means. Considering the difference between means, I will use
for and for .
5.3.1 Poisson for two means
From (3.10), we know the true and . Then we get
and ,
where and . By applying the delta method, we have
(5.3.1)
(5.3.2)
5.3.2 Log-normal for two means
For Log-transformation to ,
We are more concerned about the slope, so we would like to obtain the following by using
the delta method:
(5.3.3)
(5.3.4)
Again, the primary disadvantage of using normal likelihood methods is the larger bias. The
results given above are based on approximations, but based on simulations and exact calculation
given below, the general trends are accurate.
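For two independent groups the single-mean delta-method results simply combine. The sketch below assumes equal group sizes $n$ and Poisson counts, and writes the two group means as $\mu_1$ and $\mu_2$ (symbols chosen here because the original notation was lost in extraction), with the slope taken as the difference of the log means.

```latex
% Poisson: difference of the logs of the two sample means
\begin{align*}
E[\log\bar{Y}_2 - \log\bar{Y}_1] &\approx (\log\mu_2 - \log\mu_1)
  - \frac{1}{2n}\left(\frac{1}{\mu_2} - \frac{1}{\mu_1}\right) \\
\operatorname{Var}[\log\bar{Y}_2 - \log\bar{Y}_1] &\approx \frac{1}{n\mu_1} + \frac{1}{n\mu_2}
\end{align*}
% Normal model on log(Y+1): difference D of the two averages of log(Y+1)
\begin{align*}
E[D] &\approx \log\frac{\mu_2+1}{\mu_1+1}
  - \frac{\mu_2}{2(\mu_2+1)^2} + \frac{\mu_1}{2(\mu_1+1)^2} \\
\operatorname{Var}[D] &\approx \frac{\mu_1}{n(\mu_1+1)^2} + \frac{\mu_2}{n(\mu_2+1)^2}
\end{align*}
```

The Poisson bias term is of order $1/n$, whereas the leading log-normal bias, $\log\frac{\mu_2+1}{\mu_1+1} - \log\frac{\mu_2}{\mu_1}$, does not depend on $n$.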
5.4 Alternative Nonlinear Models
In the previous sections we used the delta method to find approximations for the mean and
variance of the parameter estimates, and we saw that using $\log(Y+1)$ results in more biased
estimates. Alternatively, we could use these approximations to derive less biased estimators.
Since the means are no longer linear functions, we will need to use nonlinear models to
accomplish the estimation. Nonlinear mixed models, in which fixed and random effects have
nonlinear relationships to the response variable, are becoming more and more popular.
For $\log(Y+1)$, with $\mu = E(Y)$ and $\sigma^2 = \mathrm{Var}(Y)$, Taylor series expansions give:
$$E[\log(Y+1)] \approx \log(\mu+1) - \frac{\sigma^2}{2(\mu+1)^2} \qquad (5.4.1)$$
$$\mathrm{Var}[\log(Y+1)] \approx \frac{\sigma^2}{(\mu+1)^2} \qquad (5.4.2)$$
If we assume that the variance is equal to the mean, as in a Poisson distribution, then a normal
approximation will use
$$\log(Y+1) \sim N\left(\log(\mu+1) - \frac{\mu}{2(\mu+1)^2},\ \frac{\mu}{(\mu+1)^2}\right) \qquad (5.4.3)$$
More generally, we can assume an overdispersion model such as $\sigma^2 = c\mu$ or
$\sigma^2 = c\mu^r$. Since both the mean and variance are nonlinear functions of the parameters,
a procedure such as SAS NLMIXED is used to fit these nonlinear models, as I discuss further later.
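As a quick numerical illustration of the approximate mean and variance of Ln(Y+1), the following SAS data step tabulates them for a few values of the mean. It is a sketch under assumptions: the data set name approx, the grid of means, and the choice c = 1, r = 1 (the Poisson-like case where the variance equals the mean) are illustrative, and the variable names simply mirror those in the nlmixed listing of section 6.2.3.1.

```sas
/* Delta-method mean and variance of Ln(Y+1) when Var(Y) = c*E(Y)**r. */
/* c, r and the grid of means are illustrative assumptions.            */
data approx;
  c = 1;
  r = 1;                                   /* c=1, r=1: Var(Y) = E(Y)   */
  do mu_y = 1, 2, 5, 10, 20;
    var_y  = c*mu_y**r;                    /* assumed variance of Y     */
    mu_ln  = log(mu_y + 1) - 0.5*var_y/(mu_y + 1)**2;  /* approx mean   */
    var_ln = var_y/(mu_y + 1)**2;          /* approx variance           */
    gap    = mu_ln - log(mu_y);            /* distance from Ln(E(Y))    */
    output;
  end;
run;

proc print data=approx noobs;
run;
```

Changing c and r lets one explore how the gap between the approximate mean of Ln(Y+1) and Ln(E(Y)) changes under other variance assumptions.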
6 FURTHER COMPARISONS OF MODELS
The ultimate purpose is to fit a good model to data on hawk migration by modeling effects such as
date, time of day, weather and distance from shore. To compare alternative models,
I did simulations and exact calculations to investigate how Poisson, negative binomial and
log-normal models compare when the data are Poisson, negative binomial and log-normal for
log(Y+1). I also investigated methods for bias correction using approximate propagation of
error methods for the log-normal model for log(Y+1). A simple way to check different models is to
look only at how hawk counts depend on time.
6.1 Single Mean
6.1.1 Exact calculation for a single mean
In section 5.2, we discussed the application of the delta method to a single mean and, in section 5.3, to
two means. For the single mean case, we only compare the exact calculation with the delta method. Let
(6.1.1) (6.1.2) Using (6.1.3) (6.1.4)
The two methods aren't on equal footing above, since the Poisson calculations don't use S=0
cases, but these are not common in the models considered, and comparisons of exact and
delta-method results, for the same model, are completely comparable.
6.1.2 Specific example of comparing exact calculation and
I use one simple case to illustrate the differences between the exact calculation and the delta
method approximation. Assume that we have $n = 10$ observations and the observed hawk counts
have mean $\mu = 10$. Then the true value of $\log\mu$ is $\log 10 = 2.302585$. Applying the
equations in sections 5.2 and 6.1, results for bias and root mean squared error (RMSE) about
                  Exact calculation   Delta method    Exact calculation   Delta method
                  for Poisson         for Poisson     for log-normal      for log-normal
                  regression          regression      of Ln(Y+1)          of Ln(Y+1)
True              2.302585            2.302585        2.302585            2.302585
Expected value    2.297543            2.297585        2.356573            2.356573
Bias              -0.005042438        -0.005          0.05398787          0.05398787
Variance          0.01015371          0.01            0.00944442          0.008264463
RMSE              0.1008917           0.1001249       0.1097244           0.1057315
Table 6.1.2.1 for $n = 10$ and $\mu = 10$
$$\text{Bias} = E(\text{estimate}) - \text{true value} \qquad (6.1.5)$$
$$\text{RMSE} = \sqrt{\text{Variance} + \text{Bias}^2} \qquad (6.1.6)$$
Conclusions from these results are as follows.
1) Comparing the first and second, or the third
and fourth, columns, we see that the delta method approximations are quite good for means of
this size. For smaller means the approximations will be less precise. The delta method
approximations could be used to develop more efficient models, as done later in this report.
2) Comparing the first and third columns, the normal approximation has larger bias, smaller
variance, and a slightly larger RMSE. Approximately bias-corrected estimators could therefore be
competitive with generalized linear models.
For comparing models with discrete distributions and closed form solutions such as
log(Y+1) or Poisson estimation, the simulations of Ohara and Kotze can be replaced with exact
calculations. In addition, simple delta method approximations can be used for initial
comparisons of alternative modeling methods before using more lengthy exact calculations for
final results on promising methods.
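The exact Poisson figures can be approximated by direct summation over the distribution of the total count. The sketch below assumes that the Poisson estimate is the log of the sample mean, that the total S of the n counts is Poisson(n*mu), that S = 0 is simply skipped as described in section 6.1.1, and that truncating the sum at 400 is adequate when n*mu = 100. The variable names are illustrative and this is not the appendix code.

```sas
/* Exact mean, variance and RMSE of log(S/n) when S ~ Poisson(n*mu).   */
/* n = 10 and mu = 10 are assumed values for illustration.             */
data _null_;
  n = 10;                        /* number of observations             */
  mu = 10;                       /* true mean count                    */
  lambda = n*mu;                 /* mean of the total count S          */
  e1 = 0;                        /* accumulates E[log(S/n)]            */
  e2 = 0;                        /* accumulates E[log(S/n)**2]         */
  do s = 1 to 400;               /* S = 0 skipped; upper tail negligible */
    p = pdf('poisson', s, lambda);
    e1 + p*log(s/n);
    e2 + p*log(s/n)**2;
  end;
  bias = e1 - log(mu);           /* expected value minus true log(mu)  */
  var  = e2 - e1**2;
  rmse = sqrt(var + bias**2);
  put e1= bias= var= rmse=;      /* written to the SAS log             */
run;
```

The values written to the log can be compared with the delta-method quantities -1/(2*n*mu) and 1/(n*mu) to see how close the two approaches are for means of this size.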
6.2 Two Means
The next step up in complexity is comparing two means. The difference between means is the
same as the regression slope with only two x values. In this section let’s look at this simpler case
before moving on to a more usual regression case.
6.2.1 Exact calculation for difference between means
In the two-means case, I compare exact calculation, the delta
method and simulation for the Poisson and log-normal models. We again assume the data are Poisson
distributed with and . Based on (3.10), for the method of
estimation , where and , and
in exact calculation can be obtained by the following:
(6.2.1) (6.2.2) Where , k = 1, 2. For the method that , we have the followings: (6.2.3) (6.2.4)
6.2.2 Comparisons of Models for Two Means by Doing Simulation
Basically, I would like to compare different methods of estimation, such as ,
and non-linear mixed model, of biases and RMSEs for . Data
sets were simulated from a Poisson distribution. To check if the mean and number of data points
in each simulation are factors, I simulated data sets with different values of two-means and data
points[( , ,n=10), ( , ,n=20), ( , ,n=10), ( ,
,n=20)]. The data were analyzed assuming that time is a factor. Models were fitted making the
following assumptions about the response, y:
1. y follows a Poisson distribution
2. y follows a negative binomial distribution
3. log(y+1) transformation follows a normal distribution
a. A standard regression with mean linearly related to x and constant variance.
b. Nonlinear approximations to the mean and variance with nlmixed.
The simulations were also compared by using the mean bias and root mean-squared error
(RMSE). Simulations and analyses were carried out in the SAS statistical program using proc reg, proc genmod and proc nlmixed.
Fig. 4 and Fig. 5 show the bias and RMSE of against different
models for data generated from different pairs of means and numbers of data points. For example,
'5_10with20' means that the data are generated with means 5 and 10 and n = 20. The amount of bias in
the regression model for log(Y+1) depends a little on the two means. But basically, the
non-linear mixed model gives the best estimate of the slope, that is, the difference of means. The data
set with the higher means generates lower bias than the one with the lower means.
Fig. 4 Estimated mean bias from four different models, applied to data simulated from a Poisson distribution. A low bias means that the model will basically return the “true” value.
The root mean-squared error shows a similar pattern, with the non-linear mixed model having a
low RMSE. A combination of higher mean and more data points gives lower RMSE. From these
plots in Fig. 4 and Fig. 5, Poisson, negative binomial and nlmixed models perform well for
Poisson data no matter what values are chosen for and . In short, we don’t have to worry
Fig. 5 Estimated root mean-squared error from four different models, applied to data simulated from a Poisson distribution.
6.2.3 Regression
From the previous sections, using nonlinear approximations to the means and variances is a
viable alternative to fitting the correct model when data are Poisson. If data were always Poisson,
these methods would not be necessary. But since data often follow fairly lognormal patterns
with much larger variances relative to the mean than a Poisson distribution or even a negative
binomial distribution, these methods could work well over a large range of models. For the
hawk migration data, our primary interests are usually regression type analyses such as whether
the populations are decreasing over a span of years.
To decide on what methods I should use to fit models to the hawk data, I will compare
and log normal distributed. We are most interested in estimating the slope, $\beta_1$, to monitor
changes in bird populations over time. The results will be shown as the relative bias and relative
RMSE of $\hat\beta_1$, which makes it easier to see how large the bias and RMSE are without referring
to the actual parameter values.
$$\text{relative bias} = \frac{E(\hat\beta_1) - \beta_1}{\beta_1} \qquad (6.2.5)$$
$$\text{relative RMSE} = \frac{\sqrt{E[(\hat\beta_1 - \beta_1)^2]}}{\beta_1} \qquad (6.2.6)$$
For example, a value of 0.2 means that the error is 20% of the true value.
For simplicity here, we consider the number of days past September 1 as a trend factor for
the hourly hawk counts and generate data corresponding to the number of days from 0 to 50.
6.2.3.1 Regression with Poisson data
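The simplest model for count data is a Poisson distribution, so, as in previous sections, the
data at the beginning of this Regression section are simulated from a Poisson distribution. The
SAS statistical program is the main one for analyses. The main procedures include proc reg, proc genmod (for Poisson and negative binomial regression) and proc nlmixed. The following code is used to generate Poisson data with a mean of $\exp(\beta_0 + \beta_1 \cdot \text{days})$:
%let b0=0;
%let b1=0.02;
%let nsim=1000;
%let n=10;
data sim3.mydata;
  call streaminit(1895);
  do isim=1 to &nsim;
    do days=0 to 50 by 5;
      do rep=1 to &n;
        mu=&b0+(&b1)*days;           /* linear predictor on the log scale */
        /* The statements that draw the count were lost in extraction;   */
        /* the next two lines are a reconstruction consistent with the   */
        /* text and with the variable ln_y_1 used in the nlmixed step.   */
        y=rand('poisson',exp(mu));
        ln_y_1=log(y+1);
        output;
      end;
    end;
  end;
run;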
The values for β0 and β1 are chosen to represent relatively small expected counts corresponding
to hourly observation of hawk counts, Y; the mean counts increase from an average of 1 bird per
hour to 2.7 birds per hour.
For the nlmixed model for Ln(Y+1), the approximations to the mean and variance of Ln(Y+1)
play a big role in the parameter estimates. From section 6.1, the delta method will be a good way to do
the approximations. The nlmixed code is as follows:
proc nlmixed data=sim3.mydata;
  ods output ParameterEstimates=parm_nlmix;
  by isim;
  title 'nlmixed';
  parms b0=0 b1=0.031 r=1 c=0.45;
  bounds c > 0;
  mu = b0 + b1*days;                 /* log of E(Y)             */
  mu_y = exp(mu);                    /* E(Y)                    */
  var_y = c*mu_y**r;                 /* Var(Y) = c*E(Y)**r      */
  /* delta-method mean and variance of Ln(Y+1) */
  mu_ln = log(mu_y+1) - 0.5*var_y/((mu_y+1)**2);
  var_ln = abs(var_y/(mu_y+1)**2);
  model ln_y_1 ~ normal(mu_ln, var_ln);
run;
The comparison results can be seen in Table 6.2.3.1(1). The nlmixed model does not
work quite as well as the Poisson and negative binomial models, both of which work very well for
Poisson data. We sacrifice little in
fitting the more general negative binomial model to the data. The nlmixed approximation does not do a bad job
either. Perhaps the extra flexibility of the nlmixed models for more complex data will be worth
the small sacrifice in efficiency when the data are Poisson. From this simulation result, it is
clear that the regression model does not fit Poisson data well. We see the comparison
Obs   method    MEAN       rel_bias   rel_RMSE
1     Poisson   0.020128    0.00638   0.23568
2     Reg       0.012683   -0.36584   0.39503
3     Negbin    0.020120    0.00598   0.23612
4     nlmixed   0.020610    0.03050   0.24795
Table 6.2.3.1(1)
Fig. 7 The estimated relative RMSE for different models
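Summary columns such as MEAN, rel_bias and rel_RMSE could be produced along the following lines. This is a sketch for the Poisson fit only, assuming the simulated data set sim3.mydata from the data-generation listing, with variables isim, days and y. The output data set names (parm_pois, slopes, summ, result) are hypothetical, and the report's own appendix code may differ.

```sas
%let b1=0.02;                        /* true slope used in the simulation */

/* Poisson regression with log link, one fit per simulated data set */
ods exclude all;                     /* suppress printed output of each fit */
proc genmod data=sim3.mydata;
  by isim;
  model y = days / dist=poisson link=log;
  ods output ParameterEstimates=parm_pois;
run;
ods exclude none;

/* keep the slope estimate and its squared error */
data slopes;
  set parm_pois;
  where upcase(Parameter) = 'DAYS';
  sq_err = (Estimate - &b1)**2;
  keep isim Estimate sq_err;
run;

/* average the estimates and squared errors over simulations */
proc means data=slopes noprint;
  var Estimate sq_err;
  output out=summ mean=mean_b1 mse_b1;
run;

/* relative bias and relative RMSE of the slope */
data result;
  set summ;
  rel_bias = (mean_b1 - &b1)/&b1;
  rel_rmse = sqrt(mse_b1)/&b1;
run;

proc print data=result noobs;
  var mean_b1 rel_bias rel_rmse;
run;
```

Replacing dist=poisson with dist=negbin gives the negative binomial fit with the same post-processing.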
6.2.3.2 Regression with lognormal data
As discussed earlier, in many cases Ln(Y+1) is fairly normally distributed. Potentially, when data are of this sort, generalized linear models such as Poisson regression or negative binomial regression might not be efficient compared to methods assuming normal errors. In this section, I simulate hawk data with Y following a discrete version of a lognormal distribution where Ln(Y) is normal. Because the discrete version will have zero counts, the analysis will be performed with Ln(Y+1). I will take the variance of Y to be proportional to a power of the expected value of Y. Since $Y$ has a log normal distribution, $Y = e^{X}$ where $X \sim N(\mu, \sigma^2)$. The following equations show the way to generate the random variables, followed by the mean and variance of the log normal variable:
$$X = \mu + \sigma Z, \quad Z \sim N(0, 1) \qquad (6.2.7)$$
$$Y = e^{X} \qquad (6.2.8)$$
$$E(Y) = \exp(\mu + \sigma^2/2) \qquad (6.2.9)$$
$$\mathrm{Var}(Y) = (e^{\sigma^2} - 1)\,[E(Y)]^2 = c\,[E(Y)]^{r} \qquad (6.2.10)$$
where $c$ and $r$ are both constants. Solving for $\sigma^2$ we find
$$\sigma^2 = \log\left(1 + c\,[E(Y)]^{r-2}\right) \qquad (6.2.11)$$
Then, knowing $\sigma^2$ from (6.2.11), it is easy to get $\mu$ from (6.2.9):
$$\mu = \log E(Y) - \sigma^2/2 \qquad (6.2.12)$$
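As a worked check of the lognormal relations against the simulation code that follows, take the starting point of the trend, where b0 = 0 gives $E(Y) = 1$, with $c = 1$ as in the listing. The numbers below are arithmetic under those assumptions, using $\mathrm{Var}(Y) = c\,[E(Y)]^r$ and the lognormal moments.

```latex
\begin{align*}
\sigma^2 &= \log\left(1 + c\,[E(Y)]^{r-2}\right) = \log(1 + 1) \approx 0.693
  \quad\text{(for any } r\text{, since } E(Y) = 1\text{)} \\
\mu &= \log E(Y) - \tfrac{1}{2}\sigma^2 \approx 0 - 0.347 = -0.347 \\
E(Y) &= \exp\left(\mu + \tfrac{1}{2}\sigma^2\right) = \exp(0) = 1 \\
r = 2:\quad \sigma^2 &= \log(1 + c) \quad\text{for every value of } E(Y)
\end{align*}
```

These correspond to the statements sig_2=log(1+(&c)*exp((r-2)*ln_mu_y)) and mu=ln_mu_y-0.5*sig_2 in the data step.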
For a Poisson distribution Var(Y) = E(Y), which corresponds to r=1 and c=1. Meanwhile, Y has constant variance on the Ln scale when r=2. Based on the above, the SAS code for generating the data is as follows:
%let b0=0;
%let b1=0.02;
%let nsim=1000;
%let n=10;
%let c=1;
data one;
  title 'Run Simulation';
  call streaminit(1895733);
  do isim=1 to &nsim;
    do days=0 to 50 by 5;
      do r=1 to 3.0 by 0.2;
        do rep=1 to &n;
          ln_mu_y=&b0+(&b1)*days;
          mu_y=exp(ln_mu_y);
          sig_2=log(1+(&c)*exp((r-2)*ln_mu_y));
          std=sqrt(sig_2);
          mu=ln_mu_y-0.5*sig_2;
          x=rand('normal',mu,std);
          y1=exp(x);
          rem = y1 - floor(y1);
          y = floor(y1) + 1*(rand('uniform') < rem);
          ln_y_1 = log(y+1);
          var_y = (exp(sig_2) - 1)*mu_y**2;
          mu_y_r = mu_y**r;
          output;
        end;
      end;
    end;
  end;
run;
The statement y = floor(y1) + 1*(rand('uniform') < rem); keeps the expected value of the rounded
version of Y the same as the expected value of y1. For example, if Y1 = 1.75, then floor(Y1) = 1,
the largest integer less than or equal to Y1, and Y=1 with probability 0.25 and Y=2 with
probability 0.75. For these simulated values the mean of Y increases from 1.0 at days=0 to 2.7 at
days=50. These values were chosen to represent fairly small counts corresponding to smaller
hourly recording for the data of Seeland (2010).
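The generation step above can be sketched outside SAS as well. The following Python sketch is only an illustration under stated assumptions: the function names `stochastic_round` and `simulate_count` are hypothetical, and Python's `random` module stands in for the SAS `rand` function, so the draws will not reproduce the SAS stream.

```python
import math
import random

def stochastic_round(value, rng):
    """Round down or up at random so that the expected result equals value."""
    base = math.floor(value)
    remainder = value - base
    return base + (1 if rng.random() < remainder else 0)

def simulate_count(days, r, rng, b0=0.0, b1=0.02, c=1.0):
    """Draw one discretized lognormal count with E(Y) = exp(b0 + b1*days).

    Before rounding, the lognormal variable has variance c * E(Y)**r.
    """
    ln_mu_y = b0 + b1 * days
    sig_2 = math.log(1.0 + c * math.exp((r - 2.0) * ln_mu_y))
    mu = ln_mu_y - 0.5 * sig_2
    y1 = math.exp(rng.gauss(mu, math.sqrt(sig_2)))
    return stochastic_round(y1, rng)

rng = random.Random(1895733)  # seed value borrowed from the SAS code; streams differ

# Mean of the rounded counts at days=50 should stay close to exp(1).
draws = [simulate_count(days=50, r=2.0, rng=rng) for _ in range(200_000)]
sample_mean = sum(draws) / len(draws)

# Rounding 1.75 repeatedly should give 2 about 75% of the time.
rounded = [stochastic_round(1.75, rng) for _ in range(100_000)]
frac_two = rounded.count(2) / len(rounded)
```

Because the rounding is random rather than to the nearest integer, the mean of the integer counts matches the mean of the underlying lognormal variable, which is the property the paragraph above describes.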
Hawk data are count data, which possibly include many zeros, so it is more meaningful to
compare models with the dependent variable Ln(Y+1), where Y represents buteo counts. In our
simulation, it is easy to model the relationship between the response and the independent variable days with regression, Poisson and negative binomial models. Here I mainly introduce the nlmixed
model. Before finalizing the nlmixed model, the first key step is to find good approximations to the
mean and variance of Ln(Y+1). Three approximation methods were compared. The "best" one
was chosen as the one whose approximate mean and variance of Ln(Y+1) were closest to the true mean and variance. A
meaningful simulation therefore has two key steps: simulating the data and choosing an appropriate approximation method in
the nlmixed model. The main step in our simulation is selecting and
checking the approximation method. Different expansions generate
different approximations of the variance and mean of Ln(Y+1). For example, we can apply a Taylor
series expansion of Ln(Y+1) about the mean of Y. In our simulation, we used
a different approach, constructing a log likelihood function directly, which can be seen in
the following nlmixed code:
proc nlmixed data=one;
ods output ParameterEstimates=parm_nlmix2;
by isim r;
bounds c > 0;
ln_mu_y = b0 + b1*days;
sig_2 = log(c*exp((r-2)*ln_mu_y)+1);
mu = ln_mu_y - 0.5*sig_2;
if y = 0 then LogLike = log(probnorm((log(0.5)-mu)/sqrt(sig_2)));
else LogLike = log(probnorm((log(y+0.5)-mu)/sqrt(sig_2))
- probnorm((log(y-0.5)-mu)/sqrt(sig_2)));
model y ~ general(LogLike);
run;
This likelihood treats the observed counts as rounded lognormal data. Since this was the way the
simulation data were generated, this maximum likelihood method should be optimal, at least
asymptotically for large sample sizes. Comparing this nlmixed model with the regression, Poisson
and negative binomial models, the resulting plots are shown in Fig. 8 and Fig. 9. The
regression model performs poorly, with large relative biases and RMSEs. The Poisson and negative
binomial models perform well, with good estimates. We can say the nlmixed model works very well.
Fig. 9 Relative RMSE of the b1 estimate against different r
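The rounded-lognormal likelihood in the nlmixed code above can be sketched in a few lines of Python. This is a hedged illustration rather than the fitted model itself: the function names `norm_cdf`, `count_probability` and `log_likelihood` are hypothetical, and the sketch mirrors the SAS expressions by treating a count y as the interval from y-0.5 to y+0.5 on the lognormal scale.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function (SAS probnorm)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def count_probability(y, ln_mu_y, r, c):
    """P(Y = y) when a lognormal variable is rounded to the nearest integer."""
    sig_2 = math.log(c * math.exp((r - 2.0) * ln_mu_y) + 1.0)
    mu = ln_mu_y - 0.5 * sig_2
    sd = math.sqrt(sig_2)
    upper = norm_cdf((math.log(y + 0.5) - mu) / sd)
    if y == 0:
        # Everything below 0.5 is recorded as a zero count.
        return upper
    return upper - norm_cdf((math.log(y - 0.5) - mu) / sd)

def log_likelihood(counts, days, b0, b1, r, c):
    """Sum of log P(Y = y) with Ln E(Y) = b0 + b1*days."""
    return sum(
        math.log(count_probability(y, b0 + b1 * d, r, c))
        for y, d in zip(counts, days)
    )

# Hypothetical counts on three days, evaluated at the simulation's true values.
example_ll = log_likelihood([0, 1, 3], [0, 25, 50], b0=0.0, b1=0.02, r=2.0, c=1.0)

# The probabilities over all counts should add up to one.
total_probability = sum(count_probability(y, 1.0, 2.0, 1.0) for y in range(5001))
```

A maximizer such as the one inside PROC NLMIXED would then search over b0, b1 and c (and r, when it is a parameter) for the values that make this sum largest.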
6.2.3.3 Summaries for Model Comparison
Through comparing relative bias and RMSE for different models, the nlmixed model generally
does a good job whether the data follow a Poisson distribution or a lognormal distribution.
Surprisingly, the regression model does not work well for lognormal data. Meanwhile, the Poisson
and negative binomial models still perform well. When the variance is proportional to a large power of
the mean, say 3 or more, the nlmixed nonlinear approximation works better, but for data between
Poisson, r = 1, and lognormal with constant variance, r = 2, the generalized linear models,
particularly the negative binomial model, work well even for lognormal data. The negative
binomial variance allows both Poisson variance with θ large and variance
data, we will note that the variance is estimated to be proportional to µ^1.8, which is within the range where either negative binomial or nonlinear approximations work well.
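The relative bias and relative RMSE used in these comparisons follow the arithmetic in the appendix SAS code: bias is the mean estimate minus the true slope, MSE is the variance plus the squared bias, and both are divided by the true slope. A minimal Python sketch, with a hypothetical function name and made-up slope estimates used only to show the calculation:

```python
import math
import statistics

def summarize_estimates(estimates, true_value):
    """Bias, RMSE and their relative versions for a set of simulated estimates."""
    mean = statistics.mean(estimates)
    std = statistics.stdev(estimates)  # n-1 divisor, the PROC MEANS default
    bias = mean - true_value
    mse = std ** 2 + bias ** 2
    rmse = math.sqrt(mse)
    return {
        "bias": bias,
        "rmse": rmse,
        "rel_bias": bias / true_value,
        "rel_rmse": rmse / true_value,
    }

# Made-up slope estimates from four imaginary simulation runs, true b1 = 0.02.
summary = summarize_estimates([0.018, 0.022, 0.025, 0.019], true_value=0.02)
```

Dividing by the true slope puts all methods on a common scale, so a relative bias of 0.05 means the slope is overestimated by 5% on average.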
7
FITTING MODELS TO HAWK DATA
In this section, we will look further at the hawk example introduced in section 2. From Fig. 3 in
section 2, which plots the logarithm of average buteo counts each day against date, we can see that there
might be a linearly increasing trend in date. Does the wind during the observation hour affect
buteo migration? Could the distance to dry land be a factor in buteo counts?
To make valid forecasts of buteo counts, model selection is important. Buteo
counts are discrete variables and might include zeros. Models for such data include Poisson and
negative binomial distributions, but it is possible that there are too many zeros for Poisson or
negative binomial distributions. Another option is to use Ln(Y+1) and apply methods for
normally distributed data. The Central Limit Theorem (CLT) helps models that assume
normality of the data to work. This motivated the simulations in section 6, which try to find an appropriate model
for hawk data.
7.1 Simple introduction to some potential variables
Hawk counts were recorded under certain weather, geographic and geological conditions. Let's
look at the basic ideas behind how we use these conditions.
1. Wind is considered as one possible factor. The best wind direction is nearly zero=north, so north is chosen as the reference wind direction. Wind was recorded as degrees
clockwise from north. The variable Wind_north_sp is wind speed times the cosine of the wind angle relative to north and can be understood as the strength of the northerly wind vector.
The variable Wind_Prev is used to record the number of days that winds did not have a westerly component before the observation day.
2. We wonder if the time of day, that is, the specific hour when observations began, could be a factor in counting migrating buteos, so the variable Time represents the starting time of observations on a day.
3. We have noticed that buteo counts slightly increase with date, so we use the variable
day to represent the number of days since Sept. 1, 2008.
4. Precipitation is also considered as a potential predictor. The variable Precip_Pre records the number of days with 50% or more hours of precipitation prior to the observation day.
5. Likewise, we wonder if the distance to water affects buteo counts. Distance to the
shore of Lake Superior is used to see if buteo migration is somehow related to this
geographical location.
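The wind component described in item 1 is a projection of the wind vector onto a reference direction. The Python sketch below illustrates that calculation; the function name and the `best_direction_deg` argument are hypothetical, with the default of 0 degrees matching the choice of north as the reference.

```python
import math

def wind_north_component(speed, direction_deg, best_direction_deg=0.0):
    """Wind speed times the cosine of the angle from the reference direction.

    direction_deg is measured in degrees clockwise from north.
    """
    angle = math.radians(direction_deg - best_direction_deg)
    return speed * math.cos(angle)

due_north = wind_north_component(10.0, 0.0)      # full strength
due_east = wind_north_component(10.0, 90.0)      # no northerly component
due_south = wind_north_component(10.0, 180.0)    # fully opposed
```

A wind from due north contributes its full speed, a crosswind contributes nothing, and a wind from the south contributes a negative value of the same size.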
7.2 Fitting Models
From section 6.2.3, it seems that the negative binomial model would be a reasonable choice
given that NLMIXED cannot handle the random effects that we need in the model. These mixed
models with negative binomial data can be fit with SAS procedure NLMIXED. However, fitting
these types of models with these random effects turns out to be tricky. Nonlinear optimization
and numerical integration are needed, and for all models we fit, the resulting gradient vector in
the "solution" was not close to zero, which is what we want if we are at a local maximum of the
log-likelihood function. So we are back to using Ln(Y+1) as an initial analysis of these data.
From the simulations, using Ln(Y+1) should be less efficient than using the better methods, so at
efficient models. Further work will need to be done beyond this project to figure out how to fit
more complicated models.
At the first step, we try many independent variables, e.g. 19 and use as a selection
criterion to obtain potential variables. Generally, several models might be highly similar in the
quality of the fit based on selection. Based on the values, we only can choose a shorter list
of independent variables to start studying. The runs were done without random effects in the
models, since software is readily available to do this. The p-values will not be correct, but the
relative importance of the potential independent variables should be fine. We then fit a model
including variables included in the top models based on . Then the p-value is used as one of
the criteria to cut down variables based on former runs. Here is an example to show how we
eliminate variables. By running a regression model including the independent variable temp_chg, we
found that the p-value of temp_chg is around 0.9236, which is very large, indicating that
temp_chg is not needed if the other variables are included in the model. We can say that
temperature change is not an important predictor of buteo counts, so the variable temp_chg need
not be considered in the model. Finally, only 14 independent variables are left after using
p-values in this way to reduce the variables.
7.2.1 Fit Mixed Model to Data
Mixed models are widely used to model a linear relationship when the dependent data have
known structure. The commonly used mixed model involves repeated measurement. Repeated
measures are encountered in hawk data, so a mixed model is applied in analyzing the relationship
Transects are numbered by ordering the distances from Duluth up to the North Shore of
Lake Superior. Drawing conclusions about places in general is more meaningful than
finding out the effect of these specific transects. Thus, transect is considered as a random effect
here. Date is the day of observing hawk migration, which is treated as random effect too. The
sites on a given transect were distances from shore recorded as the variable shore (a, b, or c)
where shore = a is closest to Lake Superior and shore = c is farthest from Lake Superior. To
account for dependence of hourly measurements at the same site, a shore*date random effect is
also included.
Even though nlmixed might fit the hawk data well, nlmixed cannot handle both date and
shore*date random effects. This is the reason for using mixed rather than nlmixed to include
those random effects in the model. Proc Mixed in the SAS system provides a very flexible platform for dealing with repeated measures problems. The mixed model can provide better p-values than
the regression model for cutting down variables. One of the mixed model programs is as follows:
proc mixed data=fengying.buteos_before_nov plots=residualpanel(unpack);
class transect date shore;
model Ln_buteos_plus_1 = day shore Wind_Prev Precip_Pre wind_east
wind_north_sp time time*time/ residual outpm=outpm solution ;
random date shore*date;
ods select solutionf covparms tests3 ResidualQQplot ; run;
The estimated transect variance comes out as 0. To make the convergence of the estimation
simpler and more likely to find the right MLE, transect is taken out of the random effects for the
mixed models in our study.
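Written out, the model fit by the Proc Mixed code above has roughly the following form. This is a sketch in notation introduced here for illustration (the indices d for date, s for shore position and h for the hourly observation, and the variance symbols, do not appear in the original), with transect already removed from the random effects:

```latex
\[
\mathrm{Ln}(Y_{dsh} + 1) = \mathbf{x}_{dsh}^{\top}\boldsymbol{\beta} + u_{d} + v_{ds} + \varepsilon_{dsh},
\]
\[
u_{d} \sim N(0, \sigma^{2}_{\mathrm{date}}), \qquad
v_{ds} \sim N(0, \sigma^{2}_{\mathrm{shore \times date}}), \qquad
\varepsilon_{dsh} \sim N(0, \sigma^{2}).
\]
```

Here the fixed-effect vector x contains day, shore, Wind_Prev, Precip_Pre, wind_east, wind_north_sp, time and time*time, and the random terms are assumed independent of one another.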
7.2.2 Fit Nlmixed Model to Data
The procedure nlmixed model cannot handle the model with both date and shore*date random
and also to check for the best wind direction. This model should be fixed to include all the
variables from previous runs. No random effects in our nlmixed model were used to make the
estimation easier. Applying the approximation method we finalized in section 6.2.3 to the
hawk data, we came up with the following nlmixed code:
proc nlmixed data=fengying.buteos_before_nov;
parms b0=-8.8 b_day=0.02 b_shore_a=0.4 b_shore_b=0.3
b_wind_prev=0.2 b_time=1.5 b_time_2=-0.06 b_precip_pre=-0.75 k=0.2
r=2 c=0.5 theta=0;
bounds c > 0;
wind = k*wind_sp*cos( (wind_dir-theta)*3.1415927/180 );
ln_mu_y = b0 + b_day*day + b_shore_a*shore_a + b_shore_b*shore_b + wind + b_wind_prev*wind_prev
+ b_time*time + b_time_2*time*time + b_precip_pre*precip_pre;
sig_2 = log(c*exp((r-2)*ln_mu_y)+1);
mu = ln_mu_y - 0.5*sig_2;
y = buteos;
if y = 0 then ll = log(probnorm((log(0.5)-mu)/sqrt(sig_2)));
else ll = log(probnorm((log(y+0.5)-mu)/sqrt(sig_2)) -
probnorm((log(y-0.5)-mu)/sqrt(sig_2)));
model y ~ general(ll);
run;
Using this code, the maximum likelihood estimate of the clockwise angle relative to north is
-0.03° with a standard error of 10°, very nearly true north. The estimate of r is 1.8 with a standard
error of 0.19, corresponding to Var(Y) proportional to µ^1.8, indicating Y is log normal. We can say that
the data would not be modeled well as Poisson data.
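One way to read the estimate of r and its standard error is through an approximate 95% Wald interval. The sketch below assumes the usual normal approximation for the maximum likelihood estimate; the function name is hypothetical and the interval is an illustration, not a result reported in the original analysis.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval: estimate plus or minus z standard errors."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Estimate and standard error of r from the nlmixed fit reported above.
low, high = wald_interval(1.8, 0.19)

contains_poisson_power = low <= 1.0 <= high     # r = 1: Poisson-type variance
contains_lognormal_power = low <= 2.0 <= high   # r = 2: constant variance on the Ln scale
```

Under that approximation the interval runs from about 1.43 to about 2.17, so it excludes r = 1 and includes r = 2, which is consistent with the statement that the data look lognormal rather than Poisson.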
7.3 Summary of Findings
One of the mixed models was introduced in section 7.2.1. First, a newly built model prompts us to
look at how the errors of the model are distributed. In Fig. 10, we can see that the residuals
lie close to the straight line except for the last two points, which indicates the data
Fig. 10 QQ-plot for residuals from a mixed model
Intuitively, a good model should have predicted values as close to the true values as
possible. The R² value of a model is the square of the correlation between the fitted and observed values. It is interesting to see how the predicted values from the mixed model compare
with the real values of Ln(buteos+1). In Fig. 11 the basic trend can be described by the equation y=x,
except for two outliers. The R² value for this model is about 0.4. Based on the simulations, better
models could potentially be fit, but generally speaking, this model works fairly well for these data.
Fig. 11 Ln(buteos+1) VS. Predicted Mean of Ln(buteos+1)
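The definition of R² as the squared correlation between fitted and observed values can be computed directly. The Python sketch below uses a hypothetical function name and made-up numbers purely to show the arithmetic; it does not use the hawk data.

```python
def r_squared(observed, fitted):
    """Square of the Pearson correlation between observed and fitted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_f = sum(fitted) / n
    cov = sum((o - mean_o) * (f - mean_f) for o, f in zip(observed, fitted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    return cov ** 2 / (var_o * var_f)

# Made-up observed and fitted values.
example_r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
perfect_r2 = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A value near 1 means the fitted values track the observations closely along a line, while a value such as 0.4 leaves a good deal of scatter around the y=x trend.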
The following table 7.3(1) of p-values shows that the effects day, wind_north_sp and time are significant predictors of buteo counts. Again, better models could potentially be fit, but the significant p-values from this model are reliable. This is like using non-parametric methods
when data are normal or follow some other distribution. The statistics are not as efficient as they could
be, but significant effects can still be considered significant.
Type 3 Tests of Fixed Effects
Effect          Num DF   Den DF   F Value   Pr > F
day             1        425      26.73     <.0001
shore           2        45       3.69      0.0328
Wind_Prev       1        425      2.63      0.1056
Precip_Pre      1        425      5.30      0.0218
wind_east       1        425      4.31      0.0384
wind_north_sp   1        425      37.56     <.0001
time            1        425      78.66     <.0001
time*time       1        425      71.68     <.0001
The day effect has been shown in Fig. 3. More and more buteos migrate as the date gets
closer to winter. In this part, we are more concerned about the wind_north_sp and time effects. To
check their effects, LSMEANS statements were added, one at a time, to the previous mixed model.
For example, lsmeans wind_north_sp /obsmargins;. For the mixed model with
this LSMEANS statement, the variable wind_east, with a p-value of 0.7624, is not useful to the
model. Fig. 12 is the plot of the estimates of Ln(buteos+1) against the least squares means of
wind_north_sp, using each unique value of wind_north_sp as its own effect, that is, using
wind_north_sp as a "class" variable rather than a linear effect. There is an increasing
trend in wind_north_sp, which further illustrates that north is the best wind direction for buteo
migration.
Fig. 12 Estimates of Ln(buteos+1) VS. LS-means of wind_north_sp
For another mixed model with LS-means variable time, wind_east also came up to be an
is shown in Fig 13 using each hour of the day as its own effect, day as a "class" variable, rather
than a quadratic effect. Basically, the buteo migration peak in a day is in the early afternoon.
Fig. 13 Estimates of Ln(buteos+1) VS. LS-means of time
In the nlmixed model, we want to see if there exists a relationship between the variance of
Y and the mean of Y, where Y is the buteo count. We used the idea that Var(Y) = c[E(Y)]^r in the nlmixed
model. After running the nlmixed code in section 7.2.2, the estimate of r is 1.8 with a standard
error of 0.19. In other words, the variance of the buteo count is approximately proportional to the
square of the mean of the buteo counts, so the mixed model, which assumes equal variances on the
log scale, is reasonable.
8
CONCLUSION
Count data are commonly studied nowadays. The LM, GLM, MIXED and NLMIXED models
some of the results from O'Hara and Kotze (2010) that log-transformation of count data performs
poorly while negative binomial and Poisson models work well, so we do not recommend
log-transforming count data with many 'zero' observations. The negative binomial model might
perform better than the Poisson model for these kinds of data. The mixed model provides an
effective way to analyze count data with complicated random effects in place of the NLMIXED
model. When applying the NLMIXED model, the main focus should be on choosing a good
approximation method. SAS is good statistical software for fitting these models. In our simulations
the negative binomial model did well even for lognormal data. However, in 2008 there were not
as pronounced large bursts of buteos during the migration. With more very large count days, the
variance may be a higher power of the mean where the nonlinear models would have advantage
over the negative binomial models.
To answer the questions of research for the hawk data set in this report, the mixed model
is used to analyze it. The effects of day, wind_north_sp and time play central roles in estimating buteo counts during a certain period of time. It is understandable that an
increasing number of buteos migrate as time gets closer to winter. Buteos fly from north to
south with the benefit of north wind when winter is coming, so it makes sense that
wind_north_sp is a significant factor and there exists an increasing trend in wind_north_sp. It is
possible that buteos prefer to migrate during a slightly warmer time, in the early afternoon,
which can also be seen from Fig. 13 with a downward ‘parabola’.
For future studies, we can incorporate the mixed model with other bird data, such as
accipiters, which fly closer to the ground. By analyzing other bird data, we can further see if day, wind_north_sp and time are still significant in predicting other birds' counts. Although the transformation of the hawk data, to some extent, supports the mixed model, the generalized linear mixed
model (GLMM) can be studied to fit hawk data, since nonnegative counts are more likely
to follow a Poisson or negative binomial distribution than a normal distribution. More extensive data
from the Hawk Ridge observatory station in Duluth could be used for further investigations
including years with very large counts of buteos on some days. But before fitting GLMM, we
have to address the concern that the GLMM has less flexibility of selecting covariance structure
than the linear mixed model (LMM). The question is how to balance the benefits and disadvantages
from the GLMM, which could become a future subject of study. For count data, many types of
Poisson mixed model have been put forward. It is often the case that there are more zero counts
than there should be for Poisson distribution. For this kind of case such as hawk data,
zero-inflated Poisson (ZIP) mixed models, which include not only the Poisson regression for zero and
nonzero counts but also a logistic regression for the probability of a nonzero response, have been
proposed and developed. For future work, we are also interested in fitting zero-inflated models,
ZIP or ZINB, for data such as counts of migrating hawks. Likewise, we will compare power for
different models and consider adding other variables such as atmospheric pressure to the model. In
addition to estimating the slope trend over time, we would also like to investigate how well the
models predict the number of hawks at any given point in time.
The primary take home message from the simulations is that even if the data are
generated by normal models with variances proportional to no more than the mean squared, the
GLM negative binomial models are still quite good even for lognormal data and that methods
other than LMM's for Ln(Y+1) should be investigated, at least for the small counts that we used
in our simulations. The nlmixed models are similar to Generalized Estimating Equation (GEE)
models for correlated generalized linear models in that they apply normal theory using
need to be able to fit these reliably with more general random effects than nlmixed can handle.
The next step in fitting better models to these data is to figure out how to fit GLM negative
binomial models or nlmixed type models with multiple random effects. SAS procedure
GLIMMIX has the potential for mixed effects GLM models, but getting reliable convergence has
been a problem for us. Other software such as the nlme library in R or Bayesian methods should
9
REFERENCES
[1] Robert B. O'Hara and D. Johan Kotze, 2010. Do Not Log-transform Count Data. Methods in Ecology & Evolution, 1, 118-122.
[2] John A. Rice, 1987. Mathematical Statistics and Data Analysis. University of California, San Diego.
[3] Changming Xia, 2010. Modeling Data Correlation with Structured Covariance in Mixed Model. University of Minnesota Duluth.
[4] Annette J. Dobson and Adrian G. Barnett, 2002. An Introduction to Generalized Linear Models. Boca Raton: CRC Press.
[5] Sheldon M. Ross, 2002. Introduction to Probability Models. Academic Press.
[6] Norman L. Johnson and Samuel Kotz, 1969. Discrete Distributions. Boston: Houghton Mifflin Company.
[7] Brian S. Everitt and Graham Dunn, 1991. Applied Multivariate Data Analysis. New York & Toronto: Halsted Press.
[8] David Shen and Zaizai Lu. Statistical Application of SAS in Method Comparison Analysis.
[9] Mike Zdeb and Robert Allison. SAS/GRAPH® 101, SUGI 131.
10
APPENDICES
10.1 SAS Code
Two-means simulation code:
libname sim3 "F:\Fengying Miao\simulation\Poisson vs Ln_normal"; run;
%macro choose(u1,u2,n,nsim,dataset);
%let b0=log(&u1); %let b1=log(&u2)-log(&u1);
data sim3.mydata;
do isim=1 to &nsim;
call streaminit(1895);
do time=0 to 1;
do rep=1 to &n;
mu=&b0+(&b1)*time;
y=rand('poisson',exp(mu));
ln_y_1=log(y+1); output; end; end; end; run;
proc sort data=sim3.mydata; by isim;
run;
ods listing close;
proc printto log="F:\Fengying Miao\simulation\Poisson vs Ln_normal\junk.log";; run;
proc reg data=sim3.mydata;
ods output ParameterEstimates=parm_reg3; by isim; model ln_y_1=time; run; data reg3; set parm_reg3(drop=model); method="Reg"; if variable="time"; run;
proc means data=reg3; by method;
output out=outs_reg3; run;
data outs_rega1(drop=_stat_ estimate); set outs_reg3(keep=method _stat_ estimate); if _stat_="STD";
STD=estimate; run;
data outs_rega2(drop=_stat_ estimate); set outs_reg3(keep=method _stat_ estimate); if _stat_="MEAN";
MEAN=estimate; run;
data outs_rega3(drop=STD); merge outs_rega1 outs_rega2; by method; bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); run;
proc genmod data=sim3.mydata;
ods output ParameterEstimates=parm_gen3; by isim;
model y=time/link=log dist=poisson; run; data genmod3(drop=parameter); set parm_gen3; method="Poisson"; variable=parameter; dependent="y"; if variable="time"; run;
data a0(drop=s_include s_above s_below lowerwaldcl upperwaldcl); set genmod3(keep=isim method lowerwaldcl upperwaldcl);
retain s_include s_above s_below;
if lowerwaldcl<(&b1) and (&b1)<upperwaldcl then s_include+1; else if lowerwaldcl>(&b1) then s_above+1;
else s_below+1;
p_b1=s_include/&nsim; p_ab1=s_above/&nsim; p_bb1=s_below/&nsim;
label p_b1="prob(include b1)"; label p_ab1=" prob(above b1)"; label p_bb1="prob(below b1)"; if isim=&nsim;
run;
proc means data=genmod3; by method;
output out=outs_all3; run;
data outs_a1(drop=_stat_ estimate);
set outs_all3(keep=method _stat_ estimate); if _stat_="STD";
STD=estimate; run;
data outs_a2(drop=_stat_ estimate);
set outs_all3(keep=method _stat_ estimate lowerwaldcl upperwaldcl); if _stat_="MEAN";
MEAN=estimate; run;
data outs_a3(drop=isim STD lowerwaldcl upperwaldcl); merge a0 outs_a1 outs_a2;
by method; bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); run;
proc genmod data=sim3.mydata;
ods output ParameterEstimates=parm_neg3; by isim;
model y=time/link=log dist=negbin; run; data neg3(drop=parameter); set parm_neg3; method="Negbin"; variable=parameter; dependent="y"; if variable="time"; run;
data neg_a(drop=s_include s_above s_below lowerwaldcl upperwaldcl); set neg3(keep=isim method lowerwaldcl upperwaldcl);
retain s_include s_above s_below;
if lowerwaldcl<(&b1) and (&b1)<upperwaldcl then s_include+1; else if lowerwaldcl>(&b1) then s_above+1;
else s_below+1;
p_b1=s_include/&nsim; p_ab1=s_above/&nsim; p_bb1=s_below/&nsim;
label p_b1="prob(include b1)"; label p_ab1=" prob(above b1)"; label p_bb1="prob(below b1)"; if isim=&nsim;
run;
proc means data=neg3; by method;
output out=outs_neg3; run;
data outs_neg_b(drop=_stat_ estimate); set outs_neg3(keep=method _stat_ estimate); if _stat_="STD";
STD=estimate; run;
data outs_neg_c(drop=_stat_ estimate);
set outs_neg3(keep=method _stat_ estimate lowerwaldcl upperwaldcl); if _stat_="MEAN";
MEAN=estimate; run;
data outs_neg_d(drop=isim STD lowerwaldcl upperwaldcl); merge neg_a outs_neg_b outs_neg_c;
by method; bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); run;
proc nlmixed data=sim3.mydata;
ods output ParameterEstimates=parm_nlmix; by isim; title 'nlmixed'; parms b0=1.6 b1=0.7 r=1 c=1; bounds c > 0; mu = b0 + b1*time; mu_y = exp(mu); var_y = c*mu_y**r;
mu_ln = log(mu_y+1) - 0.5*var_y/((mu_y+1)**2); var_ln = abs(var_y/(mu_y+1)**2);
model ln_y_1 ~ normal(mu_ln, var_ln); run;
data nlmix(keep=isim variable estimate method) ; set parm_nlmix;
method="nlmixed"; variable=parameter; if variable="b1"; run;
proc means data=nlmix; by method;
output out=outs_nlmix; run;
data outs_nlmix1(drop=_stat_ estimate); set outs_nlmix(keep=method _stat_ estimate); if _stat_="STD";
STD=estimate; run;
data outs_nlmix2(drop=_stat_ estimate); set outs_nlmix(keep=method _stat_ estimate); if _stat_="MEAN";
MEAN=estimate; run;
data outs_nlmix3(drop=STD var MSE); merge outs_nlmix1 outs_nlmix2;
bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); run; data results_&dataset;
set outs_a3 outs_rega3 outs_neg_d outs_nlmix3; twomean="&u1._&u2.with&n"; run; %mend choose; %choose(u1=5,u2=10,n=10,nsim=1000,dataset=1) %choose(u1=2,u2=5, n=10,nsim=1000,dataset=2) %choose(u1=5,u2=10,n=20,nsim=1000,dataset=3) %choose(u1=2,u2=5,n=20,nsim=1000,dataset=4) data sim3.result_all;
set results_1 results_2 results_3 results_4;
run;
ods rtf file="F:\Master Project\Exact calculation\exact_simulation\two_mean.rtf"; goptions reset=all;
symbol1 value=dot c=green height=0.25in; symbol2 value=star c=red height=0.3in;
symbol3 font=marker value=U c=brown height=0.15in; symbol4 value=circle c=red height=0.25in;
axis1 label=("Bias of b1"); axis2 label=("RMSE of b1");
proc gplot data=sim3.result_all;
plot bias*method=twomean;
plot RMSE*method=twomean;
run;
ods rtf close;
Regression code for Poisson data:
libname sim3 "F:\Master Project\Exact calculation\exact_simulation\pois_sim"; run;
%let b0=0; %let b1=0.02; %let nsim=1000; %let n=10;
data sim3.mydata;
call streaminit(1895);
do isim=1 to &nsim;
do time=0 to 50 by 5;
do rep=1 to &n;
mu=&b0+(&b1)*time;
y=rand('poisson',exp(mu));
ln_y_1=log(y+1); output;
end; end; end;
run;
proc sort data=sim3.mydata;
by isim;
run;
ods listing close;
proc printto log='F:\Master Project\Exact calculation\exact_simulation\pois_sim\junk.log'; run;
proc reg data=sim3.mydata;
ods output ParameterEstimates=parm_reg3;
by isim; model ln_y_1=time; run; data reg3; set parm_reg3(drop=model); method="Reg"; if variable="time"; run;
proc means data=reg3;
by method;
output out=outs_reg3; run;
data outs_rega1(drop=_stat_ estimate);
set outs_reg3(keep=method _stat_ estimate);
if _stat_="STD"; STD=estimate; run;
data outs_rega2(drop=_stat_ estimate);
set outs_reg3(keep=method _stat_ estimate);
if _stat_="MEAN"; MEAN=estimate; run;
data outs_rega3(drop=STD var MSE);
merge outs_rega1 outs_rega2;
by method; bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); rel_bias=bias/(&b1); rel_RMSE=RMSE/(&b1); run;
proc genmod data=sim3.mydata;
ods output ParameterEstimates=parm_gen3;
by isim;
model y=time/link=log dist=poisson; run;
data genmod3(drop=parameter); set parm_gen3; method="Poisson"; variable=parameter; dependent="y"; if variable="time"; run;
data a0(drop=s_include s_above s_below lowerwaldcl upperwaldcl);
set genmod3(keep=isim method lowerwaldcl upperwaldcl);
retain s_include s_above s_below;
if lowerwaldcl<(&b1) and (&b1)<upperwaldcl then s_include+1; else if lowerwaldcl>(&b1) then s_above+1;
else s_below+1;
p_b1=s_include/&nsim; p_ab1=s_above/&nsim; p_bb1=s_below/&nsim;
label p_b1="prob(include b1)"; label p_ab1=" prob(above b1)"; label p_bb1="prob(below b1)"; if isim=&nsim;
run;
proc means data=genmod3;
by method;
output out=outs_all3; run;
data outs_a1(drop=_stat_ estimate);
set outs_all3(keep=method _stat_ estimate);
if _stat_="STD"; STD=estimate; run;
data outs_a2(drop=_stat_ estimate);
set outs_all3(keep=method _stat_ estimate lowerwaldcl upperwaldcl);
if _stat_="MEAN"; MEAN=estimate; run;
data outs_a3(drop=isim STD lowerwaldcl upperwaldcl var MSE);
merge a0 outs_a1 outs_a2;
by method; bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); rel_bias=bias/(&b1); rel_RMSE=RMSE/(&b1); run;
proc genmod data=sim3.mydata;
ods output ParameterEstimates=parm_neg3;
by isim;
model y=time/link=log dist=negbin; run;
data neg3(drop=parameter); set parm_neg3; method="Negbin"; variable=parameter; dependent="y"; if variable="time"; run;
data neg_a(drop=s_include s_above s_below lowerwaldcl upperwaldcl);
set neg3(keep=isim method lowerwaldcl upperwaldcl);
retain s_include s_above s_below;
if lowerwaldcl<(&b1) and (&b1)<upperwaldcl then s_include+1; else if lowerwaldcl>(&b1) then s_above+1;
else s_below+1;
p_b1=s_include/&nsim; p_ab1=s_above/&nsim; p_bb1=s_below/&nsim;
label p_b1="prob(include b1)"; label p_ab1=" prob(above b1)"; label p_bb1="prob(below b1)"; if isim=&nsim;
run;
proc means data=neg3;
by method;
output out=outs_neg3; run;
data outs_neg_b(drop=_stat_ estimate);
set outs_neg3(keep=method _stat_ estimate);
if _stat_="STD"; STD=estimate; run;
data outs_neg_c(drop=_stat_ estimate);
set outs_neg3(keep=method _stat_ estimate lowerwaldcl upperwaldcl);
if _stat_="MEAN"; MEAN=estimate; run;
data outs_neg_d(drop=isim STD lowerwaldcl upperwaldcl var MSE);
merge neg_a outs_neg_b outs_neg_c;
by method; bias=MEAN-(&b1); var=STD**2; MSE=var+bias**2; RMSE=sqrt(MSE); rel_bias=bias/(&b1); rel_RMSE=RMSE/(&b1); run;
proc nlmixed data=sim3.mydata;
ods output ParameterEstimates=parm_nlmix;
by isim;
title 'nlmixed';
parms b0=0 b1=0.031 r=1 c=0.45; bounds c > 0;