3.5 Application to yeast nucleosome array data
3.6.2 Cross comparison under different models
Next, we simulated data under one specific model and estimated the parameters and states under each of the competing models, in order to judge whether, in each case, the correct model was the one that fitted the generated data most accurately. For this set of simulations, the variances were fixed to 0.48 and 0.64 for the nucleosomal and NFR states respectively. The base model transition parameters were -3 and -3. For both the transition and emission models, M1 and M2, the parameter corresponding to the first principal component covariate in the NFR state was fixed to 1, while the other parameters were set to 0.
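The simulation setup described above can be illustrated with a minimal sketch. It is not the code used for the reported results, and several pieces are assumptions made only for illustration: a discrete-time chain, a logistic link between the covariate and the probability of leaving a state, the emission means (1.0 and -1.0), and a standard normal stand-in for the first principal component covariate. Only the variances (0.48, 0.64), the base transition parameters (-3, -3) and the covariate coefficient of 1 in the NFR state come from the text.

```python
import math
import random

def sigmoid(z):
    """Logistic link (an assumption; the text does not state the link)."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_hmm(x, trans_base=(-3.0, -3.0), trans_cov=(0.0, 0.0),
                 emis_mean=(1.0, -1.0), emis_cov=(0.0, 0.0),
                 variances=(0.48, 0.64), seed=0):
    """Simulate a two-state Gaussian HMM along a covariate sequence x.

    State 0 plays the role of the nucleosomal state, state 1 the NFR state.
    emis_mean holds hypothetical values; they are not given in the text.
    """
    rng = random.Random(seed)
    state = 0
    states, obs = [], []
    for x_t in x:
        # Emission: Gaussian whose mean may depend on the covariate.
        mean = emis_mean[state] + emis_cov[state] * x_t
        obs.append(rng.gauss(mean, math.sqrt(variances[state])))
        states.append(state)
        # Transition: probability of leaving the current state.
        p_leave = sigmoid(trans_base[state] + trans_cov[state] * x_t)
        if rng.random() < p_leave:
            state = 1 - state
    return states, obs

# Stand-in covariate of length 5000 (hypothetical, not the real sequence features).
cov_rng = random.Random(1)
x = [cov_rng.gauss(0.0, 1.0) for _ in range(5000)]

# Covariate acting on the transitions of the NFR state only.
states_tr, obs_tr = simulate_hmm(x, trans_cov=(0.0, 1.0))
# Covariate acting on the emissions of the NFR state only.
states_em, obs_em = simulate_hmm(x, emis_cov=(0.0, 1.0))
```

Setting both `trans_cov` and `emis_cov` to zero gives a covariate-free baseline under the same assumptions.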
As expected, when the simulation and estimation models matched, the method performed very accurately overall, with the maximum MSE of the emission distribution parameters µ and σ being less than 0.01, and for λ and µ, less than 0.05. The classification rates are highest for the estimation models that match the simulation model. This pattern is also seen in the MSEs of the estimated parameters (more details in the Supplementary materials).
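For concreteness, the two accuracy measures used here can be sketched as follows. The function names and the numbers in the example are hypothetical and are not taken from the reported results; the sketch also assumes that the estimated state labels are already aligned with the true ones.

```python
def mean_squared_error(estimates, truth):
    """MSE of repeated estimates of one parameter around its true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def classification_rate(true_states, est_states):
    """Fraction of positions whose estimated state equals the true state."""
    if len(true_states) != len(est_states):
        raise ValueError("state sequences must have the same length")
    hits = sum(t == e for t, e in zip(true_states, est_states))
    return hits / len(true_states)

# Illustrative values only: two estimates of a variance whose true value is 0.48.
mse_example = mean_squared_error([0.50, 0.46], 0.48)
# Illustrative values only: three of four positions are classified correctly.
rate_example = classification_rate([0, 0, 1, 1], [0, 1, 1, 1])
```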
We also computed the BIC for each estimation method under the condition that the data are simulated from one of these models (Table ??). We used 5 data sets for computing the BIC under each simulation model, and took the proportion of times BIC selected a model as a measure of model fit, and hence of the performance of an estimation method under that model. In general, we see that using the simulation model for estimation tends to give the best results. Also, given a simulation model, say MA, if we estimate it with a model that contains it, say MB, we get a very similar value of the maximal log likelihood. Strictly speaking, it is erroneous to state that the same maximal log likelihood is achieved; however, in the ideal estimation scenario, the extra components in the bigger model would turn out to be 0, and we would get the same likelihood. This is not expected to hold in all data sets. In our case, however, the log likelihood values are very close (the error margin is less than 0.01), and the expected log likelihoods, obtained from the averages over the MCMC iterations, are equal.
This is so because of the large size of the data set. In another of our working papers, titled 'Asymptotics of continuous time Hidden Markov models', we discuss how, as in the iid case, consistency results can be obtained for Bayesian estimates in hidden Markov models, as an extension of previous Bayesian consistency approaches. This implies that, since our data set is sufficiently large, the errors in estimation are smoothed out to a considerable degree, the estimated parameters are close to the true simulation parameters with very high probability (a fact reflected in the table of mean squared errors and misclassification rates), and the estimated log likelihood is very close to the true log likelihood of the given model.
Five data sets were simulated under each model. The variances of the parameter estimates over the simulation sets were in the range of 0.01 to 0.04. The MSEs for each data set under a model were averaged to give an estimate of the estimation error of the parameters under that model. The state classification percentages were similarly averaged over the five data sets. For all five data sets, and for each simulation setting, BIC chose as the best estimation model the model under which the data sets were simulated. Thus the proportion of times BIC selected the simulation model was always 100 percent. See Table 3.4.
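The BIC-based selection proportion can be sketched as below, using the standard definition BIC = -2 log L + k log n, where k is the number of free parameters and n the number of observations. The model names, log likelihood values and parameter counts in the example are hypothetical and are not the values behind Table 3.4. The example also shows why a larger model that reaches almost the same maximal log likelihood as a nested smaller one is still not selected: the penalty term decides.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better fit."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

def select_by_bic(fits, n_obs):
    """fits maps a model name to (maximal log likelihood, number of parameters)."""
    return min(fits, key=lambda name: bic(*fits[name], n_obs))

def selection_proportions(per_dataset_fits, n_obs):
    """Proportion of data sets in which each model attains the lowest BIC."""
    counts = {}
    for fits in per_dataset_fits:
        winner = select_by_bic(fits, n_obs)
        counts[winner] = counts.get(winner, 0) + 1
    return {name: c / len(per_dataset_fits) for name, c in counts.items()}

# Hypothetical fits on five data sets: the larger model gains less than 0.01
# in log likelihood but carries two extra parameters.
example_fits = [{"small": (-7000.000, 6), "large": (-6999.995, 8)}] * 5
proportions = selection_proportions(example_fits, n_obs=5000)
```

In this illustration `proportions` assigns the whole share to the smaller model, mirroring the situation where BIC always recovers the simulating model.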
both the transition and emission functions dependent on the covariates could not be tested. However, we generated a total of 5 data sets from this model, and ran the other three estimation models on these simulated data to see how they perform under this setup. For this set of simulations too, the variances were fixed to 0.48 and 0.64 for the nucleosomal and NFR states respectively. The base model transition parameters were -3 and -3. For both the transition and emission parameters, the parameter corresponding to the first principal component covariate in the NFR state was fixed to 1, while the other parameters were set to 0. Each simulated data set had length 5000. We used the sequence features of the first 5000 observations of the Hogan data set as the covariates. The MSEs and the BICs were averaged over the five data sets to give a measure of the performance of the estimation models. We report the classification errors and the BICs in Table 3.5. In 2 of the five data sets, the transition model achieved the lowest BIC and MSE, while in the remaining 3 the emission model ranked first. In all five data sets, the base model M0 ranked last.