variables (or ‘exposures’) that reflect exposure to suspected risk factors associated with a disease (the outcome variable) are commonly measured with error. These errors can be either differential or non-differential, according to whether they depend on the values of other variables in the study, for instance the outcome variable [1,2]. As has been discussed by many authors [3-6], measurement error reduces the power to detect relationships between exposures and disease, and ignoring this error may bias the assessment of the association between health outcome and exposure variables. In particular, ordinary logistic regression can lead to biased estimates of odds ratios (ORs) when the covariates are subject to measurement error. Researchers have proposed non-Bayesian methods to correct for measurement error in exposures in individually matched case-control studies. For instance, Guolo et al. used conditional likelihood methods to correct for measurement error in a single continuous exposure using simulated data. These authors compared the performance of the likelihood methods with two other correction techniques (regression calibration and simulation-extrapolation (SIMEX)), observing that the likelihood approach outperforms the alternative methods when a single continuous exposure is measured with error. McShane et al. proposed a conditional scores procedure to correct for measurement error in some components of one or more continuous covariates. In that study, the authors treated the true covariates as fixed unknown parameters, which were removed from the likelihood by conditioning on a sufficient statistic and estimated together with the unknown parameters. However, the conditional scores procedure experienced convergence problems in the presence of large relative risks or when large measurement errors were considered. Also, conditional scores procedures typically do not generalize well when data structures are changed even slightly.
In addition, Liu et al., Prescott and Garthwaite, and Rice proposed Bayesian adjustments for misclassification of a binary exposure variable. Nevertheless, to our knowledge, very little attention has been given to measurement error in multiple continuous exposures in matched case-control studies, except by McShane et al., whose procedure may be numerically challenging and is quite dependent on the settings of the problem.
high interpretability of model parameters and ease of use, but the use of linear combinations of variables is not suitable for modeling the highly nonlinear, complex interactions seen in biologic and epidemiologic systems. ANN, with its resemblance to the human brain, is appealing because it offers flexible nonlinear systems that show robust performance in dealing with noisy, incomplete or missing data and have the ability to generalize. ANNs may be better at predicting outcomes when the relationships between the variables are multidimensional, as found in complex biological systems. The ANN model allows inclusion of a large number of variables, and there are not many assumptions (such as normality) that need to be verified. However, the comparative performance of these two methods has been widely reported with great controversy in the literature. In a review of 28 major studies carried out by Sargent, the performance was superior for ANN in 10 studies (36%), was superior for logistic regression in 4 studies (14%), and was similar in the remaining 14 studies. In another review of 72 papers conducted by Dreiseitl and Ohno-Machado, with statistical tests, both models performed similarly in 42%, ANN was better in 18%, and logistic regression was better in 1%. By contrast, without statistical tests, ANN was better in 33% and logistic regression better in 6%. The authors also surveyed the quality of the methodology and found a shortage of reporting of ANN model building details in 49%, a lack of statistical testing in 39%, and a lack of calibration information in 75%. ANN is theoretically more flexible than logistic regression because of its multi-layer networks, but on the other hand, it is threatened by over-fitting and instability. In particular, there are still no established methods for constructing ANN models, which may explain the wide variation in the comparative results.
Conditional logistic regression was used to compare the risk of breast cancer-specific death by aspirin exposure (defined by number of tablets or prescriptions), calculating odds ratios (ORs) and 95% confidence intervals (95% CIs). Adjusted analyses included potential confounders: stage, chemotherapy, radiotherapy, exposure to tamoxifen or aromatase inhibitors, comorbidities (including myocardial infarction, cerebrovascular disease, congestive heart disease, chronic pulmonary disease, peripheral vascular disease, peptic ulcer disease and diabetes) and pre-diagnosis smoking status. Similar analyses were conducted for all-cause mortality (where up to three controls were matched to each case). Various additional sensitivity analyses were conducted. Analyses were conducted excluding prescriptions in the first 12 months after cancer diagnosis, as these may be influenced by cancer treatment. Analyses were conducted stratifying by aspirin exposure in the year prior to diagnosis, and a separate analysis was conducted investigating aspirin prescriptions in the year prior to diagnosis. Analyses were also conducted by stage, and adjusting for stage when restricted to registries with higher rates of available stage information. An analysis was also conducted in patients receiving prescriptions for hormone therapy (a proxy for oestrogen receptor positivity). All stratified analyses were conducted after rematching cases to controls within the strata of interest. In another analysis, breast cancer-specific death was based upon a breast cancer ICD code recorded as any cause of death in ONS data, rather than just the underlying cause of death as used in the main analysis.
Sensitivity analyses were also conducted analysing the entire breast cancer cohort, prior to conversion to case-control data, and applying survival analysis to investigate aspirin as a time varying covariate (in which an individual was a non-user until first use and then remained a user until the end of follow-up, applying a six-month lag to mimic the case-control analysis). A separate analysis was conducted using this time varying covariate approach and also accounting for competing risk of deaths from other causes using the proportional subhazards model.
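As an illustration of the conditional logistic regression used for the matched analyses above, the following is a minimal sketch, assuming a single covariate and matched sets of one case plus one or more controls. Each set contributes exp(βx_case) / Σ_j exp(βx_j) to the conditional likelihood, and β is found by one-dimensional Newton steps. The function name `fit_conditional_logit` and the toy pairs are hypothetical and are not taken from the study; a real analysis would use a statistical package and several covariates.

```python
import math

def fit_conditional_logit(matched_sets, n_iter=25):
    """Estimate one log odds ratio from matched sets.

    matched_sets: list of (case_x, [control_x, ...]) tuples (hypothetical layout).
    """
    beta = 0.0
    for _ in range(n_iter):
        grad = 0.0
        hess = 0.0
        for case_x, control_xs in matched_sets:
            xs = [case_x] + list(control_xs)
            weights = [math.exp(beta * x) for x in xs]
            total = sum(weights)
            mean = sum(w * x for w, x in zip(weights, xs)) / total
            mean_sq = sum(w * x * x for w, x in zip(weights, xs)) / total
            grad += case_x - mean             # score contribution of this set
            hess -= mean_sq - mean * mean     # minus the within-set variance
        if hess == 0.0:                       # no informative (discordant) sets
            break
        beta -= grad / hess                   # Newton step
    return beta

# Toy 1:1 matched pairs with a binary exposure (1 = exposed), invented data.
pairs = (
    [(1, [0])] * 6    # case exposed, control unexposed
    + [(0, [1])] * 3  # case unexposed, control exposed
    + [(1, [1])] * 4  # concordant pairs carry no information
    + [(0, [0])] * 5
)
beta_hat = fit_conditional_logit(pairs)
odds_ratio = math.exp(beta_hat)
print(round(odds_ratio, 4))
```

For 1:1 matching with a binary exposure, the conditional estimate reduces to the ratio of discordant pairs (here 6/3), which gives a quick sanity check on the sketch.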
Methods: Eight hundred subjects (400 cases with colorectal cancer and 400 controls) were enrolled in this study. Cases were primarily colorectal cancer patients diagnosed by histopathology at the Department of Intestinal Surgery, Sichuan Cancer Hospital from July 2010 to May 2012. Controls were people receiving routine medical examinations from the Zhonghe Community Health Service Center during the same period of time. An in-person interview was used to collect demographic characteristics, lifestyle, and dietary habits of the subjects in reference to the 10 years prior to disease diagnosis. Conditional logistic regression was conducted to examine the possible association between the risk of colorectal cancer and chili pepper consumption.
We caution against the use of standard conditioning approaches (LogR+Cov) in case-control ascertained studies, which can increase or decrease power as a function of covariate effect size and disease prevalence [8,10,26]. The relationship between modeling disease on the liability threshold and dichotomous scale has been examined by  as well as [24,25] in the context of computing the area under the receiver operator curve (AUC), estimating risks, and the distribution of disease in a population. A recent study of Clayton has examined the use of covariates in case-control ascertained association studies and shown that a reweighting method (such as ours) can increase power. This paper discusses the issue of power loss from conditioning in logistic regression and states, “the loss of power resulting from the use of stratified tests can be avoided by matching in the design of case-control studies”. We have shown that by including information from external epidemiological information, it is possible to not only avoid a power loss, but to achieve substantial power gain in matched case-control-covariate studies. The paper also states, “the strategy of ignoring other known disease susceptibility loci and risk factors when testing for new associations with complex disease, for example in genome-wide association studies, is justifiable, but only when effects combine additively on the logistic scale.” While
Background: Data confidentiality and shared use of research data are two desirable but sometimes conflicting goals in research with multi-center studies and distributed data. While ideal for straightforward analysis, confidentiality restrictions forbid creation of a single dataset that includes covariate information of all participants. Current approaches such as aggregate data sharing, distributed regression, meta-analysis and score-based methods can have important limitations. Methods: We propose a novel application of an existing epidemiologic tool, specimen pooling, to enable confidentiality-preserving analysis of data arising from a matched case-control, multi-center design. Instead of pooling specimens prior to assay, we apply the methodology to virtually pool (aggregate) covariates within nodes. Such virtual pooling retains most of the information used in an analysis with individual data, and since individual participant data is not shared externally, within-node virtual pooling preserves data confidentiality. We show that aggregated covariate levels can be used in a conditional logistic regression model to estimate individual-level odds ratios of interest.
We used logistic regression as our second prediction model, since logistic regression can build a multi-class model with linear weights, and we can compare these weights to the feature weights obtained from the linear regression model. To apply logistic regression, we needed to reformulate the regression problem as a classification problem. We did so by splitting the range of the target variable into a finite number of buckets of equal size. To determine our classes, we used a histogram of movie revenues to create the buckets, each of which is a continuous range of movie revenues. We created the buckets such that they cover the entire sample space.
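The bucketing step described above can be sketched as follows. This is a minimal illustration, assuming equal-width buckets over the observed range; the helper names `make_bucket_edges` and `to_class`, the choice of four buckets, and the revenue figures are all hypothetical and do not come from the original data.

```python
def make_bucket_edges(values, n_buckets):
    """Return n_buckets + 1 equally spaced edges spanning the observed range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("target has no spread; cannot form buckets")
    width = (hi - lo) / n_buckets
    edges = [lo + i * width for i in range(n_buckets)]
    edges.append(hi)  # close the last bucket exactly at the maximum
    return edges

def to_class(value, edges):
    """Map a continuous target value to the index of its bucket."""
    n_buckets = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    if value <= lo:
        return 0
    if value >= hi:
        return n_buckets - 1  # the maximum belongs to the top bucket
    width = (hi - lo) / n_buckets
    return min(int((value - lo) / width), n_buckets - 1)

# Invented revenues (in millions), used only to show the mapping.
revenues = [1.0, 4.0, 12.0, 25.0, 49.0, 50.0, 73.0, 101.0]
edges = make_bucket_edges(revenues, 4)
labels = [to_class(r, edges) for r in revenues]
print(edges)
print(labels)
```

With skewed targets such as revenue, equal-width buckets tend to leave most observations in the lowest class, which is one reason to inspect the histogram before fixing the number of buckets.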
DCS, as a formal information system, is based on the correction of deviations to compare the results obtained with specific performance standards (Atkinson, Kaplan and Young, 2004: 321; Simons, 1994: 170). This information system has five main characteristics (Green and Welsh, 1988). First, there are measurements that enable the quantification of an underlying phenomenon, activity or system. Second, there are standards of performance or targets to be met. Third, there is a feedback process that enables comparison of the outcome of the activities with the standard. The analysis of variances arising from this feedback is the fourth aspect of cybernetic control systems. The fifth characteristic is the ability to modify the system’s behaviour or underlying activities. In short, diagnostic control systems are used on an exceptional basis to monitor and reward the achievement of specified goals through the review of critical performance variables or key success factors (Bisbe and Otley, 2004: 711).
the logistic regression model under this sampling scheme, it is implicitly assumed that this model is appropriate for the data at hand. To test this assumption, a number of goodness-of-fit tests have been proposed. This prospective sampling situation is the setting in which the typical goodness-of-fit procedures are constructed and distribution theory discussed. Hosmer et al. (1997) compare the properties of some of these tests via a simulation study.
Abstract: It is tempting to assume that confounding bias is eliminated by choosing controls that are identical to the cases on the matched confounder(s). We used causal diagrams to explain why such matching not only fails to remove confounding bias, but also adds colliding bias, and why both types of bias are removed by conditioning on the matched confounder(s). As in some publications, we trace the logic of matching to a possible tradeoff between effort and variance, not between effort and bias. Lastly, we explain why the analysis of a matched case-control study, regardless of the method of matching, is not conceptually different from that of an unmatched study.
Univariate regression analyses identified older age, longer symptom duration, male sex, and level of ASDAS and CRP as being significantly associated with a higher rate of radiographic progression (table 5). HLA-B27, smoking, treatment and SIJ inflammation on MRI were not significant factors. Among baseline imaging parameters, higher mSASSS, higher SSS scores for fat metaplasia, backfill and ankylosis, and lower SSS score for

Table 3 Clinical, laboratory and imaging parameters at baseline in FORCAST patients matched for age, symptom duration and follow-up duration according to the presence (damage) or absence (lack of damage) of any syndesmophytes or ankylosis on cervical and lumbar spine radiographs*
The characteristics of the incident breast cancer subjects and matched controls were compared using a t-test for age (continuous variable) and the chi-square test for other categorical variables. All P-values were obtained from two-tailed tests. The effects of breast density according to BI-RADS classification on breast cancer risk were analyzed using conditional logistic regression for matched analysis (whole analysis) or logistic regression for unmatched analysis (subgroup analysis according to age group, screening results, assessment category, time interval after non-recall screening, and menopausal status), adjusting for the covariates mentioned above. Missing values were handled with dummy (indicator) variables. The Cochran–Armitage test was applied to test for trends in associations between increased breast density and breast cancer risk. We estimated adjusted odds ratios (aORs) and 95% confidence intervals (CIs) and stratified them according to screening results category: recall (BI-RADS categories 0, 4, and 5) and non-recall (BI-RADS categories 1 and 2). To avoid masking effects, recalled cases were divided by each BI-RADS assessment category, and non-recalled cases were
A variety of approaches have been employed in the literature (see Ahmed and Karmakar, 1993; Salinger and Griffiths, 2001) to predict rainfall. Most of the methods were based on linear models and the findings were inconclusive. Moreover, they lack diagnostic checking, which has become an essential part of data analysis. In this paper we employ the logistic regression technique to predict rainfall. This method is now commonly employed in many fields, including biomedical research, business and finance, criminology, ecology, engineering, health policy, linguistics and wildlife biology. Logistic regression is useful for situations in which we want to predict the presence or absence of an outcome (e.g., rainfall) based on the values of a set of predictor variables. In our study we have used a climatic data set from Bihar, India, that has been extensively analyzed by many others (Molla et al., 2006). Although climatic data are usually subject to gross measurement error, this issue receives little attention in the literature. Before fitting the logistic regression model, we use some recently developed data screening methods, such as brushing and clustering, to identify spurious observations. After fitting the model, we employ some recently developed logistic regression diagnostics, such as Generalized Standardized Pearson Residuals (Imon and Hadi, 2008), to identify outliers. Then we apply cross validation, a very popular and useful technique (Montgomery et al. 2006; Rao, 2005), to validate the fitted model for future data. We measure the probability of misclassification error, since the response in our study is a class variable. We also use Cohen’s Kappa statistic (Cohen, 1960) to assess the concordance between observed rainfall and the predictions of the logistic regression model.
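The two validation measures named above, the misclassification rate and Cohen's Kappa, can be computed from a 2x2 table of observed against predicted outcomes. The following is a minimal sketch for a binary response; the function names and the ten observed/predicted values are hypothetical and are not drawn from the Bihar data.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fn, fp, tn) for binary labels coded 0/1."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def misclassification_rate(actual, predicted):
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    return (fn + fp) / (tp + fn + fp + tn)

def cohens_kappa(actual, predicted):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    n = tp + fn + fp + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the row and column marginal totals.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Invented example: 1 = rain, 0 = no rain.
observed_rain = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_rain = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
error_rate = misclassification_rate(observed_rain, predicted_rain)
kappa = cohens_kappa(observed_rain, predicted_rain)
print(error_rate, round(kappa, 4))
```

Kappa discounts the agreement expected from the marginal frequencies alone, so it can be much lower than the raw accuracy when one outcome dominates.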
This study aims to illustrate an application of the multinomial logistic regression model, which is one of the important methods for categorical data analysis. This model handles a single response variable with more than two categories, whether nominal or ordinal. It has been applied in data analysis in many areas, for example health, social, behavioral, and educational research. To illustrate the model in a practical way, we used real data on physical violence against children from the Youth Survey 2003, which was conducted by the Palestinian Central Bureau of Statistics (PCBS). A segment of the population of children aged 10-14 years resident in Gaza governorate, of size 66,935, was selected, and the response variable consisted of four categories. Eighteen explanatory variables were used to build the primary multinomial logistic regression model. The model was tested through a set of statistical tests to ensure its appropriateness for the data. The model was also tested by randomly selecting two observations from the data and predicting the group into which each would be classified, given the values of its explanatory variables. We concluded that, by using the multinomial logistic regression model, we are able to define accurately the relationship between the group of explanatory variables and the response variable, identify the effect of each of the variables, and predict the classification of any individual case.
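To make the prediction step concrete, the sketch below shows how a fitted multinomial logistic model with a reference category turns explanatory variables into category probabilities. It is an illustration under stated assumptions: four response categories as in the text, but only one explanatory variable plus an intercept, and coefficient values that are invented rather than estimated from the PCBS data. The function name `category_probabilities` is hypothetical.

```python
import math

def category_probabilities(x, coefficients):
    """Return probabilities for K categories, reference category first.

    x: predictor values with a leading 1.0 for the intercept.
    coefficients: K - 1 coefficient lists, one per non-reference category.
    """
    # The reference category has linear score 0 by construction.
    scores = [0.0] + [sum(b * v for b, v in zip(beta, x)) for beta in coefficients]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Invented coefficients for categories 1..3 versus reference category 0.
coefficients = [[0.5, -0.2], [-1.0, 0.3], [0.0, 0.4]]
individual = [1.0, 2.0]  # intercept term and one explanatory variable
probabilities = category_probabilities(individual, coefficients)
predicted_category = probabilities.index(max(probabilities))
print([round(p, 3) for p in probabilities], predicted_category)
```

The predicted classification for an individual is simply the category with the largest probability; the exponentiated coefficients are the relative odds of each category against the reference.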
The downside of Newton’s method is that exact evaluation of the Hessian and its inverse is quite expensive in computational terms. In addition, the goal is to estimate the parameters of the logistic regression model in a privacy-preserving manner using homomorphic encryption, which will further increase the computational challenges. Therefore, we will adapt the method in order to make it possible to compute it efficiently in the encrypted domain. The first step in the simplification process is to approximate the Hessian matrix with a fixed matrix instead of updating it every iteration. This technique is called the fixed Hessian Newton method. Böhning and Lindsay investigate the convergence of the Newton-Raphson method and show it converges if the Hessian H(β) is replaced by a fixed symmetric negative definite matrix B (independent of β) such that H(β) ≥ B for all feasible parameter values β, where “≥” denotes the Loewner ordering. The Loewner ordering is defined for symmetric matrices A, B and denoted as A ≥ B iff their difference A − B is non-negative definite. Given such B, the Newton-Raphson iteration simplifies to
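The fixed Hessian idea can be illustrated on plaintext data with a minimal sketch. It assumes the standard lower bound for logistic regression, B = −(1/4)XᵀX, which satisfies H(β) ≥ B because each weight p(1 − p) in the Hessian is at most 1/4; the Newton step with H replaced by B is then β ← β − B⁻¹∇ℓ(β). The helper names, the tiny data set and the iteration count are hypothetical, and nothing here is encrypted or taken from the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gradient(X, y, beta):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = [sigmoid(sum(b * v for b, v in zip(beta, row))) for row in X]
    return [sum(row[j] * (yi - pi) for row, yi, pi in zip(X, y, p))
            for j in range(len(beta))]

def fixed_hessian_fit(X, y, n_iter=200):
    k = len(X[0])
    # -B = (1/4) X^T X is computed once and reused in every iteration.
    neg_B = [[0.25 * sum(row[i] * row[j] for row in X) for j in range(k)]
             for i in range(k)]
    beta = [0.0] * k
    for _ in range(n_iter):
        step = solve(neg_B, gradient(X, y, beta))  # (-B)^{-1} gradient
        beta = [b + s for b, s in zip(beta, step)]  # beta - B^{-1} gradient
    return beta

# Invented, non-separable data: intercept column plus one feature.
xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
X = [[1.0, x] for x in xs]
beta_hat = fixed_hessian_fit(X, y)
print([round(b, 4) for b in beta_hat])
```

Because the matrix is fixed, its inverse (or factorisation) needs to be obtained only once; the price is linear rather than quadratic convergence, so more iterations are needed than with the exact Hessian.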
tested later on with the selected set of covariates one at a time. This is a possible limitation because any of these variables that would be significant when put in the model jointly will be missed. However, being significant jointly may indicate multicollinearity, in which case the analyst may choose to use only one of those as a proxy or not at all. Also if there is some multicollinearity between significant variables they would likely be retained by all selection procedures as a result of their significant effect. Second, if two non-significant covariates confound each other, they are going to be retained as confounders since all covariates are assumed to be equally important. In a situation where that happens, the analyst should probably consider retaining the two covariates if they are significant at the 0.25 level, indicating some reasonable association with the outcome. Otherwise, the analyst should probably exclude both from the model as meaningless confounders. Additionally, if there is some multicollinearity between non-significant variables, they would likely be retained by PS as a result of confounding effect on each other, and missed by other three selection procedures as a result of their non-significant effect. Third, this algorithm was not designed to force all dummy variables in the model (for instance, one that has three nominal levels which corresponds to two dummy variables that need to be considered as a unit in model inclusion), if one is significant. Other selection procedures have this limitation as well, unless you force dummy variables in the model. However, it is not possible to know a priori whether one of the dummy variables will be significant. If one of the dummy variables is retained as significant, the analyst can manually insert the rest of them in the model.
Finally, multi-class problems were not explored in this paper; therefore, the results do not support the robustness of PS over a range of model selection applications and problems.
Abstract. Logistic regression is a popular technique used in machine learning to construct classification models. Since the construction of such models is based on computing with large datasets, it is an appealing idea to outsource this computation to a cloud service. The privacy-sensitive nature of the input data requires appropriate privacy preserving measures before outsourcing it. Homomorphic encryption enables one to compute on encrypted data directly, without decryption and can be used to mitigate the privacy concerns raised by using a cloud service. In this paper, we propose an algorithm (and its implementation) to train a logistic regression model on a homomorphically encrypted dataset. The core of our algorithm consists of a new iterative method that can be seen as a simplified form of the fixed Hessian method, but with a much lower multiplicative complexity. We test the new method on two interesting real life applications: the first application is in medicine and constructs a model to predict the probability for a patient to have cancer, given genomic data as input; the second application is in finance and the model predicts the probability of a credit card transaction to be fraudulent. The method produces accurate results for both applications, comparable to running standard algorithms on plaintext data. This article introduces a new simple iterative algorithm to train a logistic regression model that is tailored to be applied on a homomorphically encrypted dataset. This algorithm can be used as a privacy-preserving technique to build a binary classification model and can be applied in a wide range of problems that can be modelled with logistic regression. Our implementation results show that our method can handle the large datasets used in logistic regression training.
One of the most significant threats to a national economy is the bankruptcy of its banks. Estimation of bankruptcy provides invaluable information on which governments, investors and shareholders can base their financial decisions in order to prevent possible losses. In this paper, a model was developed using stepwise logistic regression with financial ratios to make bankruptcy predictions. Descriptive statistics, correlations and independent t-tests are used to examine the characteristics of each variable for both failed and non-failed banks. Samples were developed by using financial ratios from 16 nationalised banks in India. The results of the empirical study reveal that the model based on financial ratios from one year prior is better than the model based on ratios from two years prior for the purpose of prediction. The results of the statistical tests indicate that owners’ fund as a percentage of total source, long term debt/equity and quick ratio are significant in predicting bankruptcy. The Nagelkerke R² indicated that the model explains 84.4% of the variation in the outcome variable. The predictive accuracy of the model with owners’ fund as a percentage of total source is 87.5% at the 95% confidence level.
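For readers unfamiliar with the Nagelkerke R² quoted above, the following minimal sketch shows how it is obtained from the log-likelihoods of the null (intercept-only) and fitted models: the Cox and Snell R² is rescaled by its maximum attainable value so that the result can reach 1. The function names and the log-likelihood values are hypothetical; only the sample size of 16 echoes the text, and the output is not the paper's 84.4%.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox and Snell R^2 = 1 - exp(2 * (ll_null - ll_model) / n)."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R^2 divided by its upper bound 1 - exp(2 * ll_null / n)."""
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

# Invented values: 16 banks with an even failed/non-failed split give a
# null log-likelihood of 16 * log(0.5); the fitted value is made up.
n_banks = 16
ll_null = n_banks * math.log(0.5)
ll_model = -3.0
r2 = nagelkerke_r2(ll_null, ll_model, n_banks)
print(round(r2, 3))
```

A value near 1 means the fitted log-likelihood is close to that of a model that predicts every outcome perfectly; with very small samples such as this one, high values should be read with caution.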