Tianxi Cai - Harvard University

Starts: 4:00 pm on Thursday, December 2, 2010
Ends: 5:00 pm on Thursday, December 2, 2010
Location: MCS 149

TITLE: Adaptive Naive Bayes Kernel Machine Approach to Classification with GWAS Data

ABSTRACT: As genetic studies of human diseases progress, it is becoming increasingly evident that genetics often play a major and complex role in many types of diseases. The complexity of the genetic architecture of human health and disease makes it difficult to identify genomic markers associated with disease risk or to construct accurate genetic risk prediction models. Accurate risk assessment is further complicated by the availability of a large number of markers that may be predominantly unrelated to the outcome or may explain a relatively small amount of genetic variation. Standard prediction models often rely solely on additive or marginal relationships between the markers and the phenotype of interest. Marginal association-based analysis has limited power to identify markers truly associated with disease, resulting in a large number of false positives and false negatives. Simple additive modeling does not perform well when the underlying structure of association involves interactions and other non-linear effects. Additionally, these methods do not make use of information that may be available regarding genetic pathways or gene structure.

We propose a multi-stage method that relates possibly predictive markers to the risk of disease. In the first stage, we form multiple gene-sets based on certain biological criteria and, by imposing a naive Bayes kernel machine (KM) model, estimate gene-set-specific risk models that relate information from each gene-set to the outcome. In the second stage, we aggregate information across all gene-sets by adaptively estimating the weights for each gene-set via a regularization procedure. The KM framework efficiently models the potentially non-linear effects of predictors without specifying a particular functional form. Estimation and prediction accuracy are further improved with a kernel PCA approximation to reduce the degrees of freedom in the first stage and with adaptive regularization in the second stage to remove non-informative regions from the final prediction model. Prediction accuracy is assessed with bias-corrected ROC curves and AUC statistics. Numerical studies suggest that the model performs well in the presence of non-informative regions and both linear and non-linear effects.
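
For readers unfamiliar with the accuracy measures named in the abstract, the following is a minimal sketch of an empirical ROC curve and AUC computed from risk scores. It is an illustration only, not the speaker's implementation: the function names `empirical_auc` and `roc_points` and the toy scores are hypothetical, and the sketch computes the apparent (uncorrected) AUC. The abstract does not specify the bias-correction procedure, so none is shown here.

```python
def empirical_auc(scores, labels):
    """Fraction of (case, control) pairs in which the case has the higher
    score; tied pairs count one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    total = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                total += 1.0
            elif c == d:
                total += 0.5
    return total / (len(cases) * len(controls))


def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs obtained by
    classifying as positive every subject with score >= threshold,
    for each distinct score taken as the threshold."""
    n_cases = sum(1 for y in labels if y == 1)
    n_controls = sum(1 for y in labels if y == 0)
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((fp / n_controls, tp / n_cases))
    return points


# Toy data (hypothetical): three cases (label 1) and three controls (label 0).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(empirical_auc(scores, labels))
print(roc_points(scores, labels))
```

The area under the piecewise-linear curve through `roc_points` (trapezoidal rule) equals `empirical_auc` on the same data, which is why the pairwise-comparison form is often used as a shortcut.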