Grants and contributions:
Grant or award applying to more than one fiscal year (2017-2018 to 2022-2023)
Technological advances in next-generation sequencing have enabled researchers to unveil the wide variability in microbial communities and their relationships with different diseases. It is therefore becoming critical to understand both the environmental and the host genetic factors that affect the composition of the microbiome. However, robust and powerful methods in this area are underdeveloped because of the complexity of microbiome sequencing data: (a) microbial taxa are usually grouped into operational taxonomic units (OTUs), whose counts are often highly skewed, over-dispersed, and zero-inflated; (b) OTU counts within a taxonomic hierarchical cluster are often highly correlated, but this multivariate nature is usually ignored; and (c) study designs often involve repeated measures taken from related family members, which induces temporal and familial correlations. In this proposal, I will develop powerful bioinformatics, statistical, and computational methods to overcome these challenges. Specifically, I propose to use latent variable (LV) methodology to jointly model multiple taxa from hierarchical taxonomic clusters within a longitudinal family study framework. The latent variables represent the underlying conceptual traits of the cluster and explain the correlations among different taxa. To address the over-dispersion and zero inflation of the taxa counts, I will apply both zero-inflated and hurdle models to the multivariate OTU outcomes.
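The difference between the zero-inflated and hurdle models mentioned above can be illustrated with a minimal sketch. It assumes a single taxon, a Poisson count component, and no covariates or latent variables; the proposal does not specify the count distribution, and a negative binomial component would be a natural choice for over-dispersed counts. The function names and the parameters `lam` (count rate) and `pi` (zero probability) are illustrative, not part of the proposal.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing count k with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a structural
    zero; otherwise it is drawn from Poisson(lam), which can also give zero."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

def hurdle_pmf(k, lam, pi):
    """Hurdle Poisson: zero occurs with probability pi; positive counts come
    from a Poisson(lam) truncated at zero, so all zeros have one source."""
    if k == 0:
        return pi
    return (1.0 - pi) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
```

In the zero-inflated form, zeros arise from two sources (structural absence and sampling), whereas the hurdle form treats every zero as a single process and models the positive counts separately.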
LV inference will be carried out within a Bayesian framework, with samples drawn from the posterior distribution using Markov chain Monte Carlo (MCMC) algorithms. A Bayesian model selection algorithm will be developed to choose the optimal model for a particular dataset. I will incorporate dimensionality reduction methods for the genetic factors so that genetic association signals can be identified from genome-wide data. I will also explore gene-gene (GxG) and gene-environment (GxE) interactions in the microbiome data. High-efficiency computational algorithms will be developed in C++, and the software will be implemented with a user-friendly interface and distributed to the microbiome research community. In addition, a standardized analytic pipeline for modeling and analyzing microbiome data will be constructed and tested by simulation. Sample size estimation and power analysis, based on both theoretical derivation and empirical results, will also be provided to support the design of future studies. This proposal will help standardize and optimize future research on modifiable environmental risk factors, as well as genetic factors, in microbiome sequencing studies. This research program will advance large-scale microbiome sequencing analytic technologies for the Canadian and international genetics and computational biology research communities.
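The MCMC posterior sampling described above can be illustrated in its simplest form by the following sketch. It is not the proposed LV model: it assumes a single Poisson rate with a Gamma(a, b) prior and uses a random-walk Metropolis sampler on the log scale. The function names, prior parameters, and step size are illustrative assumptions.

```python
import math
import random

def log_post(theta, y, a=1.0, b=1.0):
    """Unnormalised log posterior of theta = log(lambda) for Poisson counts y
    under a Gamma(a, b) prior on lambda, including the log-Jacobian term."""
    return (a + sum(y)) * theta - (b + len(y)) * math.exp(theta)

def metropolis(y, n_iter=20000, step=0.5, seed=0):
    """Random-walk Metropolis on theta; returns the draws of lambda."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_post(theta, y)
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_post(proposal, y)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(math.exp(theta))
    return draws
```

Because this toy model is conjugate, the sampler can be checked against the exact Gamma(a + sum(y), b + n) posterior, whose mean is (a + sum(y)) / (b + n). The multivariate LV models in the proposal have no such closed form, which is why sampling is needed there.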