Grants and contributions:
Grant or award applying to more than one fiscal year. (2017-2018 to 2022-2023)
Big data is an important issue for modern data and statistical analysis. Computers can store huge amounts of data; however, methods to accurately and quickly analyze the data have not kept pace with improvements to modern storage technology. In some cases, data are discarded without being analyzed. Improved statistical analysis of big data will benefit any field dealing with massive amounts of data, such as biological sciences (e.g., genomics), finance and informatics, astronomy, cosmology, and climate science.
My proposed research will utilize a form of computer programming called Evolutionary Computation (EC). EC uses techniques copied from the biological theory of evolution by natural selection. In biology, the goal usually is to produce as many fit offspring as possible, who go on to produce their own fit offspring. Random mutations to the genome will make some children more fit, or less fit, than their parents. The fitter children are more likely to produce healthy offspring, so their genes get passed on. For my research, the measure of "fitness" used is how well the algorithm searches for optimum solutions with regard to clustering big data. (Clustering involves accounting for the underlying structure that links data points, so that they can be put into correct groups, or labelled correctly, e.g., linking gene expression to types of cancer.) Techniques such as cross-over and mutation are copied from biology, and are used to "evolve" the algorithm and make it fitter each time it runs.
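The selection, cross-over, and mutation steps described above can be illustrated with a minimal sketch. It is not the proposed method: it assumes one-dimensional data, Gaussian components, and a candidate solution encoded as a vector of hard cluster labels, with the classification log-likelihood standing in for "fitness". All function names and parameter values (`fitness`, `mutate`, `crossover`, `evolve`, the population size, the mutation rate) are hypothetical choices made for illustration.

```python
import math
import random


def fitness(labels, data, k):
    """Classification log-likelihood of a hard partition under 1-D Gaussians."""
    n = len(data)
    total = 0.0
    for g in range(k):
        members = [x for x, z in zip(data, labels) if z == g]
        if len(members) < 2:
            return float("-inf")  # reject partitions with a degenerate cluster
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        var = max(var, 1e-6)  # guard against zero variance
        weight = len(members) / n
        for x in members:
            total += (math.log(weight)
                      - 0.5 * math.log(2 * math.pi * var)
                      - (x - mean) ** 2 / (2 * var))
    return total


def mutate(labels, k, rate, rng):
    """Reassign each label to a random cluster with probability `rate`."""
    return [rng.randrange(k) if rng.random() < rate else z for z in labels]


def crossover(a, b, rng):
    """One-point cross-over: head of parent a, tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(data, k=2, pop_size=30, generations=200, rate=0.05, seed=0):
    """Evolve label vectors; the fitter half survives and breeds each generation."""
    rng = random.Random(seed)
    n = len(data)
    population = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda z: fitness(z, data, k), reverse=True)
        parents = population[: pop_size // 2]  # selection (parents are kept)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), k, rate, rng))
        population = parents + children
    return max(population, key=lambda z: fitness(z, data, k))
```

Because the surviving parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next. A call such as `evolve([0.1, -0.2, 0.3, 5.1, 4.9, 5.3], k=2)` returns one label per data point.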
Under the proposed research, evolutionary algorithms (EAs) will be developed as alternatives to the almost ubiquitous expectation-maximization (EM) algorithm and its variants for Gaussian and non-Gaussian mixture model-based approaches to clustering. EAs will be developed for the mixture of factor analyzers model, the mixture of variance-gamma distributions, and the mixture of variance-gamma factor analyzers model. Other short-term objectives include the development of a mixture of multiple scaled variance-gamma distributions. This will bring a high level of modelling flexibility while also guaranteeing cluster convexity: the resulting components are hypercuboids, so the rate of decay can differ in each dimension. The mixture of multiple scaled variance-gamma distributions model will be extended to the mixture of multiple scaled variance-gamma factor analyzers model, for application to high-dimensional data. EAs will then be developed for the mixture of multiple scaled variance-gamma distributions and mixture of multiple scaled variance-gamma factor analyzers models, and investigated as alternatives to alternating expectation-conditional maximization algorithms.
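For orientation, the general form of the models named above can be written down in standard notation. The symbols are assumptions introduced here, not taken from the proposal: G components, mixing proportions \pi_g, component densities f_g with parameters \theta_g, and, for the Gaussian mixture of factor analyzers, a p \times q loading matrix \Lambda_g with q < p and a diagonal noise covariance \Psi_g. The variance-gamma and multiple scaled variants replace the Gaussian component density and are not reproduced here.

```latex
% Finite mixture density with G components
f(\mathbf{x} \mid \boldsymbol{\vartheta})
  = \sum_{g=1}^{G} \pi_g \, f_g(\mathbf{x} \mid \boldsymbol{\theta}_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1.

% Gaussian mixture of factor analyzers: component covariance structure
\boldsymbol{\Sigma}_g
  = \boldsymbol{\Lambda}_g \boldsymbol{\Lambda}_g^{\top} + \boldsymbol{\Psi}_g .
```

The factor-analytic structure reduces the number of free covariance parameters per component when q is much smaller than p, which is why it is commonly used for high-dimensional data.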