Grants and Contributions:

Title:

Parameter Estimation for Non-Gaussian Model-Based Clustering with High-Dimensional Data

Agreement number:

RGPIN

Agreement value:

$70,000.00

Agreement date:

May 10, 2017 -

Organization:

Natural Sciences and Engineering Research Council of Canada

Location:

Ontario, Other, CA

Reference number:

GC-2017-Q1-02542

Agreement type:

Grant

Report type:

Grants and Contributions

Additional information:

Grant or award applying to more than one fiscal year. (2017-2018 to 2022-2023)

Recipient's legal name:

McNicholas, Sharon (McMaster University)

Program:

Discovery Grants Program - Individual

Program purpose:

Big data is an important issue for modern data and statistical analysis. Computers can store huge amounts of data; however, methods to accurately and quickly analyze the data have not kept pace with improvements to modern storage technology. In some cases, data are discarded without being analyzed. Improved statistical analysis of big data will benefit any field dealing with massive amounts of data, such as biological sciences (e.g., genomics), finance and informatics, astronomy, cosmology, and climate science.

My proposed research will utilize a form of computer programming called Evolutionary Computation (EC). EC uses techniques copied from the biological theory of evolution by natural selection. In biology, the goal usually is to produce as many fit offspring as possible, who go on to produce their own fit offspring. Random mutations to the genome will make some children more fit, or less fit, than their parents. The fitter children are more likely to produce healthy offspring, so their genes get passed on. For my research, the measure of "fitness" used is how well the algorithm searches for optimum solutions with regard to clustering big data. (Clustering involves accounting for the underlying structure that links data points, so that they can be put into correct groups, or labelled correctly, e.g., linking gene expression to types of cancer.) Techniques such as cross-over and mutation are copied from biology, and are used to "evolve" the algorithm and make it fitter each time it runs.
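
To make the select/cross-over/mutate loop described above concrete, here is a minimal sketch in Python. It is not the method proposed in this grant: it assumes one-dimensional data, Gaussian components, a population of hard cluster-label vectors, and elitist selection that keeps the fitter half of each generation. All function names and parameters (`fitness`, `mutate`, `crossover`, `evolve`, `rate`, `pop_size`) are hypothetical and chosen only for illustration.

```python
import math
import random


def fitness(data, labels, k):
    """Gaussian mixture log-likelihood, with parameters estimated from hard labels."""
    n = len(data)
    params = []
    for g in range(k):
        members = [x for x, z in zip(data, labels) if z == g]
        if len(members) < 2:
            return float("-inf")  # reject degenerate clusterings
        mean = sum(members) / len(members)
        # Small constant keeps the variance strictly positive (an assumption).
        var = sum((x - mean) ** 2 for x in members) / len(members) + 1e-6
        params.append((len(members) / n, mean, var))
    total = 0.0
    for x in data:
        density = sum(
            p * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
            for p, m, v in params
        )
        total += math.log(density + 1e-300)  # guard against log(0)
    return total


def mutate(labels, k, rate, rng):
    """Reassign each label to a random cluster with probability `rate`."""
    return [rng.randrange(k) if rng.random() < rate else z for z in labels]


def crossover(a, b, rng):
    """Uniform cross-over: each label is copied from one of the two parents."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def evolve(data, k=2, pop_size=20, generations=200, rate=0.05, seed=0):
    """Evolve label vectors; return the fittest one and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randrange(k) for _ in data] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda z: fitness(data, z, k), reverse=True)
        history.append(fitness(data, population[0], k))
        parents = population[: pop_size // 2]  # the fitter half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), k, rate, rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=lambda z: fitness(data, z, k))
    return best, history
```

Calling `evolve(data, k=2)` on a list of floats returns the fittest label vector found and the best fitness seen in each generation. Because the fitter half is carried over unchanged, that best fitness never decreases. The sketch ignores label switching between parents during cross-over, which a serious implementation would need to address.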

Under the proposed research, evolutionary algorithms (EAs) will be developed, as alternatives to the almost ubiquitous expectation-maximization (EM) algorithm and its variants, for Gaussian and non-Gaussian mixture model-based approaches to clustering. EAs will be developed for the mixture of factor analyzers model, the mixture of variance-gamma distributions, and the mixture of variance-gamma factor analyzers models. Other short-term objectives include the development of a mixture of multiple scaled variance-gamma distributions. This will bring a phenomenal level of modelling flexibility, while also guaranteeing cluster convexity -- the resulting components are hypercuboids so that the rate of decay can differ in each dimension. The mixture of multiple scaled variance-gamma distributions model will be extended to the mixture of multiple scaled variance-gamma factor analyzers model, for application to high-dimensional data. EAs will then be developed for the mixture of multiple scaled variance-gamma distributions and mixture of multiple scaled variance-gamma factor analyzers models and investigated as alternatives to alternating expectation-conditional maximization algorithms.
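
For orientation, the quantity that EM-type algorithms maximize in model-based clustering, and that an EA could plausibly use as its fitness, is the mixture log-likelihood. The generic form is given below. The notation is assumed here and does not come from the source: G components, mixing proportions \pi_g, component densities f_g with parameters \theta_g (Gaussian, variance-gamma, and so on), and observations x_1, ..., x_n.

```latex
f(\mathbf{x} \mid \boldsymbol{\vartheta})
  = \sum_{g=1}^{G} \pi_g \, f_g(\mathbf{x} \mid \boldsymbol{\theta}_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1,

\ell(\boldsymbol{\vartheta})
  = \sum_{i=1}^{n} \log \sum_{g=1}^{G} \pi_g \, f_g(\mathbf{x}_i \mid \boldsymbol{\theta}_g).
```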