Grants and Contributions:

Title:

Outlier Detection in High-Dimensional Big Data using Bio-Inspired Methods for Emerging Applications in Engineering, Healthcare, and Business

Agreement Number:

RGPIN

Agreement Value:

$120,000.00

Agreement Date:

May 10, 2017 -

Organization:

Natural Sciences and Engineering Research Council of Canada

Location:

Ontario, Other, CA

Reference Number:

GC-2017-Q1-01812

Agreement Type:

Grant

Report Type:

Grants and Contributions

Additional Information:

Grant or award applying to more than one fiscal year (2017-2018 to 2022-2023).

Recipient's Legal Name:

Raahemi, Bijan (Université d’Ottawa)

Program:

Discovery Grants Program - Individual

Program Purpose:

In this research program, I will explore, design, and analyze innovative algorithms for outlier detection in high-dimensional big data using bio-inspired approaches, and apply the new methods to emerging applications in engineering (namely, intrusion detection in computer networks), business (namely, fraud detection in corporate financial statements), and healthcare (namely, detecting abnormalities in patients' vital signs).

Big data, characterized by the four (and sometimes more) V's of Volume, Velocity, Variety, and Veracity, refers to collections of data sets so large, dynamic, and complex that they become difficult to process using traditional data analytics techniques. In high-dimensional spaces, distances between points become nearly uniform, so the notion of a data point's nearest neighbors loses its meaning. High-dimensional data also has numerous permutations of subspaces, which cannot feasibly all be examined. Processing such large-scale, multi-dimensional data is computationally complex and expensive.
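As an illustration only, the loss of contrast between near and far neighbors described above can be observed with a few lines of Python. This is a minimal sketch under stated assumptions: the points are drawn uniformly at random from the unit hypercube, and the function name `relative_contrast` and its parameters are hypothetical, chosen for this example rather than taken from the research program.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption: points are uniform in the unit hypercube [0, 1)^dim.
    A value near zero means the farthest neighbor is barely farther
    away than the nearest one.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    # Euclidean distance from the query to each freshly sampled point.
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the printed ratio shrinks as the dimension grows, which is why neighbor-based outlier scores tend to lose discriminative power on the full feature space.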

Detecting outliers (objects considerably dissimilar to, and inconsistent with, the majority of the data) in big data, especially in high-dimensional data and in the presence of noise, is an important research problem. It has drawn considerable attention in the research community because of the scientific challenges it poses and the wide range of real-world applications it supports, including engineering, healthcare, business, the environment, and public security.

In this research program, I will explore novel techniques for dimension reduction, data summarization, and feature transformation running on distributed platforms, combined with ensembles of models, to detect outliers quickly and accurately. I will explore bio-inspired algorithms to search the large space of subspace permutations, using fitness functions that minimize the sparsity of the samples in the selected subspaces.
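The kind of bio-inspired subspace search described above can be sketched as a small evolutionary loop. Everything in this sketch is an assumption made for illustration and is not the program's actual method: the fitness is a grid-based sparsity coefficient (observed count in a cell minus the count expected under uniformity, divided by the binomial standard deviation), the data is assumed to be scaled to [0, 1) and cut into `phi` equal-width bins per dimension, and the search uses only mutation with truncation selection. The names `sparsity`, `random_cell`, `mutate`, and `evolve` are hypothetical.

```python
import math
import random


def sparsity(cell, data, phi):
    """Sparsity coefficient of a grid cell; very negative means unusually empty.

    cell maps a dimension index to a bin index in range(phi).
    Assumption: every value in data lies in [0, 1).
    """
    n = len(data)
    # Count the points that fall inside the cell on every selected dimension.
    count = sum(
        all(min(int(row[d] * phi), phi - 1) == b for d, b in cell.items())
        for row in data
    )
    p = (1.0 / phi) ** len(cell)  # chance of hitting the cell under uniformity
    return (count - n * p) / math.sqrt(n * p * (1.0 - p))


def random_cell(n_dims, k, phi, rng):
    """Pick k distinct dimensions and one bin on each."""
    return {d: rng.randrange(phi) for d in rng.sample(range(n_dims), k)}


def mutate(cell, n_dims, phi, rng):
    """Either move one dimension to another bin or swap in a new dimension."""
    child = dict(cell)
    d = rng.choice(list(child))
    if rng.random() < 0.5:
        child[d] = rng.randrange(phi)
    else:
        unused = [x for x in range(n_dims) if x not in child]
        if unused:
            child[rng.choice(unused)] = child.pop(d)
    return child


def evolve(data, k=2, phi=4, pop_size=20, generations=30, seed=0):
    """Return (cell, score) for the sparsest cell found by the search."""
    rng = random.Random(seed)
    n_dims = len(data[0])

    def score(cell):
        return sparsity(cell, data, phi)

    population = [random_cell(n_dims, k, phi, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score)
        survivors = population[: pop_size // 2]  # keep the sparsest half
        children = [
            mutate(rng.choice(survivors), n_dims, phi, rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    best = min(population, key=score)
    return best, score(best)
```

Points that fall in the cells returned by such a search would be the outlier candidates. A fuller treatment would likely add crossover, equi-depth rather than equal-width bins, and a distributed evaluation of the fitness function.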

Analysis of big data relies on scalable distributed platforms such as Hadoop (which supports the MapReduce model for analyzing large data sets in parallel) and Spark (a fast in-memory engine for large-scale data processing). In my Knowledge Discovery and Data Mining Lab, we have experimented with processing tasks in parallel using Hadoop and Spark. Building on this experience, we will design and implement our novel solutions on distributed platforms.

The solutions and algorithms developed in this research program will be applied to emerging applications in three areas: engineering (analyzing large volumes of high-dimensional data generated by Internet traffic to detect network intrusions), business (detecting fraudulent financial activities in a real data set of more than 4,000 firms provided by Bloomberg and CompuStat), and healthcare (analyzing vital signs collected from patients, including temperature, heartbeat, blood pressure, and ECG signals, to detect anomalies).