Grants and Contributions:

Title:
Entity augmentation and data cleaning for machine learning
Agreement Number:
CRDPJ
Agreement Value:
$180,000.00
Agreement Date:
Dec. 13, 2017 -
Organization:
Natural Sciences and Engineering Research Council of Canada
Location:
British Columbia, Other, CA
Reference Number:
GC-2017-Q3-00414
Agreement Type:
Grant
Report Type:
Grants and Contributions
Additional Information:

Grant or award applying to more than one fiscal year (2017-2018 to 2020-2021).

Recipient's Legal Name:
Wang, Jiannan (Simon Fraser University)
Program:
Collaborative Research and Development Grants - Project
Program Purpose:

With the rise of Big Data, companies and organizations are increasingly eager to use machine learning to extract value from their data and to enable data-driven decision making. However, machine learning often assumes that the data has been well prepared, and focuses mainly on learning from the data and making predictions. In reality, data often comes from multiple sources, so a lot of time is spent on data integration; real-world data is also often dirty, and data cleaning is an extremely time-consuming and expensive process. In interviews, data scientists report that they can spend 80% of their time on data preparation. This problem will be further exacerbated in emerging Big Data scenarios, where data volumes are increasing or data comes from a larger variety of sources.
To this end, this project studies how to reduce the cost of data preparation for machine learning. We will focus on two challenging research topics: (1) "Entity augmentation" studies how to efficiently augment entities (e.g., restaurants, persons) with new attributes (e.g., location, occupation) from external data sources. (2) "Data cleaning for machine learning" studies how to reduce cost by cleaning only the data that is most beneficial to predictions. This project benefits the Canadian economy in several ways. First, more and more companies in Canada rely on machine learning to make critical business decisions (e.g., churn prediction, fraud detection). The techniques developed in this project can reduce the time these companies spend preparing data for machine learning, helping them to improve prediction accuracy and grow revenue. Second, the outcome of the project will further boost the development of data science technologies, democratize machine learning for small companies, and help to create more data science jobs in Canada.