Subventions et des contributions :

Retour à la page de recherche

Titre :

Faster Compressed Indexes On Next-Generation Hardware

Numéro de l’entente :

RGPIN

Valeur d'entente :

210 000,00 $

Date d'entente :

10 mai 2017 -

Organisation :

Conseil de recherches en sciences naturelles et en génie du Canada

Location :

Québec, Autre, CA

Numéro de référence :

GC-2017-Q1-01595

Type d'entente :

subvention

Type de rapport :

Subventions et des contributions

Informations supplémentaires :

Subvention ou bourse octroyée s'appliquant à plus d'un exercice financier. (2017-2018 à 2022-2023)

Nom légal du bénéficiaire :

Lemire, Daniel (TÉLUQ)

Programme :

Programme de subventions à la découverte - individuelles

But du programme :

Software indexes accelerate applications in business analytics, machine learning, and data science. They often determine the performance of big-data applications. Efficient indexes not only improve latency and throughput, but they also reduce energy usage. Many indexes make parsimonious use of internal memory so that critical data remains close to the processor. It is also desirable to work directly on the compressed data, to avoid potentially harmful decoding passes. Thus, we use lightweight compression strategies, optimized for speed.

We are interested in bitmap indexes. We find them in a wide range of popular systems: Oracle, Apache Hive, Apache Spark, Druid, Apache Kylin, Apache Lucene, Elastic, Git and so forth. They are an integral part of systems—such as Wikipedia or GitHub—used by millions of people every day .

Our long-term plan has three axes of research:

(1) Pursue the optimization of existing bitmap indexes, as they are used in current systems. Many of these systems rely on either Roaring or EWAH bitmaps, two formats we produced. We plan to multiply the performance of some of these indexes on processors supporting advanced SIMD (single instruction, multiple data) instructions such as those of the AVX2 and AVX-512 families.

(2) Continue to break speed records with our integer-compression techniques. We focus on sorted lists of integers, as they frequently appear in B+-trees, inverted indexes and compressed bitmap indexes. In recent years, we showed that we could decode billions of integers per second while maintaining compression ratios close to the limit given by Shannon's entropy. Yet we used all but a fraction of the features of the latest processors. We will establish new performance records on recent server processors while compressing even better. We will also accelerate applications with rank, select, merge, update and insert functions operating directly over the compressed data. Finally, we will apply this work to database engines (e.g., upscaledb ) and big-data systems (e.g., Apache Parquet, Druid, Apache Spark).

(3) Develop a novel bitmap index format that outclasses the state of the art in both memory usage and raw speed. Currently, one of the best formats is Roaring: it is widely adopted, fast and it uses little memory. We seek to design a new format that uses even less memory than Roaring while improving the speed. Though it is relatively easy to offer better compression, doing so while improving query performance represents a real challenge. This research axis builds directly on the first two axes.