Towards a new API for big data analysis
Source of subsidy
European Projects
Professors involved
- Guy Tremblay (UQAM)
- Marco Aldinucci (Prof., Università degli studi di Torino)
Summary
With the increasing number of big data analytics tools, it becomes difficult for a user approaching analytics to get a clear picture of the expressiveness these tools provide for solving user-defined problems. Our first aim is to review the features that typical big data analytics tools (e.g., Spark, Storm, Flink, Beam) offer to the user in terms of API, and how those features relate to parallel computing paradigms. More precisely, these well-known tools will be analysed and described in terms of a common computational model underlying all of them, namely the Dataflow model. This common model is the key to comparing the expressiveness and the theoretical scalability of the presented big data tools. On top of the Dataflow model, we will define a stack of layers, where each layer represents a dataflow graph with a different meaning, describing a program from what is exposed to the programmer down to the underlying execution-model layer.

We also aim to develop a new C++ API with a fluent interface that makes programming data analytics applications easier while preserving or enhancing their performance. This will be attained through three key design choices: 1) unifying batch and stream data access models, 2) decoupling processing from data layout, and 3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. This API will propose a programming model based on pipelines and operators that are polymorphic with respect to data types, in the sense that the same algorithms and pipelines can be reused on different data models (e.g., streams, lists, sets).
Students and interns
- Claudia Misale (IBM T.J. Watson Research Center)
- Maurizio Drocco (Postdoc, Università di Torino)
