Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey


The combined impact of new computing resources and techniques with an increasing avalanche of large datasets is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In recent years, Machine Learning and especially its subfield Deep Learning have seen impressive advances. Techniques developed within these two fields are now able to analyze and learn from huge amounts of real-world examples in disparate formats. While the number of Machine Learning algorithms is extensive and growing, so is the number of their implementations in frameworks and libraries. Software development in this field is fast paced, with a large amount of open-source software coming from academia, industry, start-ups and wider open-source communities. This survey presents a recent time-slide comprehensive overview with comparisons as well as trends in development and usage of cutting-edge Artificial Intelligence software. It also provides an overview of massive parallelism support that is capable of scaling computation effectively and efficiently in the era of Big Data.


Machine Learning and Deep Learning are research areas of computer science that are developing constantly, driven by advances in data analysis research in the Big Data era. This work provides a comprehensive survey with detailed comparisons of popular frameworks and libraries that exploit large-scale datasets. It can be summarized as follows:

  1. Most of the Deep Learning frameworks are developed by the world’s largest software companies such as Google, Facebook, and Microsoft. These companies possess huge amounts of data, high performance infrastructures, human intelligence and investment resources. Such tools include TensorFlow, Microsoft CNTK, Caffe, Caffe2, Torch, PyTorch, and MXNet. Other Deep Learning frameworks and libraries from other companies and research institutions, such as Chainer, Theano, Deeplearning4J, and H2O, are also interesting and suitable for industrial use.
  2. There are many high level Deep Learning wrapper libraries built on top of the above-mentioned Deep Learning frameworks and libraries. Such wrappers include Keras, TensorLayer and Gluon.
  3. Big Data ecosystems such as Apache Spark, Apache Flink and Cloudera Oryx 2 contain built-in Machine Learning libraries for large-scale data mining, mainly for tabular data. These Machine Learning libraries are still evolving, but the power of the whole ecosystem is significant.
  4. Vertical scalability for large-scale Deep Learning is still limited by GPU memory capacity, and horizontal scalability is still limited by the network communication latency between nodes.
  5. Every tool, including traditional general purpose Machine Learning tools, provides a way to process large-scale data.
  6. As of 2018, Python is the most popular programming language for data mining, Machine Learning and Deep Learning applications. It is used as a general purpose language for research, development and production, at small and large scales. The majority of tools are either Python tools or support Python interfaces.
  7. The trend shows a high number of interactive data analytics and data visualisation tools supporting decision makers.
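
The two limits named in item 4 can be made concrete with a back-of-envelope estimate. The sketch below is not taken from the survey. It assumes 32-bit values, an optimizer that keeps two extra buffers per weight (as Adam does) in addition to the weights and gradients, and a textbook ring all-reduce cost model for gradient exchange. The function names and all parameters are hypothetical, and activation memory is ignored.

```python
def training_memory_gib(n_params, bytes_per_value=4, copies=4):
    """Rough per-device memory in GiB for model state only.

    Assumption: copies=4 covers weights, gradients and two optimizer
    buffers per weight. Activations are not counted.
    """
    return n_params * bytes_per_value * copies / 2**30


def ring_allreduce_seconds(n_params, n_nodes, bandwidth_bytes_per_s,
                           latency_s, bytes_per_value=4):
    """Rough time to average gradients across nodes with a ring all-reduce.

    Assumption: the standard cost model with 2 * (n_nodes - 1) steps,
    each paying one link latency and sending 1/n_nodes of the payload.
    """
    if n_nodes < 2:
        return 0.0  # a single node has nothing to exchange
    payload = n_params * bytes_per_value
    steps = 2 * (n_nodes - 1)
    return steps * latency_s + steps * (payload / n_nodes) / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Illustrative numbers only: 1e9 parameters, 8 nodes, 10 Gbit/s links.
    print(training_memory_gib(1_000_000_000))
    print(ring_allreduce_seconds(1_000_000_000, 8, 10e9 / 8, 50e-6))
```

In this model the memory term grows linearly with model size regardless of how many nodes are added, while the latency term grows with the number of nodes, which is one simple way to see why neither direction of scaling is free.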

The impact of new computing resources and techniques combined with an increasing avalanche of large datasets is transforming many research areas. This evolution has many different faces, components and contexts. As the technology becomes more pervasive, domain knowledge alone is no longer sufficient to tackle complex problems. This makes it a great challenge for data mining projects to decide which tools to select among the myriad of frameworks, libraries, tools and approaches from the divergent Machine Learning and Deep Learning user communities in different application areas.

This work is supported by the “Designing and Enabling E-infrastructures for intensive Processing in a Hybrid DataCloud” (DEEP-Hybrid-DataCloud) project that has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 777435.

Published 19 January 2019.


Nguyen, G., Dlugolinsky, S., Bobák, M. et al. Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52, 77–124 (2019).

This is an open access article distributed under the terms of the Creative Commons CC BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
