Frequently Asked Questions

What is Apache Spark?

Apache Spark is an open-source cluster computing framework that has seen very rapid development since becoming a top-level Apache Software Foundation project in 2014. In only the past year it has quickly gained functionality that brings it into the mainstream as a fully functional Business Analytics platform, one with unparalleled technical advantages over even much more mature legacy platforms such as SAS or Hadoop. It can be thought of as an evolution of Hadoop's two-stage, disk-based MapReduce paradigm. Spark's multi-stage in-memory primitives provide vastly improved performance, scalability, usability, and functionality over MapReduce or SAS for many applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to advanced analytical methods and machine learning algorithms (ML/MLlib, GraphX), as well as data integration tasks.

What is Big Data?

Big Data has become a Big Buzzword over the last few years, and nobody knows exactly what it means. Some people use it to refer to analytics in general, while others use it in the stricter sense of technologies that scale to handle vast volumes of data. At WWD, our preferred definition is any situation where you are dealing with enough data that technical challenges arise in manipulating it and getting the most out of it. By that definition, data scientists and statisticians had been grappling with Big Data challenges long before the buzzword came along. To overcome data volume issues, you typically have to employ strategies such as sampling to get the required outcome in a reasonable time-frame. Current Big Data technologies such as Apache Spark do not always remove these obstacles; they simply make it technically feasible to overcome them by clustering a large number of computers together to accomplish a task much more quickly. This strategy may, however, be uneconomical depending on the nature of the problem and the size of the data.
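The sampling strategy mentioned above can be sketched in a few lines of plain Python: estimate a statistic from a small random sample instead of scanning the full dataset. The population, sample size, and seed here are illustrative.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(1_000_000))         # stand-in for a large dataset
sample = random.sample(population, 10_000)  # a 1% simple random sample

estimate = sum(sample) / len(sample)        # cheap: scans 10,000 values
exact = sum(population) / len(population)   # expensive: scans all 1,000,000

# The sampled estimate lands close to the exact mean at a
# fraction of the cost of a full scan.
relative_error = abs(estimate - exact) / exact
```

The trade-off is a small, quantifiable error in exchange for a large reduction in compute; whether that trade-off beats clustering more machines together is exactly the economic question raised above.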

What is Machine Learning?

Machine learning is another way of saying advanced analytics or predictive analytics. Because data science has evolved from both the computer science domain and the statistics domain, each field of study has its own terminology for the same things. Apache Spark was started by a team of computer scientists, so the terminology it uses differs from that of languages such as SAS or R, which were influenced mainly by statisticians. Machine learning can also be broader in scope, incorporating speech, image, and video recognition and analysis, which are typically not the domain of Business Analytics. Machine learning algorithms are increasingly being embedded as product features in software to learn automatically from user interactions, which is blurring the lines of what Business Analytics is.

What is AI/Deep Learning?

Deep learning algorithms have become a buzzword much like Big Data. Known as artificial neural networks for decades, they are a highly accurate way to build predictive and classification models. The fact that they are now starting to be embedded, in real time, in user interactions with software is where the buzz has really become a mainstream concept. Unfortunately, for many purposes they are not the most suitable algorithms due to their lack of transparency. For many Business Analytics use cases, it is preferable to have a slightly less powerful model, but one which can be easily understood and diagnosed.