An analytics platform is a system in which every step and function of the business analytics process is supported within one common framework, which enables code reuse and removes tool integration issues. There are only a few analytics platforms on the market: the open source Apache Spark and the legacy proprietary platforms from SAS and IBM. IBM’s analytics toolset is not a true platform; it is more of a loose collection of buggy, under-invested and poorly designed legacy tools, all of which were acquired by IBM, so we won’t cover it further.
The Apache Spark Analytics Platform
Apache Spark is an open source cluster computing framework originally developed in 2009 at the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. It can be thought of as an evolution of Hadoop’s two-stage, disk-based MapReduce paradigm, but with vastly improved performance, scalability, usability and functionality. Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to advanced analytical methods and machine learning algorithms (ML/MLlib, GraphX/GraphFrames), as well as data integration tasks.
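The load-once, query-repeatedly idea can be sketched in plain Python. This is only an analogy under stated assumptions, not Spark code: the CountingSource class, the sales.csv sample file and its region/amount columns are all hypothetical, and a single Python list stands in for a cluster's memory. The sketch counts how often the file is read when every query goes back to disk, compared with when the parsed rows are kept in memory and reused.

```python
import csv
import os
import tempfile


class CountingSource:
    """Hypothetical data source that counts how often the file is read."""

    def __init__(self, path):
        self.path = path
        self.reads = 0

    def load(self):
        # Every call goes back to disk and re-parses the whole file.
        self.reads += 1
        with open(self.path, newline="") as f:
            return [(row["region"], float(row["amount"]))
                    for row in csv.DictReader(f)]


def total_for(rows, region):
    """Sum the amounts for one region."""
    return sum(amount for r, amount in rows if r == region)


def write_sample(path):
    """Write a tiny, made-up sales file to query."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "amount"])
        writer.writerows([("north", 10.0), ("south", 5.5),
                          ("north", 2.5), ("west", 7.0)])


def run_demo():
    regions = ("north", "south", "west")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sales.csv")
        write_sample(path)

        # Disk-based style: each query re-reads its input.
        disk = CountingSource(path)
        disk_totals = {r: total_for(disk.load(), r) for r in regions}

        # In-memory style: load once, then run every query on the cached rows.
        memory = CountingSource(path)
        cached_rows = memory.load()
        memory_totals = {r: total_for(cached_rows, r) for r in regions}

    return disk_totals, disk.reads, memory_totals, memory.reads


if __name__ == "__main__":
    disk_totals, disk_reads, memory_totals, memory_reads = run_demo()
    print("same answers:", disk_totals == memory_totals)
    print("file reads, disk style:", disk_reads)
    print("file reads, in-memory style:", memory_reads)
```

Both styles return the same totals; the difference is how many times the input has to be read and parsed. In Spark itself the comparable step is marking a dataset as cached (for example with the DataFrame cache() method) so that later queries reuse the in-memory copy across the cluster.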
Apache Spark surfaces much of its functionality through application programming interfaces (APIs) for a number of programming languages, namely SQL, R, Python, Scala and Java. The choice of language for a given task is up to the user, as most of the interfaces expose the same underlying Spark functionality. A vast array of third-party tools interact with Spark, including popular tools from the Python, R and Hadoop communities such as Jupyter/IPython Notebook, RStudio, Hue, Zeppelin and others. Together, the Apache Spark ecosystem provides all the tools necessary for efficient business analytics work.
The SAS Analytics Platform
SAS is a legacy software suite that can mine, alter, manage and retrieve data from a variety of sources and perform statistical analysis on it. Nearly 50 years old, SAS still has a large base of legacy customers, mainly because of perceived migration costs, something WWD has been investing heavily to address. The main use case for SAS is for those legacy customers; WWD does not recommend SAS to any customer not already heavily invested in SAS-based technology. With its core data structure, the SAS data set format (SAS7BDAT), largely unchanged since the 1990s, SAS has major performance issues and technical gaps that limit its ability to solve today’s complex business problems.