Back in 2016, I wrote a blog article posing an interesting question: given the rise of Apache Spark as the de facto analytics platform, is SAS still relevant? The answer back then was a definitive yes. Well, a lot has changed in the past four years!

While we are reflecting on the past, let me digress for a moment to recognize an important birthdate. In a few weeks, it will be the 50th anniversary of the first official release of SAS, SAS71 (released in 1971). The genius of Anthony Barr’s creation can’t be overstated. His Statistical Analysis System (SAS) was decades ahead of its time, influencing the world of analytics and data processing forever. The fact that this post is still discussing it half a century later is a testament to that. While Barr handed over control of the SAS system many years ago, he’s still a prolific innovator and inventor, and has gone on to have an incredible impact on the world around him.

Over the past few years, SAS has done its best to keep up with the latest data science trends: supporting Python, releasing a new parallel-capable architecture, and providing integration with other data science tools. But the customer response has been underwhelming. Our market research indicates that SAS’ new Viya platform has had very slow uptake in the market. This may be due to Viya’s incompatibility with existing SAS customers’ massive SAS9 code bases. Code needs to be migrated to take advantage of Viya’s features; SAS still ships SAS9 with Viya, so un-migrated code can continue to run in the slower SAS9 engine.

Enough about SAS; what’s happened to Spark? It’s been a busy four years, so let’s do a quick recap:

  • Over a 100-fold increase in users, with use cases on clusters of over 10,000 commodity servers, as well as on many of the largest supercomputers in the world
  • Major investments from virtually all of the tech giants (Huawei, NTT, Alibaba, Tencent, Intel, NVIDIA, AMD, Adobe, eBay, Apple, Google, Facebook, Netflix, etc.), including Microsoft, where SQL Server now runs on top of Spark
  • Extremely active development with 2 major releases, 4 minor releases, and 20 maintenance releases
  • Over a 20-fold increase in execution speed for most workloads (hundreds of times faster than SAS9)
  • Simpler APIs, with feature parity between the Scala and Python APIs (see the sketch after this list)
  • Hundreds more DataFrame and SQL functions
  • Hundreds of new Spark ML models and features
  • A proper metadata model with the Spark 3 Catalog API
  • Tight integration with leading deep learning frameworks (TensorFlow, Keras, PyTorch, etc.)
  • ANSI SQL Compatibility (and significant PostgreSQL compatibility)
  • 50 times more books available on Apache Spark
  • PySpark reaches over 330,000 downloads per day via PyPI (one way to download PySpark), making it by far the top data science and engineering package in Python, and in open source in general
  • Spark is the leading component in every major analytics cloud offering:
    • AWS (EMR, Sagemaker, Databricks)
    • Azure (HDInsight, Synapse, Databricks)
    • GCP (Dataproc)
  • Databricks (a Spark cloud offering from the creators of Spark) was valued at over $6 billion USD in its latest round of funding in 2019
  • Migrating SAS code to Spark is now Fast, Simple and Accurate
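
To make the API and Catalog points above concrete, here is a minimal PySpark sketch. The dataset, names, and values are made up purely for illustration and are not from the original post; it simply shows the Python DataFrame API and the Spark 3 Catalog API in a few lines:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark3-api-demo").getOrCreate()

# A tiny, made-up sales dataset, purely for illustration
df = spark.createDataFrame(
    [("2020-01-01", "east", 100.0), ("2020-01-02", "west", 250.0)],
    ["sale_date", "region", "amount"],
)

# The Python DataFrame API has feature parity with Scala for work like this
summary = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

# Register the data and inspect it through the Spark 3 Catalog API
df.createOrReplaceTempView("sales")
for table in spark.catalog.listTables():
    print(table.name, table.tableType)
```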

How has so much happened in such a short time frame? The way I see it, the greatest advantage of open source development is the diversity of thought and perspective from impassioned people around the world; it’s truly a force multiplier for innovation. In Spark’s case, the incredible early gains attracted even more people to help, including some of the most talented researchers and practitioners in the field.

Looking back, one of my two main criticisms of Spark in 2016 was the lack of a mature ecosystem of supporting technologies. That picture has changed dramatically. For model and data governance, you’ve now got mature tools like MLflow and Spline. For data and job flow orchestration (e.g. SAS DI Studio), you’ve got the industry-leading Apache Airflow and Apache NiFi tools. Point-and-click data mining has matured as well, with tools like H2O / Sparkling Water. As for data sources, Spark now has basically everything you need, including a full-featured spark-sas7bdat SAS dataset reader (proudly sponsored by WiseWithData). There are even Delta Lake and Apache Hudi, ACID-compliant Spark data sources for building large data warehouses with near-real-time capabilities.
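
As a rough sketch of what that looks like in practice, the few lines of PySpark below read a legacy SAS dataset with the spark-sas7bdat reader and land it as a Delta Lake table. The file path, output location, and package coordinates are placeholders of my own, not details from the original post, and they assume the spark-sas7bdat and Delta Lake packages are already on the cluster’s classpath:

```python
from pyspark.sql import SparkSession

# Assumes the spark-sas7bdat and Delta Lake packages are available, e.g. via
#   --packages saurfang:spark-sas7bdat:<version>,io.delta:delta-core_2.12:<version>
spark = SparkSession.builder.appName("sas-to-delta-demo").getOrCreate()

# Read a legacy SAS dataset straight into a Spark DataFrame (path is a placeholder)
claims = (
    spark.read
    .format("com.github.saurfang.sas.spark")
    .load("/data/legacy/claims.sas7bdat")
)

# Land it as an ACID-compliant Delta Lake table (location is a placeholder)
claims.write.format("delta").mode("overwrite").save("/data/lake/claims")
```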

The other big criticism I had of Spark at the time was the lack of people who knew Spark, a situation that has now completely reversed. Spark’s ease of use and exponential growth in adoption have created a wealth of Spark-based data science and data engineering talent. Contrast that with an ever-steepening decline in SAS knowledge in the marketplace. What we consistently hear from our customers is that they can’t find SAS-based talent at any price. A great many SAS practitioners have retired after long and fruitful careers, and for the past 20 years few universities have been teaching SAS to students. Companies complain they can’t attract or retain data science talent, as many candidates only want to work with modern platforms.

What can we take from all of this? In the face of an overwhelming trend towards free and open data science, with combined global resources fast-tracking innovation and breaking down siloed data domains, SAS will be hard pressed to out-manoeuvre, out-innovate or out-market Apache Spark, which matches or exceeds SAS’ capabilities in almost every respect while being hundreds of times more performant and infinitely scalable.

Free is free, and SAS is most definitely not free. How does one make money when the competition is free? The harsh reality is that in these disruptive times it’s hard to do so without a significant change in strategy. The SAS Institute is now admitting that it is losing money for the first time in 44 years. SAS is of course not alone; IBM’s dramatic moves to ditch its legacy dead weight and focus on open source and cloud make the same point. Bottom line: organizations trying to use (or sell) proprietary tools that are being displaced by evolving open source innovations solving the current needs of the marketplace all have to face the question of how they can stay relevant.