Back in 2015, when we set out to build SPROCKET, the world’s only SAS modernization solution, one key design question plagued our thoughts: which language should we convert SAS code to? Apache Spark was scalable, simple, fast and open-source, and it was obvious from its early days that it was the analytics platform of the future. But there are many languages you can use to interact with Spark. Back then, Scala and SQL were the dominant languages of Spark, but Java, Python and R were also officially supported. At the time, both R and Python had strong existing user bases and strong growth trajectories.
Let’s talk about the obvious choices first. SQL by itself was just not going to cut it. SAS is an extremely rich language that can’t be described solely by SQL. Scala was, and still is, a relatively obscure language. It was a perfect choice for the authors of Spark, because it allowed the community to leverage all the existing Java-based Hadoop libraries with ease, without the extreme verbosity of the Java language. But it’s still based on the JVM, and that comes with a lot of complexity and added baggage for end-users.
So I set out to give both Python and R a really good shake, and figure out what path we should take. In fact, I even wrote a blog post on the topic. From my days at SAS, I’d heard a lot about R, and most of it was not good. But learning about R from current and former SAS employees is clearly a flawed strategy, as many will have biases against anything that’s not SAS.
I thought perhaps talking to people who use R might be a better strategy. But I found most of those conversations went down weird rabbit holes. Even R fanatics are more likely to come up with a list of excuses for why the language is odd than to talk about its strengths and use-cases. The best quote I’ve heard is that “R is a language designed by and for academic statisticians”. If this is indeed true, then it would certainly help define what the language is all about.
Is R just a domain-specific language for academic statisticians that’s been co-opted by others out of need? The only theme I heard in support of R was that it is the only open-source tool with a highly developed system of packages for all statistical applications. While that might have been true many years ago, in my near decade of Python use, I’ve never found it lacking statistical functionality.
I really needed to explore this topic more for myself to be sure. I spent months trying to learn and understand R. Let’s just say that I came back from my exploration underwhelmed and confused. After spending 20 years learning SAS, an antique language with bizarre, out-of-date concepts, R by comparison seemed in many ways to be far worse. It was filled with inconsistencies and a seemingly complete lack of knowledge of how programming languages are supposed to work. No wonder many at SAS laughed off the threat of R.
In my research, I came across an in-depth 27-page exploration of the R language that helped to shed some light on the situation. As a side note, I doubt Microsoft would have invested so much money and effort buying Revolution Analytics (an R-based startup) back in 2015 if it had read and understood this research. These days, they’ve all but abandoned the R language in favor of Python. While the research is now somewhat dated, it still holds largely true today. This sentence in the abstract of the paper provides a great glimpse into the language:
“This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular.”
While not a page-turner, some of the findings are strikingly blunt in their conclusions about the strange semi-object-oriented paradigm being leveraged in R. The first such conclusion relates to the overall structure of the language and the project itself:
“Our first challenge was to understand the unconventional semantics of the language and the sometimes subtle interactions between its features. While some documentation exists, it is incomplete. The language is effectively defined by the successive releases of its implementation. Relying on an implementation as the authoritative specification of a language is unsatisfactory; the R interpreter is constrained by implementation decisions and presents a programming model that is at same time overconstrained and ambiguous. Implementation details are exposed and slowly bleed into the language…The object-oriented side of the language feels like an afterthought. The combination of mutable objects without references or cyclic structures is odd and cumbersome.”
Indeed, not only is the language poorly laid out, it’s also really bad at its main job, which is to process data. This forces users to jump through all kinds of hoops just to make things work correctly, making the code much more complicated, a problem I know all too well from SAS.
“The current implementation of R is massively inefficient…For the object system, it should be built-in rather than synthesized out of reflective calls. Copy semantics can be really costly and force users to use tricks to get around the copies. A limited form of references would be more efficient and lead to better code”
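To make the point about copy semantics concrete, here is a minimal sketch in Python (not R) of the two styles the paper contrasts. It is only an analogy: the function names `add_row_by_copy` and `add_row_in_place` are made up for illustration, and R’s real behavior is copy-on-modify with its own optimizations, so this should not be read as a model of the R interpreter.

```python
import copy


def add_row_by_copy(table, row):
    """Mimic copy (value) semantics: the caller's table is never modified."""
    new_table = copy.deepcopy(table)  # the whole table is duplicated on every update
    new_table.append(row)
    return new_table


def add_row_in_place(table, row):
    """Reference semantics: the caller's object is mutated directly, no copy made."""
    table.append(row)
    return table


original = [[1, "a"], [2, "b"]]

# Copy style: 'original' is untouched, and a second full table now exists.
copied = add_row_by_copy(original, [3, "c"])

# Reference style: 'same' is the very same object as 'original'.
same = add_row_in_place(original, [4, "d"])
```

With a two-row table the difference is invisible, but if every update duplicates a table with millions of rows, the cost adds up quickly. That is the kind of overhead the authors say pushes users toward tricks to avoid the copies.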
Most importantly, R’s use in larger enterprise-scale projects is a no-go. The features that allow a language to scale from a single-user notebook all the way up to massive systems with millions of lines of code, developed by many users over many years, are just not there with R. For enterprise uses, the evolution of the language and its versions must be tightly controlled and planned out, which is clearly not the case with R.
“So, R is not the ideal language for developing robust packages. Improving R will require increasing encapsulation, providing more static guarantees, while decreasing the number and reach of reflective features. Furthermore, the language specification must be divorced from its implementation and implementation-specific features must be deprecated.”
This last point around deprecation is something that I know all too well. The SAS language is now over 50 years old, and as far as I know nothing has ever been removed or deprecated since the first official SAS release in 1971 (SAS ’71). Someone please do prove me wrong on this one :) There are still arcane references to punch cards and mainframe concepts used throughout, even in recently developed SAS code. That complexity is ultimately a burden to end-users, making the language harder and harder to learn over time.
“R is a language designed by and for academic statisticians”. Yes, that statement does indeed capture the essence of it. Not that I have anything against statisticians; I am good friends with quite a few. But they are not computer scientists or software engineers, and that’s who should be developing programming languages. Dentists and hair stylists – I go sit in a chair and both try desperately to make me look presentable – but alas they require vastly different skills.
By contrast, Python is clean, popular and modern, and designed by incredibly talented computer scientists and engineers. It has had its growing pains (e.g. the switch from Python 2 to Python 3), but it did the right thing and deprecated concepts that didn’t fit with the overall idea of the language. Python by most measures is now the #1 programming language in the world, used by Analysts, Statisticians, Mathematicians, Engineers, Physicists, Chemists, Biologists, Astronomers, Climatologists, AI experts, Data Scientists and Data Engineers.
In 2022, Python is undeniably the common tongue of science and engineering, used by all. It’s no wonder R usage and popularity are falling quickly, with a steep 50% drop in measured usage in just the past year alone.
With our SPROCKET Conversion Solution that is Fast, Simple and Accurate, we only convert SAS code to modern PySpark. To learn more about the SPROCKET Solution, contact us at email@example.com