The revolution of Spark is igniting an vigorous debate within the data operations, data science and analytic professionals community. Spark supports of 4 different programming languages (not including SQL), making choosing a programming language difficult. Although you can learn multiple languages, it takes a long time to master any one. If you are someone who wants to get started with Spark, this fundamental choice will be one of the more important choices you make. Let’s start by comparing the different languages available in Spark.
A mature purely object-oriented language with a large developer base and loyal following. Many of the desktop and web applications we all know and love were developed in Java. It is one of the core languages taught in computer science programs, being very easy to learn the basics of the language and a great language for teaching object-oriented concepts. It also has a huge ecosystem of proprietary and open-source libraries components. Java rose in popularity in large part due to its Java Virtual Machine (JVM) technology, which allowed for applications to be portable across different platforms and operating systems. Its main drawbacks are the complex nature of the ecosystem, slower performance than lower level languages, and verbose language constructs.
Scala is also an object-oriented language, developed to run within the Java Virtual Machine (JVM). As it runs within the JVM, all libraries built for Java are able to be simply used within Scala. Its maintains many of the benefits of Java while fixing two of the main issues with Java, its verbose language, and lack of REPL (Read-Evaluate-Print-Loop). A REPL let programmers try out code interactively, eliminating the need for big compile steps to test bits of code, a must-have feature for many data science and analytics use cases. Its popularity has grown tremendously over the past few years, and is the language that most of Apache Spark was developed in, so many features come to the Scala API first.
Python is a popular general purpose object-oriented programming language, with a strong emphasis on simple, readable, and maintainable code. It has a large developer base, and is often taught not only in computer science courses, but also relevant engineering and science courses as well. It has a very intuitive language focused on only a few core foundation concepts. Like Scala, it has a REPL, which combined with the simple code syntax leads to rapid application development. Code style choice has been purposefully restricted, as some style aspects such as indentation, change the meaning of code. It has vast library of add-on modules including data processing, data science and scientific data processing functionality. It also has the advantage that there is a pre-built Spark compatible DataFrames object with the NumPy module that has built-in graphing routines. The main drawback to Python is performance, which can be considerably slower than other languages for certain operations. There is however a separate implementation called PyPy, which has dramatically improved performance for many purposes.
R has been gaining in popularity for statistical and analytics applications for the past few years. Developed in large part by statisticians, it has a very narrow focus on pure statistical applications. Its main strengths are the vast number of available statistical procedures, along with data processing and graphing capabilities. The language is not very readable or intuitive, especially to non-statisticians and performance of statistical procedures is not as good as those available in other languages.
What to choose
The Spark API’s are very similar in all the languages. Regardless of the language you choose, the code you write when interacting with the Spark cluster will have a very similar syntax. The difference lies in what happens when the the data gets returned to the driver node, and wrappers you put around Spark API code. That means that performance of the underlying language is not a big deal in most cases, especially when using DataFrames or the upcoming Datasets API. Each language has its strengths, but here’s some general recommendations.
If you are working on data operations or application development, Scala‘s access to all that Java has to offer, without the complex code makes it the preferred choice. You’ll need to know Scala if you want to contribute or tinker with the Spark code base. If you’re doing any data science work, the simplicity and speed of development makes Python the preferred choice. If you are integrating Spark into an existing Java application, it makes sense to use the Java API’s. I would only recommend using R if you are a academic statistician or need access to analytical features that aren’t available elsewhere. Over time, Spark will likely absorb more and more of R‘s analytical features in the ML/MLLib modules, as they need to be rewritten to take advantage of the scalability and parallelism of Spark.
To summarize, I view Java and R as serving niche purposes within the Spark ecosystem; the real choice is between Scala and Python. Personally I love Python, as I find it easy to learn, intuitive, and simple and quick to express complex concepts within it.