Spark 2.2 is out…and we have a surprise giveaway to celebrate

After over 3 months of intensive testing, Spark 2.2 was released today. This release contains over 1,100 improvements, bug fixes, and new features. Most importantly, Spark now has a Cost-Based Optimizer (CBO). Huawei and Databricks worked together on the CBO and on the ability to collect table- and column-level statistics, enabling more intelligent optimization of physical query plans. The CBO was needed for Spark to round out its data warehousing capabilities. There’s also a slew of great enhancements to ML, and for the first time ever, near-complete feature parity between Python and Scala. R gets much better support now as well. Structured Streaming goes “mainstream” with additional enhancements and the removal of the “experimental” tag. With Structured Streaming, the DataFrame batch and stream interfaces are almost identical, making development and code reuse a snap. Streams can now even be queried in real time on live data.
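As a rough illustration, here is a minimal PySpark 2.2 sketch of these new capabilities. It assumes an existing SparkSession named spark, a hypothetical table called sales with region and amount columns, and a hypothetical /data/events directory of JSON files; treat it as a sketch rather than a definitive recipe.

    # Enable the Cost-Based Optimizer (off by default) and collect the
    # table- and column-level statistics it relies on
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount")

    # Batch and streaming reads share nearly the same DataFrame interface
    batch_df = spark.read.json("/data/events")
    stream_df = spark.readStream.schema(batch_df.schema).json("/data/events")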

To celebrate all the great work that has gone into Spark 2.2, we are giving away free PySpark 2.2 quick reference guides. Our PySpark quick reference guides, which are typically only provided to students in our courses, are a single double-sided page and provide at-a-glance lookups for the core object types, functions, and capabilities in PySpark. To get your free copy, simply send an email to inquiry@wisewithdata.com.

Spark 1.6.0 is Here

The next major release of Apache Spark is now available. As usual, it contains a huge number of improvements across the whole platform. Here’s a quick rundown of some of the big changes:

New Dataset API

The DataFrames API and its major advantages (especially its use of the Catalyst optimizer) have been extended into the territory of RDDs through a new API called Datasets. With Datasets, you lose a small amount of flexibility versus RDDs (namely, a Dataset can’t contain arbitrary Java objects), but in return you get massive performance and scalability gains. Going forward, most users should be using the Dataset and DataFrame APIs, and only use RDDs when absolutely necessary; RDDs will become a low-level concept that is rarely used. Unfortunately for Python and R fans, you’ll have to wait a bit longer for the Dataset API, as it is Scala-only at this point. My guess is that it will arrive in the next release, which may be 1.7 or might be 2.0.
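Datasets themselves are Scala-only in 1.6, but the same “prefer the higher-level API over raw RDDs” guidance already applies to DataFrames in Python. A minimal sketch, assuming an existing SparkContext sc, a SQLContext sqlContext, and a hypothetical people.json file with an age field:

    import json

    # RDD version: arbitrary Python objects, but no Catalyst optimization
    adults_rdd = (sc.textFile("people.json")
                    .map(json.loads)
                    .filter(lambda p: p["age"] > 21))

    # DataFrame version: the same logic, planned and optimized by Catalyst
    adults_df = sqlContext.read.json("people.json").filter("age > 21")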

Memory Management and Performance

Spark 1.6 is much more intelligent about how memory is shared between data caching and execution. Prior to 1.6, the two areas were separated, leading to inefficient memory utilization. Along with the memory management enhancements, there’s a raft of performance improvements, some related to Tungsten and Catalyst, but also on the storage side.
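The unified pool is tunable through a couple of new settings. Here is a hedged sketch of setting them from PySpark; the values shown are simply the 1.6 defaults, not a recommendation:

    from pyspark import SparkConf, SparkContext

    # Execution and storage now share one unified memory pool
    conf = (SparkConf()
            .set("spark.memory.fraction", "0.75")         # heap share for execution + storage combined
            .set("spark.memory.storageFraction", "0.5"))  # portion of that pool protected from eviction
    sc = SparkContext(conf=conf)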

ML/MLlib

There’s a bunch of new algorithms and statistics functionality available in 1.6, including survival models, bisecting k-means, and A/B testing models in Spark Streaming. There’s also a bunch of new model diagnostic outputs, bringing Spark’s modelling capabilities much more in line with other tools like R, SAS, and SPSS.
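As one example, here is a minimal sketch of the new survival model (accelerated failure time regression) in PySpark 1.6, using a tiny made-up dataset and an existing sqlContext:

    from pyspark.ml.regression import AFTSurvivalRegression
    from pyspark.mllib.linalg import Vectors

    # label = observed time, censor = 1.0 if the event occurred, 0.0 if censored
    training = sqlContext.createDataFrame([
        (1.218, 1.0, Vectors.dense(1.560, -0.605)),
        (2.949, 0.0, Vectors.dense(0.346, 2.158)),
        (3.627, 0.0, Vectors.dense(1.380, 0.231)),
        (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    ], ["label", "censor", "features"])

    model = AFTSurvivalRegression().fit(training)
    model.transform(training).show()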

Python Support

Python has gained significant capabilities in 1.6, bringing it close to being a first-class citizen like Scala, especially in Spark Streaming. Prior to 1.6, if you were writing streaming applications, you pretty much had to do it in Scala; now you have more options depending on the nature of the application.
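For context, a basic PySpark Streaming job looks like the sketch below. The core DStream API shown here actually predates 1.6; the 1.6-era improvements are mostly about bringing more sources and streaming features to Python. It assumes an existing SparkContext sc and something writing lines to localhost:9999.

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1)  # 1-second micro-batches
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()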

R Support

R support has gained a large number of improvements in 1.6, making the transition for R users much simpler.

Sparking a language debate

The Spark revolution is igniting a vigorous debate among data operations, data science, and analytics professionals. Spark supports four different programming languages (not including SQL), which makes choosing a language difficult. Although you can learn multiple languages, it takes a long time to master any one. If you are someone who wants to get started with Spark, this fundamental choice will be one of the more important decisions you make. Let’s start by comparing the different languages available in Spark.

Java

A mature object-oriented language with a large developer base and loyal following. Many of the desktop and web applications we all know and love were developed in Java. It is one of the core languages taught in computer science programs: the basics are easy to learn, and it is a great language for teaching object-oriented concepts. It also has a huge ecosystem of proprietary and open-source libraries and components. Java rose in popularity in large part due to its Java Virtual Machine (JVM) technology, which allows applications to be portable across different platforms and operating systems. Its main drawbacks are the complex nature of the ecosystem, slower performance than lower-level languages, and verbose language constructs.

Scala

Scala is also an object-oriented language, developed to run on the Java Virtual Machine (JVM). Because it runs within the JVM, all libraries built for Java can be used directly from Scala. It maintains many of the benefits of Java while fixing two of Java’s main issues: verbose language constructs and the lack of a REPL (Read-Evaluate-Print Loop). A REPL lets programmers try out code interactively, eliminating the need for big compile steps to test bits of code; this is a must-have feature for many data science and analytics use cases. Scala’s popularity has grown tremendously over the past few years, and it is the language most of Apache Spark was developed in, so many features come to the Scala API first.

Python

Python is a popular general-purpose object-oriented programming language, with a strong emphasis on simple, readable, and maintainable code. It has a large developer base, and is often taught not only in computer science courses, but also in relevant engineering and science courses. It is a very intuitive language focused on a few core foundational concepts. Like Scala, it has a REPL, which, combined with the simple syntax, leads to rapid application development. Code style choice has been purposefully restricted, as some style aspects, such as indentation, change the meaning of the code. It has a vast library of add-on modules, including data processing, data science, and scientific computing functionality. It also has the advantage of a pre-built, Spark-compatible DataFrame object in the pandas module (built on top of NumPy), complete with built-in graphing routines. The main drawback to Python is performance, which can be considerably slower than other languages for certain operations. There is, however, a separate implementation called PyPy, which offers dramatically improved performance for many workloads.
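For instance, a small Spark DataFrame (here a hypothetical df aggregated by a year column) can be pulled into pandas and plotted in a couple of lines, assuming matplotlib is installed:

    # Collect a small aggregate into pandas and use its built-in plotting
    pdf = df.groupBy("year").count().toPandas()
    pdf.plot(x="year", y="count", kind="bar")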

R

R has been gaining popularity for statistical and analytics applications over the past few years. Developed in large part by statisticians, it has a very narrow focus on purely statistical applications. Its main strengths are the vast number of available statistical procedures, along with data processing and graphing capabilities. The language is not very readable or intuitive, especially to non-statisticians, and the performance of its statistical procedures is not as good as that of equivalents available in other languages.

What to choose

The Spark APIs are very similar across all the languages. Regardless of the language you choose, the code you write when interacting with the Spark cluster will have a very similar syntax. The difference lies in what happens when data gets returned to the driver node, and in the wrappers you put around the Spark API code. That means the performance of the underlying language is not a big deal in most cases, especially when using DataFrames or the upcoming Datasets API. Each language has its strengths, but here are some general recommendations.
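To make that concrete, here is a hedged PySpark sketch (assuming a hypothetical df with region and amount columns): the heavy lifting runs on the cluster inside the JVM regardless of the API language, and only the small result crosses back into the driver process.

    # Planned by Catalyst and executed on the cluster, not in the Python process
    summary = df.groupBy("region").agg({"amount": "sum"})

    # Only this step pulls the (small) result back to the driver
    rows = summary.collect()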

If you are working on data operations or application development, Scala‘s access to all that Java has to offer, without the complex code, makes it the preferred choice. You’ll need to know Scala if you want to contribute to or tinker with the Spark code base. If you’re doing any data science work, the simplicity and speed of development make Python the preferred choice. If you are integrating Spark into an existing Java application, it makes sense to use the Java APIs. I would only recommend using R if you are an academic statistician or need access to analytical features that aren’t available elsewhere. Over time, Spark will likely absorb more and more of R‘s analytical features into the ML/MLlib modules, as they are rewritten to take advantage of Spark’s scalability and parallelism.

To summarize, I view Java and R as serving niche purposes within the Spark ecosystem; the real choice is between Scala and Python. Personally, I love Python, as I find it easy to learn, intuitive, and quick for expressing complex concepts.