The next major release of Apache Spark is now available. As usual, it contains a huge number of improvements across the whole platform. Here’s a quick rundown of some of the big changes:
New Dataset API
The DataFrames API and its major advantages (especially the use of the Catalyst optimizer) have been extended into the territory of RDDs through a new API called Datasets. With Datasets, you lose a small amount of flexibility versus RDDs (namely, Datasets can’t contain just any arbitrary Java object), but in return you get massive performance and scalability gains. Going forward, most users should be using the Dataset and DataFrame APIs, and should only use RDDs when absolutely necessary. RDDs will become a low-level concept that is rarely used. Unfortunately for Python and R fans, you’ll have to wait a bit longer for the Dataset API, as it is available in Scala and Java only at this point. My guess is that it’ll arrive in the next release, which may be 1.7 or might be 2.0.
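To make the trade-off concrete, here is a minimal Scala sketch against the 1.6 Dataset API. The `Person` case class, the sample rows, and the `adultNames` helper are made up for illustration, and it assumes Spark 1.6 is on the classpath and run in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type. Datasets need a type Spark can build an encoder for,
// such as a case class, rather than any arbitrary Java object.
case class Person(name: String, age: Long)

object DatasetSketch {
  // Hypothetical helper: typed, lambda-based transformations, much like an RDD.
  def adultNames(sqlContext: SQLContext, people: Seq[Person]): Array[String] = {
    import sqlContext.implicits._ // brings encoders and toDS() into scope
    val ds = people.toDS()        // local Seq -> Dataset[Person]
    ds.filter(_.age >= 21)        // still a Dataset[Person]
      .map(_.name)                // Dataset[String]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dataset-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val sample = Seq(Person("Ann", 34), Person("Bob", 19))
    adultNames(sqlContext, sample).foreach(println)

    sc.stop()
  }
}
```

The code reads much like the equivalent RDD code, but the element type is known to Spark, which is what lets it apply the same optimizations it uses for DataFrames.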
Memory Management and Performance
Spark 1.6 is much more intelligent about how memory is shared between cached data and execution. Prior to 1.6, the two areas were separated, leading to inefficient memory utilization. Along with the memory management enhancements, there’s a raft of performance enhancements, some related to Tungsten and Catalyst, but also some on the storage side.
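For anyone who wants to tune this, the shared memory region in 1.6 is controlled by a handful of settings. The Scala sketch below sets them explicitly to what I understand to be their 1.6 defaults, so it should change nothing in practice; the `MemoryConfSketch` object name is made up for illustration.

```scala
import org.apache.spark.SparkConf

object MemoryConfSketch {
  // Passing `false` skips loading spark.* system properties,
  // so only the values set below are present.
  val conf: SparkConf = new SparkConf(false)
    // Share of the JVM heap (after a small reserved amount) used for
    // execution and storage combined.
    .set("spark.memory.fraction", "0.75")
    // Portion of that shared region where cached data is protected
    // from being evicted by execution.
    .set("spark.memory.storageFraction", "0.5")
    // "true" would fall back to the separate, fixed-size regions used before 1.6.
    .set("spark.memory.useLegacyMode", "false")
}
```

Most applications should be able to leave all three alone; the legacy switch is mainly useful for comparing behaviour against an older release.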
There’s a bunch of new algorithms and stats functionality available in 1.6, including survival models, bisecting k-means, and A/B testing models in Spark Streaming. There’s also a bunch of new model diagnostic outputs, bringing Spark’s modelling capabilities much more in line with other tools like R, SAS, or SPSS.
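As one example of the new algorithms, here is a minimal Scala sketch of bisecting k-means using the `BisectingKMeans` class in the RDD-based MLlib package. The sample points and the `clusterCenters` helper are made up for illustration, and it assumes Spark 1.6 running in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object BisectingKMeansSketch {
  // Hypothetical helper: fit a model and return the cluster centers it found.
  def clusterCenters(sc: SparkContext, points: Seq[Vector], k: Int): Array[Vector] = {
    val data = sc.parallelize(points).cache()
    val model = new BisectingKMeans().setK(k).run(data)
    model.clusterCenters
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("bisecting-kmeans-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Two well-separated groups of made-up 2D points.
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    )
    clusterCenters(sc, points, k = 2).foreach(println)

    sc.stop()
  }
}
```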
Python has gained significant capabilities in 1.6, bringing it close to being a first-class citizen like Scala, especially in Spark Streaming. Prior to 1.6, if you were writing streaming applications, you pretty much had to do it in Scala; now you have some more options, depending on the nature of the application.
R support has received a large number of improvements in 1.6, making the transition for R users much simpler.