Spark 1.6.0 is Here

The next major release of Apache Spark is now available. As usual, it contains a huge number of improvements across the whole platform. Here’s a quick rundown of some of the big changes:

New Dataset API

The DataFrames API and its major advantages (especially its use of the Catalyst optimizer) have been extended into RDD territory through a new API called Datasets. With Datasets, you lose a small amount of flexibility compared with RDDs (namely, Datasets can’t contain arbitrary Java objects), but in return you get significant performance and scalability gains. Going forward, most users should use the Dataset and DataFrame APIs and fall back to RDDs only when absolutely necessary; RDDs will become a low-level concept that is rarely used. Unfortunately for Python and R fans, you’ll have to wait a bit longer for the Dataset API, as it is available only in Scala and Java at this point. My guess is that Python and R support will arrive in the next release, which may be 1.7 or might be 2.0.

Memory Management and Performance

Spark 1.6 is much more intelligent about how memory is shared between cached data (storage) and execution. Prior to 1.6, the two areas were separate and fixed in size, which led to inefficient memory utilization: one could sit idle while the other ran short. Along with the memory management enhancements, there’s a raft of performance enhancements, some related to Tungsten and Catalyst, and some on the storage side as well.
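To make the idea of a shared region concrete, here is a minimal toy model in plain Python. It is not Spark’s implementation: the class name, method names, and numbers are all hypothetical, and it only illustrates the borrowing rule as I understand it, where execution may evict cached data down to a protected floor while storage may only use space that is actually free. In Spark itself that floor is controlled by the `spark.memory.storageFraction` setting.

```python
class UnifiedMemoryPool:
    """Toy model (hypothetical, not Spark code) of one region shared by
    storage (cached data) and execution (shuffles, joins, sorts)."""

    def __init__(self, total, storage_floor):
        self.total = total                  # size of the shared region, arbitrary units
        self.storage_floor = storage_floor  # cached data at or below this level is never evicted
        self.storage_used = 0
        self.execution_used = 0

    def free(self):
        return self.total - self.storage_used - self.execution_used

    def acquire_storage(self, amount):
        """Cache data only if enough space is free; storage never evicts execution."""
        if amount > self.free():
            return False
        self.storage_used += amount
        return True

    def acquire_execution(self, amount):
        """Grant execution memory, evicting cached data above the floor if needed.
        Returns the amount actually granted, which may be less than requested."""
        shortfall = amount - self.free()
        if shortfall > 0:
            evictable = max(0, self.storage_used - self.storage_floor)
            self.storage_used -= min(shortfall, evictable)
        granted = min(amount, self.free())
        self.execution_used += granted
        return granted


pool = UnifiedMemoryPool(total=100, storage_floor=30)
pool.acquire_storage(80)           # the cache borrows well past a 50/50 split
print(pool.acquire_execution(50))  # 50: cached data is evicted to make room
print(pool.storage_used)           # 50
```

With a fixed split, the first request for 80 units of cache would simply have failed even though the execution side was empty; in the shared model it succeeds and is clawed back only when execution needs the space.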

ML/MLlib

There’s a bunch of new algorithms and statistics functionality available in 1.6, including survival models, bisecting k-means, and A/B testing models in Spark Streaming. There’s also a bunch of new model diagnostic outputs, bringing Spark’s modelling capabilities much more in line with other tools like R, SAS, or SPSS.

Python Support

Python has gained significant capabilities in 1.6, bringing it close to being a first-class citizen like Scala, especially in Spark Streaming. Prior to 1.6, if you were writing streaming applications, you pretty much had to do it in Scala; now you have more options, depending on the nature of the application.

R Support

R support has gained a large number of improvements in 1.6, making the transition for R users much simpler.

 

 

ScyllaDB: Apache Cassandra C++ Rewrite (claims up to 10x faster)

Some skepticism is warranted, because this announcement is almost unbelievable. The world’s most scalable wide-column NoSQL database might have just gotten a lot faster. ScyllaDB is a rewrite in C++ of Apache Cassandra, already a heavyweight in the world of Big Data. It uses a new lockless, shared-nothing architecture, which its developers say delivers much higher throughput and lower latency. It’s API compatible with Cassandra, so existing code and tools should work without modification. Only time will tell if it is really as good as claimed, but it may be a dream come true for many data engineers and data scientists. For more information, check out ScyllaDB’s website at http://www.scylladb.com/.
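For readers unfamiliar with the term, a shared-nothing design gives each unit of execution (commonly one per CPU core) exclusive ownership of a slice of the data, so no locks are needed to protect it. The sketch below is a single-threaded toy in plain Python that shows only that ownership rule. It is not ScyllaDB’s code: the `Shard` and `ShardedStore` names are hypothetical, and using CRC32 to pick a shard is my own assumption for illustration.

```python
import zlib


class Shard:
    """One shard owns a private slice of the data; no other shard touches it."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class ShardedStore:
    """Routes every key to exactly one shard (hypothetical toy, not a real database)."""

    def __init__(self, num_shards):
        self.shards = [Shard(i) for i in range(num_shards)]

    def shard_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self.shard_for(key).put(key, value)

    def get(self, key):
        return self.shard_for(key).get(key)


store = ShardedStore(num_shards=4)
store.put("user:42", "Ada")
print(store.get("user:42"))  # Ada
```

Because a given key always lands on the same shard, each shard can process its own requests without coordinating with the others, which is the property that makes a lockless design possible.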

Spark Release 1.5.0

The new release of Apache Spark brings large improvements in performance and functionality, including:

  • Significant expansion of DataFrame functionality with over 100 new functions
  • Integration of Project Tungsten with massively improved performance and response consistency (eliminating or reducing JVM GC pauses)
  • Improved Python support, bringing API compatibility much closer to Scala and Java

For more information, check out https://spark.apache.org/news/spark-1-5-0-released.html.