The summer of Spark 2.0

Summer is almost upon us, the birds are chirping, the frogs are croaking, and the smell of 2.0 is in the air. The great team of Spark developers has turned a corner and is busy fixing and polishing 2.0 for release, likely coming in 6-8 weeks from now. I’m so excited by what’s coming that for the past month I’ve been using the nightly snapshot builds for all my work. There are a few kinks and oddities to work out, some of which I’m hoping to find the time to fix myself and contribute back. On the whole, it’s already very usable for non-mission-critical applications. Here are my impressions so far.

  • For Python users, 2.0 finally brings almost complete feature parity, which is awesome. Scala is great, as it fixes many of the things that are broken in Java, but for data science, Python is king.
  • Performance is unbelievably good and consistent, and I cannot stress this enough. Workloads similar to what I was running two years ago in SAS on a 16-core Xeon with 256GB of RAM run an order of magnitude faster on Spark 2.0 on my laptop! The work done merging Datasets and Dataframes, the optimizations to the Catalyst engine, whole-stage code generation, and enhancements to Project Tungsten are really paying off. The switch in the default compression format for Parquet from GZIP to Snappy probably also has a big impact in this area: with the move to Snappy, compression throughput goes from roughly 100MB/s per core to 500MB/s per core.
  • Finally, CSV becomes a first-class format in Spark! This was an ugly omission in 1.x, as requiring a plugin for such a common format was difficult to explain to outsiders. Along with the built-in support, there are many more options for dealing with different forms of CSV and for error handling.
  • Code compatibility with 1.6 is very good, even with plugins. I’m especially impressed by this, given that whole subsystems have been completely rewritten or replaced. My 1.6 code runs without issue on 2.0, even the now-deprecated CSV plugin code.
  • I haven’t yet tried out the Structured Streaming components; I think there’s still a bit of work going on there. I’m very excited about the design for this. Basically, the streaming engine is moving to Dataframes, with all the benefits that come with that. This is similar to the project under way to migrate GraphX to Dataframes, called GraphFrames (though that won’t likely make it in for 2.0, but is available via plugin).

I still do have a few critiques of Spark that 2.0 isn’t likely to fix:

  • Dataframe functions and operators are still pretty restrictive and inconsistent. It’s getting better, but I long for the day when I don’t need to jump to UDFs or RDDs/Datasets for managing complex transformations. My guess is that 2.1/2.2 will bring many improvements in this area, as the focus of 2.0 was structural, not functional.
  • Many error messages still produce awful JVM stack traces. This isn’t really acceptable for end-user software, as it’s very disconcerting to non-developers (i.e. data scientists). The default error messages should be intuitive, clear, and, where at all possible, instructive. They certainly should not contain stack traces unless enabled via a run-time option. Unfortunately, this one is likely going to stick around for a long time, as the developer community isn’t likely to treat it as a priority.