Come join us at Spark Summit Europe

The team here at WiseWithData will be traveling across the big pond in a few weeks to attend Spark Summit Europe 2017 in Dublin. This fantastic event brings together some of the top data science professionals from around the globe to share ideas and knowledge about data science and Apache Spark. This will be our 3rd Spark Summit, and it’s looking like it will be the most exciting yet. If you are attending, please let us know. We are always excited to meet new people. Send us an email at



Spark 2.2 is out…and we have a surprise giveaway to celebrate

After over 3 months of intensive testing, Spark 2.2 was released today. This release contains over 1,100 improvements, bug fixes, and new features. Most importantly, Spark now has a Cost-Based Optimizer (CBO). Huawei and Databricks worked together on the CBO and on the ability to collect table and column level statistics, which allow more intelligent optimization of physical query plans. The CBO was needed for Spark to round out its data warehousing capabilities. There’s also a slew of great enhancements to ML, and for the first time ever, near complete feature parity between Python and Scala. R gets much better support now as well. Structured Streaming goes “mainstream” with additional enhancements and the removal of the “experimental” tag. With Structured Streaming, the DataFrame batch and stream interfaces are almost identical, making development and code reuse a snap. Streams can now even be queried in real time on live data.
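For anyone who wants to try the CBO, here is a minimal sketch of how I understand it is switched on and fed with statistics. The `spark.sql.cbo.enabled` setting and the `ANALYZE TABLE` statements are the pieces I believe are involved; the helper functions, the `sales` table and its column names are hypothetical and only there for illustration.

```python
def analyze_statements(table, columns=()):
    """Build the SQL that collects table-level and column-level statistics."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


def enable_cbo(spark, table, columns=()):
    """Turn on the cost-based optimizer and gather statistics for one table.

    `spark` is an existing SparkSession. The CBO is assumed to be off by
    default, so it has to be enabled explicitly.
    """
    spark.conf.set("spark.sql.cbo.enabled", "true")
    for statement in analyze_statements(table, columns):
        spark.sql(statement)


# Usage inside a Spark 2.2 session (table and columns are hypothetical):
#   enable_cbo(spark, "sales", ["region", "amount"])
```

The statistics are only as fresh as the last `ANALYZE TABLE` run, so it is probably worth re-running it after large loads.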

To celebrate all the great work that has gone into Spark 2.2, we are giving away free PySpark 2.2 quick reference guides. Our PySpark quick reference guides, which are typically only provided to students in our courses, are a single double-sided page, and provide at a glance lookups for core object types, functions and capabilities in PySpark. To get your free copy, simply send us an email to

Announcing Free Spark 101 Training Day

We are proud to be offering a free training day in Ottawa on Tuesday April 25, 2017. Please email if you would like to attend. Spaces are limited. Below is a syllabus of the topics that will be covered, which includes live coding exercises.

Spark 101 Training Day Syllabus

  • Distributed Computing Basics
    • Grid and Cluster Computing
    • Partitioning
    • Map Operations
    • Reduce Operations
    • Data Skew
  • Spark Architecture
    • Drivers, Workers and Executors
    • RDD, Dataframes and Datasets
    • Lazy Execution
    • Memory and Caching
    • The DAG and the Optimizer
    • Shuffle and Broadcast
    • Spark ML and GraphX
    • APIs
  • The Spark Ecosystem
    • Cluster Managers
    • Data and Job Orchestration
    • Workbooks
    • 3rd Party Packages
    • The Thrift Server
  • Apache Zeppelin Workbook
    • Interpreters
    • Graphs
    • Exporting results
  • Python Fundamentals
    • Python philosophy
    • Syntax
    • Data structures
    • Data Science Ecosystem
  • The PySpark API overview
    • The Spark Context
    • Data Structures
    • Libraries
    • SQL
    • ML Pipelines
    • Streaming
    • Dataframe Deep Dive
  • Spark Programming Exercises
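
As a small taste of the first block of the syllabus, here is a plain-Python sketch (no Spark required) of partitioning, a map step, a reduce step, and data skew. It is a toy illustration, not how Spark implements these internally; the function names and the word list are made up for the example.

```python
import zlib
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Hash-partition records so equal keys land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for record in records:
        index = zlib.crc32(record.encode("utf-8")) % num_partitions
        parts[index].append(record)
    return parts


def map_partition(part):
    """Map step: count the keys within a single partition."""
    return Counter(part)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


# One very common key ("spark") makes one partition much larger: data skew.
words = ["spark"] * 6 + ["python", "scala", "r", "java"]
parts = partition(words, 3)
partials = [map_partition(part) for part in parts]
totals = reduce(reduce_counts, partials, Counter())
sizes = [len(part) for part in parts]

print(totals)
print("partition sizes:", sizes)
```

Whichever partition receives "spark" holds at least six of the ten records, so the worker handling it does most of the work while the others sit idle.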

The summer of Spark 2.0

Summer is almost upon us, the birds are chirping, the frogs are croaking, and the smell of 2.0 is in the air. The great team of Spark developers has turned a corner and is busy fixing and polishing 2.0 for release, likely coming 6-8 weeks from now. I’m so excited by what’s coming that for the past month I’ve been using the nightly snapshot builds for all my work. There are a few kinks and oddities to work out, some of which I’m hoping to find the time to fix myself and contribute back. On the whole, it’s already very usable for non-mission-critical applications. Here are my impressions so far.

  • For Python users, 2.0 finally brings almost complete feature parity, which is awesome. Scala is great, as it fixes many things broken with Java, but for data science, Python is king.
  • Performance is unbelievably good and consistent; I cannot stress this enough. Things similar to what I was doing 2 years ago in SAS on a 16-core Xeon with 256GB of RAM run an order of magnitude faster on Spark 2.0 on my laptop! The work done merging Datasets and Dataframes, the optimizations to the Catalyst engine, Whole Stage Code Gen, and the enhancements to project Tungsten are really paying off. The switch in the default compression format for Parquet from GZIP to Snappy probably also has a big impact in this area. With the move to Snappy, compression throughput went from roughly 100MB/s per core to 500MB/s per core.
  • Finally CSV becomes a first class format in Spark! This was an ugly omission in 1.x, as requiring a plugin for such a common format was difficult to explain to outsiders. Along with the inclusion, there are many more options for dealing with different forms of CSV and for error handling.
  • Code compatibility with 1.6 is very good, even with plugins. I’m especially impressed by this, given that whole subsystems have been completely rewritten or replaced. My 1.6 code runs without issue on 2.0, even the now deprecated CSV plugin code.
  • I haven’t yet tried out the Structured Streaming components; I think there’s still a bit of work going on there. I’m very excited about the design for this. Basically the streaming engine will be moving to Dataframes, with all the benefits that come with that. This is similar to GraphFrames, the project under way to migrate GraphX to Dataframes (that won’t likely make it in for 2.0, but it is available via plugin).
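
To make the CSV point concrete, here is a minimal sketch of reading a CSV file with the reader that is built in from 2.0 onward. The option names (`header`, `inferSchema`, `mode`) are reader options as I understand them; the helper functions, the default values I picked, and the file path are my own assumptions for illustration.

```python
# Defaults chosen for this sketch, not Spark's own defaults.
CSV_OPTIONS = {
    "header": "true",         # first line holds the column names
    "inferSchema": "true",    # sample the data to guess column types
    "mode": "DROPMALFORMED",  # silently drop rows that fail to parse
}


def csv_options(**overrides):
    """Return the sketch defaults with any caller overrides applied."""
    options = dict(CSV_OPTIONS)
    options.update(overrides)
    return options


def read_csv(spark, path, **overrides):
    """Read a CSV file into a Dataframe using the built-in reader.

    `spark` is an existing SparkSession.
    """
    return spark.read.options(**csv_options(**overrides)).csv(path)


# Usage inside a Spark 2.x session (the path is hypothetical):
#   df = read_csv(spark, "data/sales.csv", mode="FAILFAST")
```

Switching `mode` to `FAILFAST` makes the read stop on the first bad row instead of dropping it, which is usually what I want while developing.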

I still do have a few critiques of Spark that 2.0 isn’t likely to fix:

  • Dataframe functions and operators are still pretty restrictive, and inconsistent. It’s getting better, but I long for the day when I don’t need to jump to UDFs or RDD/Datasets for managing complex transformations. My guess is that 2.1/2.2 will bring many improvements in this area, as the focus on 2.0 was structural, not functional.
  • Many error messages still produce awful JVM stack traces. This isn’t really acceptable for end-user software, as it’s very disconcerting to non-developers (i.e. data scientists). The default error messages should be intuitive, clear, and if at all possible instructive. They certainly should not contain stack traces, unless enabled via a run-time option. Unfortunately this one is likely going to stick around for a long time, as the developer community won’t likely think of this as a priority.
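
To show what “jumping to UDFs” looks like in practice, here is a minimal sketch. The cleaning logic is ordinary Python; wrapping it with `udf` is what makes it callable on a Dataframe column. The postal code example, the function names and the column name are hypothetical, and the PySpark import is kept inside the wrapper so the plain function can be tried without a Spark installation.

```python
import re


def clean_postal_code(raw):
    """Normalize a Canadian-style postal code, or return None if it is invalid."""
    if raw is None:
        return None
    code = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if len(code) != 6:
        return None
    return code[:3] + " " + code[3:]


def as_spark_udf():
    """Wrap the plain function as a Spark UDF that returns a string column."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(clean_postal_code, StringType())


# Usage inside a Spark session (the column name is hypothetical):
#   df = df.withColumn("postal_code", as_spark_udf()(df["postal_code"]))
```

Every row has to cross from the JVM into a Python worker and back, which is exactly why I would rather see this kind of thing covered by built-in functions.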


Spark 1.6.0 is Here

The next major release of Apache Spark is now available. As usual, it contains a huge number of improvements across the whole platform. Here’s a quick rundown of some of the big changes:

New Dataset API

The Dataframes API and its major advantages (especially use of the Catalyst optimizer) have been extended into the area of RDDs through a new API called Datasets. With Datasets, you lose a small amount of flexibility vs RDDs (namely that Datasets can’t contain any arbitrary Java object), but in return you get massive performance and scalability gains. Going forward, most users should be using the Datasets and Dataframes APIs, and only use RDDs when absolutely necessary. RDDs will become a low level concept that is rarely used. Unfortunately for Python and R fans, you’ll have to wait a bit longer to have the Dataset APIs available, as they are Scala only at this point. My guess is that support will arrive in the next release, which may be 1.7, or might be 2.0.

Memory Management and Performance

Spark 1.6 is much more intelligent about how memory is managed between data and execution. Prior to 1.6, the two areas were separated, leading to inefficient memory utilization. Along with the memory management enhancements, there’s a raft of performance enhancements, some related to Tungsten and Catalyst, but also on the storage side as well.
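
To see why separated areas waste memory, here is a toy model in plain Python. It is a deliberate simplification: the numbers are made up, and Spark’s real rules for borrowing and eviction are more nuanced than “execution first, storage gets the rest”. The function names are mine.

```python
def static_split(total_mb, storage_share, storage_need_mb, execution_need_mb):
    """Pre-1.6 style: each area has a fixed cap and cannot borrow."""
    storage_cap = total_mb * storage_share
    execution_cap = total_mb - storage_cap
    storage = min(storage_need_mb, storage_cap)
    execution = min(execution_need_mb, execution_cap)
    return storage, execution


def unified_pool(total_mb, storage_need_mb, execution_need_mb):
    """Toy unified pool: execution takes what it needs, storage uses the rest."""
    execution = min(execution_need_mb, total_mb)
    storage = min(storage_need_mb, total_mb - execution)
    return storage, execution


# Hypothetical workload: a small cache (200MB) and a large shuffle (700MB)
# competing for 1000MB of managed memory.
print(static_split(1000, 0.5, 200, 700))
print(unified_pool(1000, 200, 700))
```

With the fixed 50/50 split, execution is capped at 500MB and would have to spill while 300MB of storage memory sits unused; with a shared pool both requests fit.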


There’s a bunch of new algorithms and stats functionality available in 1.6, including Survival Models, Bisecting K-means, and A/B testing models in Spark Streaming. There’s also a bunch of new model diagnostic outputs, bringing Spark’s modelling capabilities much more in line with other tools like R, SAS, or SPSS.

Python Support

Python has gained significant capabilities in 1.6, bringing it close to being a first class citizen like Scala, especially in Spark Streaming. Prior to 1.6, if you were writing streaming applications, you pretty much had to do it in Scala; now you have some more options, depending on the nature of the application.

R Support

R support has gained a large number of improvements in 1.6, making the transition for R users much simpler.



When seeing the future is not enough

When IBM announced in June that they would be making a major investment in Apache Spark, few details were revealed. They only said that 3,500 developers would be assigned to working on Spark. Last week we got a bit more detail on what exactly they have in mind. It would appear as though at least part of IBM’s strategy is to offer Spark-as-a-service integrated with their other cloud offerings.

With the exception of offering some integration with outdated and under-invested tools that IBM customers are stuck with, I’m not seeing where the value proposition is here. Databricks already offers a great cloud based Spark solution; I would have expected IBM to try and differentiate itself a bit more. After 14 consecutive quarters of revenue declines, despite a huge raft of acquisitions, IBM appears to be struggling to find its place in the new open computing world.

Let’s commend them for being able to spot the trends, and for giving a huge endorsement to Spark. They’ve often been able to spot the trends, as they did so famously with Linux, and with analytics in general, and of course now with Spark. They invested many billions into Linux as well as analytics (through many acquisitions), but those investments do not appear to have stemmed their revenue losses. Seeing the future, and adapting and profiting from it are two entirely different things.

IBM is of course not alone in its misery; nearly all the old world business computing companies are struggling. HP, founded in 1939, a company that also invested considerably in analytics, just completed one of the largest voluntary corporate breakups in history. Other large software companies founded decades ago, who’ve made significant investments in their own proprietary analytics tools, are also struggling to remain relevant in a world where companies are embracing the open-source Apache Spark, which is immeasurably superior in capabilities and cost to those tools. All of this leads to one interesting observation: many of the business computing companies founded before the dawn of the Web era seem unable to adapt to the realities of the web and the open systems it has fostered.

That’s not to say there isn’t money to be made in the open world, far from it; Red Hat taught us that. It’s just that it takes a radically new way of thinking about your customers and how to solve their business problems. Gone are the days of proprietary mass market software driving huge margins, either directly or via services lock-in. Mass-market tools, especially analytical ones, are a commodity now; so get over it and adapt. What the open source community is unlikely to do, though, is build a fraud detection platform for insurers, or a route optimization solution for public transit systems, or a pricing optimization solution for grocery stores. What companies like IBM, HP, and others should be doing is building niche analytical solutions on top of open systems to solve their customers’ business problems. That requires a big change in thinking, one that moves away from centralized R&D developers, and towards solutions built in the field.

So for IBM, this means ditching their plan to build connectors to Spark for all their existing proprietary tools and their build out of a Spark cloud. What they should be doing is building and training a huge team of highly skilled data scientists and analytics specialists, who know how to build stuff in Spark to solve real customer business problems, not just create elaborate publicity stunts.


Sparking a language debate

The Spark revolution is igniting a vigorous debate among data operations, data science and analytics professionals. Spark supports 4 different programming languages (not including SQL), which makes choosing one difficult. Although you can learn multiple languages, it takes a long time to master any one. If you are someone who wants to get started with Spark, this fundamental choice will be one of the more important choices you make. Let’s start by comparing the different languages available in Spark.


Java is a mature, purely object-oriented language with a large developer base and a loyal following. Many of the desktop and web applications we all know and love were developed in Java. It is one of the core languages taught in computer science programs, as the basics are very easy to learn and it is a great language for teaching object-oriented concepts. It also has a huge ecosystem of proprietary and open-source libraries and components. Java rose in popularity in large part due to its Java Virtual Machine (JVM) technology, which allows applications to be portable across different platforms and operating systems. Its main drawbacks are the complex nature of the ecosystem, slower performance than lower level languages, and verbose language constructs.


Scala is also an object-oriented language, developed to run within the Java Virtual Machine (JVM). As it runs within the JVM, all libraries built for Java can simply be used within Scala. It maintains many of the benefits of Java while fixing two of the main issues with Java: its verbose language and its lack of a REPL (Read-Evaluate-Print-Loop). A REPL lets programmers try out code interactively, eliminating the need for big compile steps to test bits of code, a must-have feature for many data science and analytics use cases. Scala’s popularity has grown tremendously over the past few years, and it is the language that most of Apache Spark was developed in, so many features come to the Scala API first.


Python is a popular general purpose object-oriented programming language, with a strong emphasis on simple, readable, and maintainable code. It has a large developer base, and is often taught not only in computer science courses, but also in relevant engineering and science courses. It is a very intuitive language, focused on only a few core foundation concepts. Like Scala, it has a REPL, which combined with the simple code syntax leads to rapid application development. Code style choice has been purposefully restricted, as some style aspects, such as indentation, change the meaning of code. It has a vast library of add-on modules, including data processing, data science and scientific data processing functionality. It also has the advantage that there is a pre-built Spark compatible DataFrame object in the pandas module, which has built-in graphing routines. The main drawback to Python is performance, which can be considerably slower than other languages for certain operations. There is however a separate implementation called PyPy, which has dramatically improved performance for many purposes.


R has been gaining in popularity for statistical and analytics applications for the past few years. Developed in large part by statisticians, it has a very narrow focus on pure statistical applications. Its main strengths are the vast number of available statistical procedures, along with data processing and graphing capabilities. The language is not very readable or intuitive, especially to non-statisticians, and the performance of its statistical procedures is not as good as that of those available in other languages.

What to choose

The Spark APIs are very similar in all the languages. Regardless of the language you choose, the code you write when interacting with the Spark cluster will have a very similar syntax. The difference lies in what happens when the data gets returned to the driver node, and in the wrappers you put around Spark API code. That means that performance of the underlying language is not a big deal in most cases, especially when using DataFrames or the upcoming Datasets API. Each language has its strengths, but here are some general recommendations.
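
Here is a minimal sketch of that split in Python. The first function is the part that talks to the cluster, and the equivalent chain of calls in the other languages looks much the same; the second is ordinary driver-side code that runs once the rows have come back. The function names and the `language` column are hypothetical.

```python
def count_by_language(df):
    """Cluster side: a Dataframe transformation, executed by Spark itself."""
    return df.groupBy("language").count()


def to_report(rows):
    """Driver side: plain Python over the collected rows.

    Each row only needs to support lookup by column name, which collected
    Spark rows do; plain dicts work as stand-ins when trying this out.
    """
    ordered = sorted(rows, key=lambda row: row["count"], reverse=True)
    return ["{}: {}".format(row["language"], row["count"]) for row in ordered]


# Usage inside a Spark session:
#   report = to_report(count_by_language(df).collect())

# Trying the driver-side half without a cluster:
sample = [{"language": "Scala", "count": 2}, {"language": "Python", "count": 5}]
print(to_report(sample))
```

Only the second function runs at the speed of the language you picked, and by then the data is usually small.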

If you are working on data operations or application development, Scala’s access to all that Java has to offer, without the complex code, makes it the preferred choice. You’ll need to know Scala if you want to contribute to or tinker with the Spark code base. If you’re doing any data science work, the simplicity and speed of development make Python the preferred choice. If you are integrating Spark into an existing Java application, it makes sense to use the Java APIs. I would only recommend using R if you are an academic statistician or need access to analytical features that aren’t available elsewhere. Over time, Spark will likely absorb more and more of R’s analytical features in the ML/MLlib modules, as they need to be rewritten to take advantage of the scalability and parallelism of Spark.

To summarize, I view Java and R as serving niche purposes within the Spark ecosystem; the real choice is between Scala and Python. Personally I love Python, as I find it easy to learn, intuitive, and simple and quick to express complex concepts within it.