After more than three months of intensive testing, Spark 2.2 was released today. This release contains over 1,100 improvements, bug fixes, and new features. Most importantly, Spark now has a Cost-Based Optimizer (CBO). Huawei and Databricks worked together on the CBO and on the ability to collect table- and column-level statistics, which the optimizer uses to choose more intelligent physical query plans. Spark needed the CBO to round out its data warehousing capabilities. There’s also a slew of great enhancements to ML and, for the first time ever, near-complete feature parity between Python and Scala. Beyond Python, R gets much better support as well. Structured Streaming goes “mainstream” with additional enhancements and the removal of the “experimental” tag. With Structured Streaming, the DataFrame batch and stream interfaces are almost identical, making development and code reuse a snap. Streams can now even be queried in real time on live data.
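As a rough illustration of the statistics collection mentioned above, here is a minimal PySpark-flavored sketch. It assumes you already have a SparkSession and a table registered in the catalog. The helper names (analyze_statements, collect_statistics) and the table and column names are hypothetical and chosen only for illustration. The ANALYZE TABLE statements and the spark.sql.cbo.enabled setting are the pieces that come from Spark itself, and to our understanding the CBO is switched off by default in 2.2.

```python
# Sketch only: helper names, table names, and column names are hypothetical.


def analyze_statements(table, columns):
    """Build the SQL that collects table-level and column-level statistics."""
    # Table-level statistics (row count, size in bytes).
    stmts = ["ANALYZE TABLE {} COMPUTE STATISTICS".format(table)]
    if columns:
        # Column-level statistics for the listed columns.
        stmts.append(
            "ANALYZE TABLE {} COMPUTE STATISTICS FOR COLUMNS {}".format(
                table, ", ".join(columns)
            )
        )
    return stmts


def collect_statistics(spark, table, columns):
    """Enable the CBO for this session and gather the statistics it relies on.

    `spark` is expected to be an existing pyspark.sql.SparkSession.
    """
    # Assumption: the CBO must be turned on explicitly in Spark 2.2.
    spark.conf.set("spark.sql.cbo.enabled", "true")
    for stmt in analyze_statements(table, columns):
        spark.sql(stmt)
```

With a live session, calling collect_statistics(spark, "sales", ["region", "amount"]) would run both statements against a hypothetical sales table. Queries planned afterwards can then draw on the collected statistics.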
To celebrate all the great work that has gone into Spark 2.2, we are giving away free PySpark 2.2 quick reference guides. These guides, which are typically provided only to students in our courses, fit on a single double-sided page and provide at-a-glance lookups for core object types, functions, and capabilities in PySpark. To get your free copy, simply send an email to email@example.com.