Learn more about Analytics and Spark @ PD Summit April 24th in Halifax

In collaboration with our partner Mariner Innovations, we’ll be presenting at the Professional Development Summit in Halifax on April 24th, 2017. Come join us to learn more about the exciting innovations happening in the world of analytics, with a fascinating talk on “Apache Spark: The new language of Analytics”. To sign up for the conference, head over to http://www.pdsummit.ca/.

Come join us at Spark Summit Europe

The team here at WiseWithData will be traveling across the big pond in a few weeks to attend Spark Summit Europe 2017 in Dublin. This fantastic event brings some of the top data science professionals from around the globe together to share ideas and knowledge about data science and Apache Spark. This will be our 3rd Spark Summit, and it’s looking like it will be the most exciting yet. If you are attending, please let us know. We are always excited to meet new people. Send us an email at inquiry@wisewithdata.com.

Cheers,

Ian

Spark 2.2 is out…and we have a surprise giveaway to celebrate

After over 3 months of intensive testing, Spark 2.2 was released today. This release contains over 1,100 improvements, bug fixes, and new features. Most importantly, Spark now has a Cost-Based Optimizer (CBO). Huawei and Databricks worked together on the ability to collect table and column level statistics, which the CBO uses to make more intelligent optimizations of physical query plans. The CBO was the feature Spark needed to round out its data warehousing capabilities. There’s also a slew of great enhancements to ML and, for the first time ever, near complete feature parity between Python and Scala. R gets much better support now as well. Structured Streaming goes “mainstream” with additional enhancements and the removal of the “experimental” tag. With Structured Streaming, the DataFrame batch and stream interfaces are almost identical, making development and code reuse a snap. Streams can now even be queried in real-time on live data.
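
To show how close the batch and streaming DataFrame interfaces now are, here is a minimal PySpark 2.2 sketch; the input path, schema, and the “events” table used for statistics collection are hypothetical examples, not code from the release itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType

spark = (SparkSession.builder
         .appName("spark22-demo")
         .config("spark.sql.cbo.enabled", "true")  # enable the new Cost-Based Optimizer
         .getOrCreate())

schema = (StructType()
          .add("event_time", TimestampType())
          .add("page", StringType()))

# Batch: read a static directory of JSON events and count page views
batch_counts = (spark.read.schema(schema).json("/data/events/")
                .groupBy("page").count())

# Streaming: the same transformation; only read/readStream and the output side differ
stream_counts = (spark.readStream.schema(schema).json("/data/events/")
                 .groupBy("page").count())
query = (stream_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

# CBO: collect table and column statistics so the optimizer can pick better join plans
# (assumes a hypothetical "events" table already registered in the catalog)
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS page")
```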

To celebrate all the great work that has gone into Spark 2.2, we are giving away free PySpark 2.2 quick reference guides. Our PySpark quick reference guides, which are typically only provided to students in our courses, are a single double-sided page, and provide at-a-glance lookups for core object types, functions and capabilities in PySpark. To get your free copy, simply send us an email at inquiry@wisewithdata.com.

Announcing Free Spark 101 Training Day

We are proud to be offering a free training day in Ottawa on Tuesday, April 25, 2017. Please email inquiry@wisewithdata.com if you would like to attend. Spaces are limited. Below is a syllabus of the topics that will be covered, which includes live coding exercises (see the sample exercise after the syllabus).

Spark 101 Training Day Syllabus

  • Distributed Computing Basics
    • Grid and Cluster Computing
    • Partitioning
    • Map Operations
    • Reduce Operations
    • Data Skew
  • Spark Architecture
    • Drivers, Workers and Executors
    • RDDs, DataFrames and Datasets
    • Lazy Execution
    • Memory and Caching
    • The DAG and the Optimizer
    • Shuffle and Broadcast
    • Spark ML and GraphX
    • APIs
  • The Spark Ecosystem
    • Cluster Managers
    • Data and Job Orchestration
    • Workbooks
    • 3rd Party Packages
    • The Thrift Server
  • Apache Zeppelin Workbook
    • Interpreters
    • Graphs
    • Exporting results
  • Python Fundamentals
    • Python philosophy
    • Syntax
    • Data structures
    • Data Science Ecosystem
  • The PySpark API overview
    • The Spark Context
    • Data Structures
    • Libraries
    • SQL
    • ML Pipelines
    • Streaming
    • DataFrame Deep Dive
  • Spark Programming Exercises
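
As a taste of the live coding portion, the snippet below is the kind of PySpark example the day works through; the sales.csv file and its columns are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is the entry point discussed in the PySpark API overview
spark = SparkSession.builder.appName("spark101-exercise").getOrCreate()

# Read a CSV file into a DataFrame, inferring the schema (hypothetical file and columns)
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transformations are lazy: nothing runs until an action is called
revenue_by_region = (sales
                     .filter(F.col("amount") > 0)
                     .groupBy("region")
                     .agg(F.sum("amount").alias("revenue")))

# show() is an action; it triggers the optimizer to build and execute the DAG
revenue_by_region.show()
```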

All I want for Christmas is My Two Dot One

Spark 2.1-rc5 was marked as the official Spark 2.1 just in time for Christmas. Streaming applications got the gold treatment, with Structured Streaming receiving a lot of attention in order to stabilize the API and engine. Event-time watermarks should make it much easier to deal with data arriving late. The Python API has gained a few new features from Scala and is now very close to 1-for-1 parity with Scala for the majority of use cases. Performance has improved a bit in niche areas, especially ML (LinearRegression, RandomForest and K-Means). Expect much bigger performance changes in 2.2 (ETA April 2017), with Star Schema optimizations and Cost-Based Optimization being the stars of the performance show.
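
As a rough illustration of the new event-time watermark support, the sketch below tolerates events arriving up to ten minutes late before computing windowed counts; the JSON source, schema, and column names are assumptions for the example, not code from the release.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

schema = (StructType()
          .add("event_time", TimestampType())
          .add("word", StringType()))

# Hypothetical streaming source of JSON events
events = spark.readStream.schema(schema).json("/data/stream/")

# Drop state for data more than 10 minutes late, then count words in 5-minute windows
windowed = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(F.window("event_time", "5 minutes"), "word")
            .count())

query = (windowed.writeStream
         .outputMode("append")
         .format("console")
         .start())
```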

Spark 2.0 is here

After months of eager waiting, it’s finally here! I’ve been using nightly builds for the last few months, but for those not willing to live on the edge, Spark 2.0 is now officially out. I could highlight all the wonderful changes and enhancements, but the good folks over at Databricks have done such a good job that I’m not going to bother. Check out their post here. Suffice it to say that there are over 2,500 changes and performance has been drastically improved in almost all areas. There are new and simpler ways to do many things, but backwards compatibility is still very much a focus.

IFRS9 is coming! Are you stuck hoping your legacy vendor can deliver?

The banking industry is under immense pressure to modernize the way in which risk is calculated for financial accounting purposes. IFRS9 means big changes in regulations, reporting requirements and, ultimately, portfolio strategies. With a January 1st, 2018 deadline looming, many bankers are understandably nervous about having the supporting systems in place to meet the deadline.

Unfortunately, many have pinned their hopes on legacy vendors with long track records of cost over-runs, late deliveries, and failures. This is no accident: most legacy vendors are pitching ancient technology, in some cases over 40 years old, to solve modern business problems. Not only is this incredibly inefficient, it leaves customers with brittle systems that only those within reach of retirement know how to support.

IBM’s Canadian Government payroll overhaul, a system named Phoenix, is but one recent example. IBM chose an ancient payroll system, built on top of an ancient database platform, as the foundation of this massive government “Modernization” initiative. Now hundreds of thousands of federal government employees are facing a crisis, trying to get money they are owed. In some cases employees haven’t been paid in over 6 months and are close to insolvency. Why are things such a mess? Development and maintenance on legacy systems are so complex and costly that it is nearly impossible to be agile when dealing with the complexities of modern business problems.

Solving requirements for IFRS9 using state-of-the-art technology is a far better use of resources. Not only can you address your requirements much more quickly, for a small fraction of the cost and complexity, you are also building towards the systems of tomorrow. For example, instead of relying on a dwindling pool of SAS programmers to develop and maintain your IFRS systems, start fresh using the latest high-productivity tools. Tools such as Cassandra, MongoDB, Python, and Spark, and simple interface stacks like HTML5, AJAX, Angular.js and Node.js, enable rapid development, testing and deployment of complex workflows. Resources that can develop systems using these technologies are far more available in the market, and in many cases less expensive.

The clock is ticking, hopefully everyone makes it across the finish line in time.

Is SAS still relevant?

I’m increasingly being asked the same question in a number of different ways. Essentially the question boils down to this: given that open-source analytics, and Apache Spark in particular, have become such a force to be reckoned with, does SAS still have any relevance? SAS is now over 40 years old, and has a large user base. Like its decades-old counterparts Fortran and COBOL, SAS will exist for many, many years to come.

The trend, though, is undeniable. Take, for example, the report done by KDnuggets, which clearly shows that all proprietary analytics solutions are in a steep and sustained decline. For SAS, the lack of big data ETL capabilities is probably the largest contributing factor to the decline. Because of technical deficits in the platform itself, users have no choice but to embrace Apache Spark for big data and streaming applications. The lack of unstructured data capabilities is also driving many SAS customers away. Ironically, SAS’s unstructured data capabilities are largely based in Python, due to the 32KB field limit of the traditional SAS processing engine.

Although the cost advantage of open-source is a contributing factor, I don’t think it is the primary driver. The continued existence of IBM, Oracle, SAP and others proves that most companies aren’t all that motivated to ditch their overpriced, ancient software for superior open-source solutions. It’s only when a business problem simply can’t be solved with their existing technologies that companies go out looking for other options. The lack of resources that know Spark is also helping support SAS in the short term. That gap will not remain for long, as most graduates coming out of school with relevant degrees are learning Spark now. It’s been over a decade since SAS was the language taught in most schools.

For now, SAS will remain relevant until a fully developed ecosystem of tools surrounds Spark, probably still a couple of years off. Enterprises crave platform stability, and SAS, if nothing else, is solid as a rock. Spark, on the other hand, is evolving as fast as lightning.