This blog post serves as an introduction to the WiseWithData core SPROCKET Robot code conversion tool. For more information on the SPROCKET automated migration solution (SPROCKET Search Party, SPROCKET Robot, SPROCKET Validator), please get in touch with us at hello@wisewithdata.com.

What is SPROCKET Robot?

SPROCKET Robot is an integral part of the SPROCKET automated migration solution. It performs high-speed, high-accuracy code conversion, taking in SAS code and outputting optimized and complete PySpark code (i.e. Python code that uses the Python API for Apache Spark). Once our clients have received the output PySpark code from SPROCKET Robot, they receive all the benefits of using both Python and Apache Spark, which include:

  • No licensing fees: Python and Apache Spark are both free and open source, whereas SAS licenses can cost clients thousands or millions of dollars each year.
  • Using open source: Python and Apache Spark offer full transparency into their code bases, whereas SAS is closed source.
  • Having a fully-featured analytics engine: All users of Python and Apache Spark have free access to every feature, whereas SAS users may pay additional fees for more features.
  • Enhanced speed: Python and Apache Spark are incredibly fast; users of Apache Spark can see speed-ups of more than 100x in their business analytics after migrating from SAS.
  • Enhanced scalability: Apache Spark builds on ideas from Apache Hadoop (itself modeled on Google’s MapReduce, a scalable, distributed, fault-tolerant processing framework), while being more efficient and easier to use. Spark takes advantage of distributed data, distributed computing, and functional programming, making it one of the most scalable platforms for analytics.
  • Modernized code base: Innovators are constantly improving Python and Apache Spark, and many of the world’s top companies use Python, including Google, Facebook, Instagram, Spotify, Reddit, Netflix, and many more.
  • Talent availability: Python’s popularity is undisputed. Depending on the application, it rivals C, C++, Java, and JavaScript as the most popular language overall. Python is an obvious choice for an all-in-one language, and it is leading the way in data analytics, machine learning, and AI. This makes it easier to staff your analytics team, because so many programmers know and love Python.

All of these advantages come with the PySpark code that is output by SPROCKET Robot. To perform code conversion, SPROCKET Robot leverages syntactic analysis that is similar to how compilers work. In fact, the SPROCKET Robot tool can be thought of as a source-to-source compiler (transpiler) for SAS code into PySpark code.

The design of SPROCKET Robot enables our customers to leverage the knowledge of our SAS coding experts (some with 20+ years of SAS coding experience) to efficiently migrate their SAS analytics with line-by-line conversion. We are always improving how SPROCKET Robot works to help our clients not only convert their code but also optimize their analytics infrastructure.

How does SPROCKET Robot work?

SPROCKET Robot works in four key steps:

  • Lexing SAS code,
  • Parsing SAS code,
  • Optimizing SAS code functionality, and
  • Translating SAS code into PEP 8-compliant PySpark code.

We outline each of these steps briefly below.

Lexing SAS Code

Lexing, or tokenization, is the process of converting a sequence of characters (i.e. the input SAS code) into a sequence of tokens (i.e. labelled units such as keywords, names, and operators that identify what each piece of the code is). To perform this step, SPROCKET Robot needs rules or expressions that define the syntax of the input language (SAS). The rules that define how a language works are called a grammar; the type of grammar WiseWithData has implemented is called a parsing expression grammar, or PEG.

Leveraging our SAS experts’ programming experience, our team has created a PEG that describes nearly the entire SAS language. With our expertise in SAS syntax loaded into it, SPROCKET Robot performs lexing, matches each line of SAS code to its relevant expression, and identifies the structure and components of that code.
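To make the idea of tokens concrete, here is a minimal sketch of a lexer for a tiny, made-up subset of SAS. It is not SPROCKET Robot's lexer and it is not a PEG: it uses ordinary regular expressions from Python's standard library, and the token names (KEYWORD, NAME, NUMBER, OP, SEMI) and the keyword list are assumptions chosen purely for illustration.

```python
import re

# Hypothetical token rules for a tiny SAS subset. Order matters: when two
# patterns could match at the same position, the earlier one wins.
TOKEN_RULES = [
    ("KEYWORD", r"\b(?:data|set|run|proc|if|then|else)\b"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[=+\-*/<>]"),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES),
    re.IGNORECASE,  # SAS keywords are case-insensitive
)


def tokenize(source):
    """Turn a string of SAS-like code into a list of (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace carries no meaning in this subset
        if kind == "MISMATCH":
            raise SyntaxError(
                f"unexpected character {text!r} at offset {match.start()}"
            )
        tokens.append((kind, text))
    return tokens


tokens = tokenize("data out; set in; total = price + 1; run;")
for token in tokens:
    print(token)
```

The first few tokens printed are ('KEYWORD', 'data'), ('NAME', 'out') and ('SEMI', ';'). A name such as "dataset" is still lexed as a single NAME, because the keyword pattern only matches whole words.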

Parsing SAS Code

Once lexing of the input has been performed, SPROCKET Robot has a sequence of tokens that can be parsed to construct a unique abstract syntax tree for each line of SAS code, much as a compiler does. The PEG parser in SPROCKET Robot is lightweight and uses RAM frugally to perform parsing at high speed. The parser ensures that the input text conforms to the syntax rules of the SAS language and constructs an abstract syntax tree to assist in optimizing the input code and generating PySpark code.
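As a rough illustration of how tokens become an abstract syntax tree, the sketch below parses a single SAS-style assignment statement. It is a hand-written recursive-descent parser that follows PEG-style rules (shown in the comments), not SPROCKET Robot's actual parser or grammar. The grammar, the stand-in tokenizer, and the tuple-based tree shape are all assumptions made for this example.

```python
import re

NAME_RE = r"[A-Za-z_]\w*"
NUMBER_RE = r"\d+(?:\.\d+)?"


def tokenize(text):
    # Minimal stand-in lexer: numbers, names, and single-character symbols.
    # Characters it does not recognize are silently skipped in this sketch.
    return re.findall(rf"{NUMBER_RE}|{NAME_RE}|[=+\-*/();]", text)


class Parser:
    """Recursive-descent parser for a hypothetical grammar:

    statement <- NAME '=' expr ';'
    expr      <- term (('+' / '-') term)*
    term      <- atom (('*' / '/') atom)*
    atom      <- '(' expr ')' / NUMBER / NAME
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, value):
        if self.peek() != value:
            raise SyntaxError(f"expected {value!r}, found {self.peek()!r}")
        self.pos += 1

    def statement(self):
        target = self.peek()
        if target is None or not re.fullmatch(NAME_RE, target):
            raise SyntaxError(f"expected a variable name, found {target!r}")
        self.pos += 1
        self.expect("=")
        value = self.expr()
        self.expect(";")
        return ("assign", target, value)

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())  # left-associative
        return node

    def term(self):
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.atom())
        return node

    def atom(self):
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expr()
            self.expect(")")
            return node
        if tok is not None and re.fullmatch(NUMBER_RE, tok):
            self.pos += 1
            return ("num", float(tok))
        if tok is not None and re.fullmatch(NAME_RE, tok):
            self.pos += 1
            return ("var", tok)
        raise SyntaxError(f"unexpected token {tok!r}")


tree = Parser(tokenize("total = price * (qty + 2);")).statement()
print(tree)
```

The printed tree is ('assign', 'total', ('*', ('var', 'price'), ('+', ('var', 'qty'), ('num', 2.0)))). Input that breaks the rules, such as "total = ;", raises a SyntaxError, which is the "conforms to the syntax rules" check in miniature.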

The abstract syntax tree created in this step helps to create a functional description of the input SAS code. This functional description of what each line of SAS code does is called the Sprocket Description Language (SDL).

Optimizing SAS Code Functionality

Sprocket Description Language (SDL) is a proprietary language invented by WiseWithData. The SDL created for SAS code input into SPROCKET Robot describes the actions that the SAS code performs. Because SAS code sometimes contains complicated multi-line statements, or combinations of PROC steps and DATA steps, the SDL is used to describe what different blocks of code accomplish, and it equips SPROCKET Robot with the information necessary to construct PySpark code.

The process of producing an SDL for SAS code is proprietary to WiseWithData and is constantly improving. The goals of the optimization steps in the SDL include, but are not limited to:

  • Loop optimizations,
  • Data-flow optimizations,
  • Code generator optimizations, and
  • Functional language optimizations.

Future blog posts will describe roughly how the SDL performs these optimizations to not only convert our clients’ code, but also to enhance it.
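Since the SDL itself is proprietary, the sketch below does not show how it works. Instead it illustrates the general flavour of a tree-rewriting optimization using constant folding, a textbook compiler technique often grouped with data-flow optimizations. Whether SPROCKET Robot performs this particular optimization is not stated in this post. The tuple-based tree format is an assumption made for the example.

```python
import operator

# Operators that are safe to evaluate ahead of time in this sketch.
# Division is deliberately left out to avoid folding a divide-by-zero.
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Return a copy of the tree with constant sub-expressions pre-computed."""
    kind = node[0]
    if kind in ("num", "var"):
        return node  # leaves are already as simple as they can be
    if kind == "assign":
        return ("assign", node[1], fold(node[2]))
    # Binary operator: simplify both children first, then try to combine them.
    left, right = fold(node[1]), fold(node[2])
    if kind in FOLDABLE and left[0] == "num" and right[0] == "num":
        return ("num", FOLDABLE[kind](left[1], right[1]))
    return (kind, left, right)


# Hypothetical tree for the statement: total = price * (2 + 3);
before = ("assign", "total", ("*", ("var", "price"), ("+", ("num", 2.0), ("num", 3.0))))
after = fold(before)
print(after)
```

Here the sub-expression 2 + 3 is replaced by the constant 5.0, so the generated code would compute price * 5.0 directly instead of re-adding the constants for every row.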

Translating SAS Code into PEP 8-Compliant PySpark Code

The final thing that SPROCKET Robot does is, of course, produce complete and optimized PySpark code. SPROCKET Robot uses the Sprocket Description Language to understand how the different components of our clients’ SAS code work together. With this understanding, the back end of SPROCKET Robot uses our vast library of code generation mappings, which take in an SDL and produce consistently formatted, PEP 8-compliant code at incredible speed.
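The idea of a code generation mapping can be sketched as a lookup table from an operation name to a small function that emits one line of PySpark source. Everything about the input format below (the "op", "target", "source" and other keys) is hypothetical and is not the real SDL. The emitted code uses only standard PySpark calls (spark.table, DataFrame.filter, DataFrame.withColumn, functions.expr) and assumes that a SparkSession named spark already exists wherever the generated code is run.

```python
# Each generator turns one hypothetical description step into one line of
# PySpark source code. The output is text; nothing is executed on Spark here.
def gen_read(step):
    return '{target} = spark.table("{table}")'.format(**step)


def gen_filter(step):
    return '{target} = {source}.filter("{condition}")'.format(**step)


def gen_derive(step):
    return (
        '{target} = {source}.withColumn("{column}", F.expr("{expression}"))'
    ).format(**step)


# The mapping itself: operation name -> code generator.
GENERATORS = {"read": gen_read, "filter": gen_filter, "derive": gen_derive}


def generate(steps):
    """Produce PySpark source code for a list of description steps."""
    lines = ["from pyspark.sql import functions as F", ""]
    for step in steps:
        if step["op"] not in GENERATORS:
            raise ValueError(f"no code generation mapping for {step['op']!r}")
        lines.append(GENERATORS[step["op"]](step))
    return "\n".join(lines)


# Roughly what a SAS step like
#   data big_sales; set sales; where amount > 100; tax = amount * 0.13; run;
# might be described as (illustrative only).
steps = [
    {"op": "read", "target": "sales", "table": "sales"},
    {"op": "filter", "target": "big_sales", "source": "sales",
     "condition": "amount > 100"},
    {"op": "derive", "target": "big_sales", "source": "big_sales",
     "column": "tax", "expression": "amount * 0.13"},
]
code = generate(steps)
print(code)
```

Keeping each mapping small and independent is one plausible reason such a design can be both fast and consistent: adding support for a new operation means adding one entry to the table, and every line for a given operation is formatted the same way.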

That might seem like a lot of work for a code conversion, but the code base is lightweight and conversions happen at fantastic speeds: 1,000 lines of SAS code can be converted in under 1 second!

What’s next?

After reading all about SPROCKET Robot, you should have a better understanding of what it is and how it performs code conversion. If the idea of how SPROCKET Robot works is still unclear, don’t worry: I will be writing another blog post soon that illustrates each of these steps with a simple example.