What an absurd title! A strange twist on the “working hard or hardly working” cliché, but read on and I promise it will all make sense.

Here’s a tale of a court jester and his great invention. Long ago in a vast and complex kingdom, there lived a great king. The king had built a beautiful Lakehouse at the far ends of his land, so he could escape the chaos of life in the castle. Life at the Lakehouse was simple, peaceful, and full of beauty. But it was 700 miles away, a journey of over 10 days by horse and carriage. With all his commitments in the castle, he never really got to enjoy the benefits of the Lakehouse.

The court jester was an odd fellow, but full of big dreams. He overheard the king’s discussion with his advisors about the problem of getting to the Lakehouse. He thought he might be able to invent a solution, which would bring him fame and fortune. Now the jester had never been in a carriage, or made the trek to the Lakehouse, but he was not deterred by his limited knowledge.

He began to toil over the problem day and night, until one evening an idea came to him. He set to work building his invention. When it was finally done, he was overjoyed and ran through the streets screaming, “I’ve done it!”

The next day, he managed to get an audience with the king and showed him his invention: a small toy carriage with a rubber-band engine. He had done it; he had invented an automated carriage. As the king and the court looked on and began to understand what they were seeing, a thunderous roar of laughter filled the room.

Of course, that concept wouldn’t solve the king’s problem. It was indeed a laughable solution, but he did create automation. Without knowing the current technology and the path to the Lakehouse, you simply can’t invent a solution to a problem you don’t understand. In fairy tales anyone can say or do anything, but in the real world you need knowledge and experience to solve complex problems.

Foolishly Automating

With no knowledge of how to do the work manually, how can you automate a solution? You don’t even understand what the destination should look like.

So many times we see companies claiming automated SAS code migration to Informatica, Snowflake, Hive, Java, R, Python Pandas or other tools. None of those tools provide the functionality required to handle the richness of SAS’ capabilities. Like the jester, these folks don’t actually understand the problem they are trying to automate.

If you see someone claiming they have an automated engine that can convert SAS code to R or Python Pandas, run the other way. Those who understand the problem know that R and Python Pandas are constrained not only by memory (which SAS is not), but also by poor performance, limited scalability, and missing features. So many companies have gone down that path, only to realize it’s a dead end.
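To make the memory point concrete, here is a minimal sketch (the column names and tiny toy data are made up purely for illustration). The pandas version must hold the entire dataset in the RAM of a single machine, while the PySpark version expresses the same logic as a lazy plan that Spark distributes across a cluster.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# pandas: everything lives in the memory of one machine; once the real
# dataset outgrows that machine, this approach simply stops working
pdf = pd.DataFrame({'score2': [1.0, 3.0], 'score3': [2.0, 4.0]})
pdf['score1'] = pdf['score2'] + pdf['score3']

# PySpark: the same logic becomes a lazy, distributed plan that scales out
# across a cluster instead of being bound by a single machine's memory
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf[['score2', 'score3']])
sdf = sdf.withColumn('score1', col('score2') + col('score3'))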

Automating Fools

Like the jester, many folks just don’t have the skills or experience to do the job manually, let alone automate it. WiseWithData is often called in to clean up the mess left behind by others who claimed to be using automated tools. Customers have even shared with us examples of just how much of a mess the output is. From those examples, it’s clear not only that the designers of these tools know little about SAS, but that they know even less about PySpark. While we can’t share those customer examples, we can dissect some publicly available content from one such tool. Here’s a very simple example that demonstrates the many issues.

Let’s look at the simplest of datasteps: a single line of business logic.

data scoredata0tmp;
    set SID.demo_score_data_miss;
    score1 = score2 + score3;
run;

Now, if you ignore the SID library reference (any resemblance to actual persons is purely coincidental), the most amateur of PySpark developers will end up with something like this.

scoredata0tmp = demo_score_data_miss.withColumn('score1', col('score2') + col('score3'))

Pretty simple, but maybe you want a style that’s more in line with SAS and that scales to more complex datasteps.

scoredata0tmp = (
    demo_score_data_miss
    .withColumn('score1', col('score2') + col('score3'))
)

Nice! 4 lines of SAS code, 4 lines of PySpark code.
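For readers who want to run those snippets end to end, a minimal self-contained version might look like the following. The SparkSession setup and the table name are assumptions for illustration (in practice the SID library would map to wherever the data actually lives), and the snippets above also assume the usual import of col from pyspark.sql.functions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# assumes the SID.demo_score_data_miss dataset has already been landed as a table
demo_score_data_miss = spark.table('sid.demo_score_data_miss')

# the one line of business logic: add two columns together
scoredata0tmp = (
    demo_score_data_miss
    .withColumn('score1', col('score2') + col('score3'))
)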

Now let’s look at the automated output from the competing tool. One would expect that it looks similar to what both amateurs and experts would write. Note: This code was captured from a screenshot, and we weren’t able to see the last part of the code block.

# ------------Passed Begin Datastep Block: 9, SAS Lines: 14 to 17 ------------
class DataStep1():
    """
    This class encapsulates the Data Step related to ScoreData0tmp
    """
    # runs the code
    @staticmethod
    def run(pyspark):
        """
        run function executes the logic in this block
        """
        pyspark.START_BLOCK('Datastep1', 1)
        pyspark.get(globals())
        
        SID_demo_score_data_miss = pyspark.localFile(filePath=f'{demo_score_data_miss}')
        #pyspark.showDF(SID_demo_score_data_miss)
        SID_demo_score_data_miss = SID_demo_score_data_miss.reparttion()
        pyspark._data_['SID_demo_score_data_miss'] = SID_demo_score_miss
        

        def scoredata0tmp_func(dfSchema, row=None):
            """
            Function whch accepts arguments as listed below to
            and returns as specified below
            Args:
                dfSchema: dfSchema is ...
                row: row=None is ...
            Returns:
                returns ...
            """
            row['score1'] = row['score2'] + row['score3']
            return row


DataStep1.run(pyspark)

 

Clearly this code doesn’t look at all like the previous code. There’s so much to unpack here that it’s almost overwhelming, but let’s try to break it down. Beyond the obvious style issues, what’s most striking is the sheer size and complexity: 4 lines of SAS code become at least 35 lines (roughly 9x more code) of opaque “pyspark” code, if you can even call it that. Needing 35 lines of code to add two columns together, and I thought MapReduce was verbose! This is as simple a datastep as you can get, and based on customer feedback about this vendor’s tool, even this likely doesn’t produce accurate results. What if you had merges, macro variables and macro logic, call routines, complex formulas and functions? Imagine a 1,000+ line datastep, something we frequently encounter. As Mr. T put it so eloquently, I pity the fool who has to debug that mess. How could this scale to the millions of lines of code we convert annually, or even just a few thousand lines? Bringing this back to our jester’s story, it’s clear this is a rubber-band solution.
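And to be clear, hand-written PySpark stays compact even for those more involved constructs. As a rough illustration (the table names, column names, and toy data below are made up, and this covers the common one-to-many match-merge case rather than every subtlety of SAS merge semantics), a sorted SAS merge that keeps only rows present in both tables typically collapses to a single join:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1, 'Ada'), (2, 'Bob')], ['customer_id', 'name'])
orders = spark.createDataFrame([(1, 100.0), (1, 250.0)], ['customer_id', 'amount'])

# SAS:  data combined; merge customers(in=a) orders(in=b); by customer_id; if a and b; run;
# For the common one-to-many case, this match-merge is just an inner join on the BY variable.
combined = customers.join(orders, on='customer_id', how='inner')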

Structurally, this is some of the strangest PySpark code ever seen. Even skilled experts would take some time to wrap their heads around it. The use of OOP concepts like classes here is unnecessary and defies all logic and reason. SAS is a procedural and functional language (except for SCL), and most SAS developers have little to no experience with object-oriented concepts like classes. Who is ultimately going to own the converted code? The analysts and SAS developers, of course. Indeed, even modern data scientists and data engineers are much more at home with procedural and functional programming concepts. Classes are simply the wrong tool for the job here. Why stray from the simple DataFrame API at all, unless you just don’t know any better? Both the style and the substance of this code make it completely unreadable and unmaintainable.

While it’s quite challenging to even understand this as PySpark, it’s obvious this is RDD-based code, not standard DataFrame code. In PySpark 101 courses, the first thing they teach you is that the RDD API is a legacy, low-level interface that should only be used when absolutely necessary. Our jester doesn’t seem to have even a basic grasp of PySpark. When you drop down to RDDs, all of the performance and scalability advances Spark has made over the past 7 years, the reason for its broad success and adoption, are completely thrown away. Whole-stage code generation, column pruning, dynamic partition pruning (DPP), adaptive query execution (AQE), limit/filter/aggregate push-down, even the simple APIs: all of it gone! Not to mention the additional overhead of sending each and every row to Python one by one. RDD-based code can easily be 100x slower. Poor performance and efficiency are a nuisance and indirectly expensive on-prem, but in the cloud they translate directly into cost. Our jester’s code could cost 100x more to execute, and that’s no laughing matter.
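To see where that overhead comes from, here is a hedged side-by-side sketch on a tiny made-up DataFrame. The row-at-a-time version ships every row out to a Python worker process and back, while the DataFrame expression stays inside Spark’s optimized engine.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['score2', 'score3'])

# RDD-style, row at a time: each row is serialized to a Python process,
# touched by a Python function, and serialized back; no codegen, no pruning
slow = df.rdd.map(lambda row: row['score2'] + row['score3'])

# DataFrame-style: a single column expression that Catalyst compiles into
# whole-stage generated code running inside the JVM
fast = df.withColumn('score1', col('score2') + col('score3'))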

While the unnecessary use of RDDs is a huge issue, something even more troubling shows up on line 17. The second thing they teach you about Spark, and indeed about all distributed systems, is that you want to avoid shuffling the data. Not only does that repartition() cause a shuffle, it also breaks up Spark’s codegen pipeline, and it is completely unnecessary given that this is purely a map-side transformation. While Spark’s lazy execution engine is very clever at avoiding shuffles, if the user explicitly calls a repartition transformation, Spark often has no choice but to obey.
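You can see the damage directly in the query plan. In this minimal sketch (the same made-up two-column DataFrame as above), the plain column expression compiles to a single map-side stage, while adding repartition() injects an Exchange, Spark’s term for a full shuffle of the data across the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['score2', 'score3'])

# no shuffle: the plan is a single projection, eligible for whole-stage codegen
df.withColumn('score1', col('score2') + col('score3')).explain()

# forced shuffle: an Exchange now appears in the plan before the projection,
# moving every row across the network for no benefit whatsoever
df.repartition(200).withColumn('score1', col('score2') + col('score3')).explain()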

If every single converted datastep triggers a repartition, that will bring even the most tuned Spark cluster to its knees. Imagine even scaling this out to just 100 concurrent processes, and all of them constantly repartitioning, yikes! Certainly this tool only works in fairy tales.

Foolishly automating or automating fools. One is building automation with a complete lack of strategic understanding (e.g., no knowledge of the source or target architecture or capabilities). The other is building automation using resources that have a fundamental lack of understanding of how to tactically solve the problem (even manually). In this simplest of examples, we see both at work. Real SAS code is of course far, far more complex than this example, and it’s crystal clear there’s no way this solution works for complex code.

What does WiseWithData do differently? For starters, we know SAS; in fact, we know it better than almost anybody on the planet, so we understand the problem. We also know PySpark better than almost anyone else, with the possible exception of our close partner and Lakehouse expert, Databricks. With over 7 years of focused experience doing SAS migrations, our automation is orders of magnitude more advanced. When we built our automation, we started by having our experts produce the right code manually, and then developed the automation to match that result.

With our expertise, we use the right strategy and the right tactics to deliver a great, scalable solution that gets you to the Lakehouse quickly and efficiently. Find out why SPROCKET is the only path to the Lakehouse that is Fast, Simple and Accurate. hello@wisewithdata.com