It’s no secret that many large global organizations are choosing open source Apache Spark to help them handle big data. It’s also well established that data scientists trust Python for data analysis, machine learning and many other big data tasks. So it stands to reason that combining Spark and Python would rock the world of big data, right?
The Apache Spark community came up with a tool called PySpark, a Python API for Apache Spark. Being able to use all the key features of Python within the Spark framework, while also working with Spark’s building blocks and operations from Python, is truly a gift from the Apache Spark community.
With most organizations looking to adopt Apache Spark, moving off of SAS becomes the challenge.
What do you need to know about moving to PySpark from SAS?
First, you need to understand where your existing SAS analytics processes are running across your organization. This will allow you to plan your migration to PySpark accordingly. A catalogue of your SAS processes is essential input to developing a migration strategy and implementation plan, including a project timeline. Categorizing your SAS processes by complexity, size (i.e. lines of code per process) and run frequency will help prioritize what to convert first, enabling a focus on the processes core to your business operations. Further, gaining access to your SAS process inventory is key to understanding the level of effort required to convert to PySpark, in terms of both human and financial capital. WiseWithData has a tool called SPROCKET SearchParty that completes this critical step in your SAS to PySpark migration process.
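To make the prioritization step concrete, here is a minimal sketch of ranking a process inventory by the criteria above. The inventory entries, field names and scoring weights are all illustrative assumptions, not the output of any particular tool.

```python
# Hypothetical SAS process inventory; names and fields are illustrative.
processes = [
    {"name": "monthly_risk_report", "lines": 12000, "runs_per_month": 1,  "core": True},
    {"name": "daily_sales_load",    "lines": 800,   "runs_per_month": 22, "core": True},
    {"name": "adhoc_campaign",      "lines": 300,   "runs_per_month": 2,  "core": False},
]

def priority(p):
    # Favor business-core processes first, then high run frequency,
    # then smaller code bases (quicker wins). Weights are placeholders;
    # a real migration plan would tune these to the organization.
    return (p["core"], p["runs_per_month"], -p["lines"])

# Highest-priority candidates for conversion come first.
migration_order = [p["name"] for p in sorted(processes, key=priority, reverse=True)]
```

Sorting on a tuple like this keeps the criteria explicit and easy to reorder as the migration strategy evolves.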
Second, you are going to need to convert the code itself from SAS to PySpark. Doing so in a way that ensures consistency in data formats and data set mapping between SAS and PySpark is especially challenging with a ‘brute force’ manual approach, because each programmer will apply their own technique to code conversion. This is why WiseWithData recommends automating the SAS to PySpark code conversion, and our SPROCKET Robot does so quite efficiently.
Third, validating that the SAS code has been correctly converted to PySpark is key from a unit test, user acceptance and quality assurance perspective. This requires row-by-row, column-by-column verification, with visibility into both the SAS and PySpark outputs in test environments. Keeping track of the code validation process is key to ensuring anomalies are handled and fixes are implemented. WiseWithData created SPROCKET Validator to handle this final yet critical step in converting from SAS to PySpark.