Frequently Asked Questions
Your most common questions answered by our experts.
Analytics Platform Questions
Why Modernize To PySpark?
Apache Spark is a fast, general-purpose cluster computing platform, and PySpark is its Python API. It offers an open source, wide-ranging data processing engine with expressive development APIs. Scalable and fault tolerant, it has become the de facto analytics platform in the market, with performance and capabilities that far surpass those of traditional platforms (SAS, IBM, etc.). Apache Spark offers 100x+ greater data processing performance and is open source, hosted at the vendor-independent Apache Software Foundation. Most of the global cloud service providers use Spark to handle big data in their clouds. Simply put, there is no easier migration to the cloud than with Spark.
Performance may have netted Spark an initial following among the big data and analytics crowd, but the ecosystem and interoperability are what continue to drive broader adoption of Spark today. Apache Spark is the most modern unified data science platform available, with the ability to run models and analytics programs hundreds of times faster than legacy systems like SAS, MapReduce, or R, and it can scale to any size of data.
Who is using PySpark?
Most of the Fortune 100, 500 and 1000 companies, including General Motors, Capital One, Amazon, Google, Microsoft, General Electric, BMW, Uber, Toronto Dominion Bank, HSBC, Bloomberg and others, use Spark for its built-in libraries for data access, streaming, data integration, graph processing, advanced analytics and machine learning. This immensely performant, unified data science platform is being widely adopted globally, and its advantages are promoted throughout open source chat rooms and conferences.
Why not convert to other languages / platforms? Why just PySpark?
For decades, SAS users have tried to migrate to a variety of other technologies. Inevitably these projects fail or become far more complicated than initially planned. SAS is a very rich language, with a broad and unique set of features. Many of the concepts in SAS code can’t be easily replicated in other tools such as databases, Python, R, SPSS, etc. These tools all have severe limitations in terms of flexibility, coding constructs, scalability or performance.
PySpark is the only platform that meets and exceeds SAS’ language capabilities in every respect, and is the industry standard analytics platform.
We’ve heard that PySpark is much faster than legacy SAS. Just how much faster is it?
While benchmarks will differ, PySpark truly shines when operating on larger datasets. Its parallel-first engine is far more efficient than legacy SAS’ single-threaded, single-machine model. We’ve had some processes run over 1,000x faster on similar hardware, but 100x is more typical. On very small datasets, the performance advantage diminishes due to the additional overhead of Spark’s parallelization and optimization algorithms.
We don’t have a Spark cluster or “Big Data”, but want to modernize and get off SAS. Can you help?
PySpark is the most scalable analytics platform in existence, scaling to clusters of thousands of machines. While this scalability is awesome, some believe PySpark is only a “Big Data” technology. The reality is that PySpark is easy to deploy on a single machine or small server. If you already have Python, simply run “pip install pyspark” to install the components needed to run it in local mode. You still gain the benefits of a high performance parallel engine, even on a laptop.
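As a rough illustration of local mode, the sketch below builds a Spark session that uses only the cores of a single machine. The helper names `local_master_url` and `build_local_session` are hypothetical, chosen for this example; the `local[N]` and `local[*]` master URLs and `SparkSession.builder` are standard Spark. It assumes `pyspark` has been installed with pip and that a Java runtime is available, which Spark needs even in local mode.

```python
def local_master_url(cores=None):
    """Build a Spark master URL for local mode.

    "local[*]" uses all available cores; "local[N]" uses N worker threads.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def build_local_session(app_name="local-demo", cores=None):
    """Create (or reuse) a SparkSession that runs on this machine only."""
    # Imported here so the helper above works even before pyspark is installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))
        .appName(app_name)
        .getOrCreate()
    )
```

With that in place, calling `spark = build_local_session()` and then `spark.range(5).show()` should display a small DataFrame with no cluster involved.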
We have teams that use Python Pandas, do you convert to that?
To maximize performance and scalability, we only convert to native PySpark DataFrame code. The pandas API is also supported on PySpark, so your existing pandas-based data teams will feel right at home.
Is it possible to convert SAS to R?
The R language has gained some support over the years, especially within academic and statistician communities. For this reason, many believe it could be a good replacement for SAS, which also had strong support from those communities. Unfortunately, R was built by and for academic statisticians, essentially becoming a very niche domain-specific language (DSL).
Experts agree that little attention has been paid to proper software engineering practices within the R community, and so consistency and maintainability are major issues. R also lacks many of the features and qualities necessary for an enterprise replacement for SAS. The overwhelming trend is to move away from niche DSLs and to leverage the power and generality of Python along with purpose-built libraries such as PySpark, scikit-learn, SciPy, statsmodels, etc.
What products support PySpark?
PySpark is the most widely deployed and supported analytics platform on the planet. It’s supported on all major cloud platforms and many on-premise products as well.
The list includes these popular products:
SQL Server (2019+)
Azure (Synapse, HDInsight, Data Factory)
Google (Dataproc)
Python 3 (via pip install pyspark)
Jupyter / JupyterHub Notebook
What versions of PySpark do you support? Are there other requirements?
Migration Process Questions
What is a brute force migration?
We define brute force as the manual process of translating proprietary SAS code into PySpark code, with humans performing the work by hand. This approach is typically taken by large enterprises that hire for the work or assign it to corporate resources, or by large integrators that spread the process across many hired individuals.
How much faster is automation?
Our automation is easily 90% faster than a brute force approach. We’ve already solved many of the hard issues that people encounter when trying to convert code in a brute force way.
Anecdotes from our customers put the speed of manual conversion at ~250 ± 250 lines per day for a pair of programmers. That’s right, some days 0 lines will be produced due to complex technical challenges. Our analysts using automation are typically producing over 10,000 lines of code per day.
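To put those rates in perspective, here is a back-of-the-envelope calculation. The 100,000-line code base is an assumed figure for illustration only; the daily rates are the ones quoted above (using the 250-line midpoint for manual work), and real projects will vary.

```python
import math


def days_to_convert(total_lines, lines_per_day):
    """Whole working days needed at a steady daily rate."""
    return math.ceil(total_lines / lines_per_day)


TOTAL_LINES = 100_000    # assumed project size, for illustration only
MANUAL_RATE = 250        # lines/day for a pair of programmers (midpoint quoted above)
AUTOMATED_RATE = 10_000  # lines/day for an analyst using automation (quoted above)

manual_days = days_to_convert(TOTAL_LINES, MANUAL_RATE)        # 400
automated_days = days_to_convert(TOTAL_LINES, AUTOMATED_RATE)  # 10

print(f"Manual: {manual_days} days, automated: {automated_days} days")
```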
How does automation reduce risk and exposure?
Because automation involves very little human intervention, there is significantly less risk of code and IP being exposed or openly shared. Further, the output is consistent and accurate, unlike a manual approach where every person has a unique way of coding. Code consistency also benefits those who work with the code daily: they learn the format and calls much more quickly, which reduces the learning curve and increases adoption while giving them an opportunity to grow their skills.
We also hear feedback that staff turnover in the manual conversion process is as high as 30%. Needing to switch gears, bring new people up to speed, and resolve issues left behind by those moving on prolongs the migration process. People become bored with manually converting code daily for months at a time, leading them to seek other opportunities or shift their focus to other projects.
I’ve looked at other solutions claiming to be automated, but was disappointed. What differentiates the WiseWithData approach?
We hear that from our customers too. The results of “bake-offs” performed by our customers indicate that the code we generate has far better coverage, is more accurate, leverages industry best practices, and works without the need for manual intervention.
With years of experience converting millions of lines of code, we have an unmatched depth of expertise, which is incorporated into our automation. Of course, there is always some very complex code, and for that we bring in our SAS and PySpark experts to enhance the final output, improve performance, and integrate it within the customer’s unique environment.
We’ve been told some SAS processes cannot run outside of SAS. What processes do you support?
SAS is a 50-year-old technology with many different components, many of which were acquired over the years. Of course we can’t convert every conceivable process built in a SAS product into PySpark, but we do have broad support and experience converting processes from the most popular components, including:
Base and Macro
Stat & ETS
Database Access Engines
EMiner Scoring models
SPDE & SPDS
Grid, Share & Connect
We are looking to migrate over a long period of time. Is your service flexible?
Yes. In addition to our SPROCKET migration packages, we also offer an on-demand service that can be used to convert code at any time. Please contact us to find out more.
Out-tasking code conversion has cost us significantly more due to change fees and scope creep. Is this also the case with your delivery model?
Our business is built on an understanding of both SAS and PySpark, which our customers tell us is a unique value add. Because automation removes the need for a large and diverse human presence in the code conversion process, we can focus on delivering the output and on applying our expertise to any challenges that arise.
Our service is priced per the scope and documented understanding of the task. We are SLA driven and our goal is to ensure accuracy and deliver identical outcomes in the new platform. Our commitment to our customers is that we do not charge change fees, and work to ensure you are satisfied with the outcomes.
You can count on WiseWithData to deliver within the committed statement of work, without exceeding the time frame or surprising you with extra costs.
I have a few hundred lines of complex SAS code. Can you convert this for me?
As much as we would love to assist everyone on their modernization journey, this isn’t always feasible. Our solution is designed to work for enterprise organizations to convert large amounts of code quickly and efficiently.
Our entry package starts at 5,000 lines of code delivered in 5-10 business days, and we are comfortable working with millions of lines.
We offer both services that supplement your brute force approach and full conversion services. If you need to convert a few thousand lines of code or more, we’re happy to work with you to scope and price your project.
SPROCKET Runtime Questions
What is SPROCKET Runtime?
We built the SPROCKET Runtime in order to bring the most important features of the SAS language into PySpark. While we strive to generate native PySpark code, without these features, a significant amount of migrated code would be far more complex and thus harder to understand and use.
There are some advanced SAS language features which have no parallel whatsoever and would be almost impossible to convert into plain PySpark. The most notable example is datasteps that involve complex iterations, arrays, retained columns, and BY-group processing, which we convert using RIPL.
We use Databricks and EMR. Does the SPROCKET Runtime support these platforms?
Can We Build New Processes On Top of SPROCKET Runtime?
Yes, of course! The feedback from our customers is that the existing SAS developers love using the SPROCKET Runtime.
A customer described to us what developing using the SPROCKET Runtime means to them:
“Like seeing an old friend that’s got everything together now. All the things I liked about the SAS language without the legacy baggage.”
Do you support the SPROCKET Runtime with feature and maintenance releases?
SPROCKET Runtime includes custom format support. Why would I want to use this?
Custom format support is a powerful feature of the SAS language. A single construct gives you in-memory mapping of value-to-value as well as range-to-value. It also supports default values and ±infinity in the ranges.
There is no equivalent to custom formats in PySpark, though you can approximate some of their features using complex join conditions. We believe this language element in SAS is so powerful that it’s worth bringing into the modern world of PySpark (and many of our customers agree).
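For readers who have not used SAS custom formats, the idea can be sketched in plain Python. This is only an illustration of the concept (exact values, ranges, a default, and open-ended ranges using infinity); it is not the SPROCKET Runtime API, and the `RangeFormat` class and the age bands are made up for the example.

```python
import math


class RangeFormat:
    """Toy in-memory lookup: exact values first, then ranges, then a default."""

    def __init__(self, values=None, ranges=None, default=None):
        self.values = dict(values or {})
        # Each range is (low, high, label): inclusive of low, exclusive of high.
        self.ranges = list(ranges or [])
        self.default = default

    def apply(self, x):
        if x is None:
            return self.default
        if x in self.values:            # value-to-value mapping
            return self.values[x]
        for low, high, label in self.ranges:
            if low <= x < high:         # range-to-value mapping
                return label
        return self.default


age_band = RangeFormat(
    values={-1: "Unknown"},
    ranges=[
        (-math.inf, 18, "Minor"),       # open-ended lower bound
        (18, 65, "Adult"),
        (65, math.inf, "Senior"),       # open-ended upper bound
    ],
    default="Invalid",
)

labels = [age_band.apply(a) for a in (-1, 7, 18, 64.5, 90, None)]
print(labels)
```

Note that the exact-value entry for -1 takes precedence over the range that would otherwise contain it, which is the design choice made in this sketch.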
What is RIPL?
RIPL, which stands for Row Iterative Processing Language, is an add-on PySpark API that is part of the SPROCKET Runtime. It brings an entirely new data processing language into PySpark, one that closely mirrors features of the SAS Datastep language but uses Python.
It’s best used for large and complex data pipelines where you want to perform calculations that depend on the results of previous rows, or where you want to iterate through many computations for each row of data.
How does RIPL’s performance compare with PySpark DataFrame operations?
RIPL and DataFrame operations are both extremely efficient and scalable. In other words, you can tune them to go as fast as required.
Each API has its strengths and weaknesses. DataFrame operations are best for simple operations on massive data, since the optimizer will plan out the best way to perform the work. They are not as good with complex operations on smaller data due to optimization overhead.
RIPL by contrast excels when you are doing very complex operations on smaller data. With RIPL you are in control, letting you optimize the data processing pipeline. Just like in the SAS datastep language, the more you do in a single RIPL, the more efficient your code will be.
Our users are big fans of SAS Datasteps. When should we use RIPL vs DataFrames?
You should think of RIPL as bringing additional features to the DataFrame concept, not as a replacement for DataFrames.
Any time you have calculations that depend on the results of other rows’ calculations, you should consider using RIPL. While the DataFrame API has some capabilities for inter-row dependencies, namely the Window API, those capabilities are limited. You can only have static references to other rows, and the relationship to the desired rows must also be static (e.g. the previous 5 rows). If those values or the relationship change dynamically within the same process, Window will not work.
RIPL is also well suited to processing where, for each row, you iterate through a series of business logic steps inside do loops, potentially using arrays. Such processes often show up when handling messy or dirty data, for example in the Health and Life Sciences domain.
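As a plain-Python illustration of this kind of dynamic dependency (not RIPL syntax, which is not shown here), consider a running total that resets whenever it reaches a limit. Whether a row triggers a reset depends on a value computed from earlier rows, so the set of rows feeding each result is not a fixed offset. The function name, the limit, and the sample amounts are all hypothetical.

```python
def running_total_with_reset(amounts, limit):
    """Yield (amount, running_total, reset_flag) for each input row.

    The total is retained from row to row and resets to zero after any row
    where it reaches the limit, so each row depends on computed history.
    """
    total = 0
    for amount in amounts:
        total += amount
        reset = total >= limit
        yield amount, total, reset
        if reset:
            total = 0  # the next row starts a new accumulation


rows = list(running_total_with_reset([40, 30, 50, 10, 95, 5], limit=100))
for row in rows:
    print(row)
```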
Lastly, SAS datasteps support arbitrary output of rows anywhere. This flexibility is simply not possible with the DataFrame API. If you use datasteps to “fill in” rows of missing data, or dynamically generate rows for some other purpose, using RIPL is a great way to accomplish that workflow.
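The “fill in” case can be sketched in plain Python as a generator that emits extra rows between the ones it reads. This only illustrates the idea and is not the RIPL API; the function name and the monthly sample data are assumptions made for the example.

```python
def fill_missing_months(rows):
    """Yield (month, value) rows, inserting carried-forward rows for gaps.

    `rows` is an iterable of (month_number, value) sorted by month_number.
    """
    prev_month, prev_value = None, None
    for month, value in rows:
        if prev_month is not None:
            # Emit one extra row per missing month, carrying the last value forward.
            for missing in range(prev_month + 1, month):
                yield missing, prev_value
        yield month, value
        prev_month, prev_value = month, value


observed = [(1, 10.0), (2, 12.5), (5, 9.0), (6, 11.0)]
filled = list(fill_missing_months(observed))
print(filled)
```

Here four input rows produce six output rows, because months 3 and 4 are generated from the last observed value.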
Does RIPL support custom formats?
Of course. We’ve converted SAS code from our customers that uses RIPL’s features in conjunction with custom formats and informats.
Our custom format implementation, both in DataFrames and in RIPL, is highly scalable. We’ve tested custom format value sets with many millions of records.
Didn’t find the answer you are looking for?
We’d love to hear from you