SAS Migration FAQ

Why Migrate To PySpark?

Apache Spark, together with PySpark (its Python API), is a fast, general-purpose cluster computing platform. It offers an open source, wide-ranging data processing engine with rich development APIs. Scalable and fault tolerant, it has become the de facto analytics platform in the market, with performance and capabilities that far surpass those of traditional platforms such as SAS and IBM. Apache Spark offers 100x+ greater data processing performance and is open source, hosted at the vendor-independent Apache Software Foundation. Most of the global cloud service providers use Spark to handle big data in their clouds. Simply put, there is no easier migration to the cloud than with Spark.

Performance may have netted Spark an initial following among the big data and analytics crowd, but the ecosystem and interoperability are what continue to drive broader adoption of Spark today. Apache Spark is the most modern unified data science platform available, able to run models and analytics programs hundreds of times faster than legacy systems such as SAS, MapReduce, or R, and it can scale to any size of data.

Who is using PySpark?

Most of the Fortune 100, 500, and 1000 companies, including General Motors, Capital One, Amazon, Google, Microsoft, General Electric, BMW, Uber, Toronto Dominion Bank, HSBC, and Bloomberg, use Spark for its built-in libraries for data access, streaming, data integration, graph processing, and advanced analytics and machine learning. This immensely performant, unified data science platform is being widely adopted globally, with its advantages promoted throughout open source communities and at conferences.

Why not convert to other languages/platforms? Why just PySpark?

For decades, SAS users have tried to migrate to a variety of other technologies. Inevitably, these projects fail or become far more complicated than initially planned. SAS is a very rich language with a broad and unique set of features. Many of the concepts in SAS code can't be easily replicated in other tools such as databases, Python, R, or SPSS. These tools all have severe limitations in flexibility, coding constructs, scalability, or performance.

PySpark is the only platform that meets and exceeds SAS' language capabilities in every respect, and it is the industry-standard analytics platform.

We don't have a Spark cluster or "Big Data", but want to modernize and get off SAS. Can you help?

PySpark is the most scalable analytics platform in existence, scaling to clusters of thousands of machines. While this scalability is impressive, it leads some to believe PySpark is only a "Big Data" technology. The reality is that PySpark is easy to deploy on a single machine or small server. If you already have Python, simply run "pip install pyspark" to install the components needed to run it in local mode. You still gain the benefits of a high-performance parallel engine, even on a laptop.
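
For example, here is a minimal sketch of a local-mode session (the application name and data are illustrative):

    # Install locally, no cluster required:
    #   pip install pyspark

    from pyspark.sql import SparkSession

    # "local[*]" runs Spark on this machine, using all available cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("sas-migration-sandbox")
        .getOrCreate()
    )

    # A tiny illustrative DataFrame; in practice you would read your own data,
    # e.g. spark.read.csv("your_data.csv", header=True, inferSchema=True).
    df = spark.createDataFrame(
        [("A", 10.0), ("B", 12.5), ("A", 7.5)],
        ["group", "value"],
    )
    df.groupBy("group").avg("value").show()

    spark.stop()

The same code runs unchanged on a full cluster; only the master setting changes.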

Is it possible to convert SAS to R?

The R language has gained some support over the years, especially within academic and statistician communities. For this reason, many believe it could be a good replacement for SAS, which also had strong support from those communities. Unfortunately, R was built by and for academic statisticians, making it essentially a very niche domain-specific language (DSL).

Little attention has been paid to proper software engineering practices within the R community, so consistency and maintainability are major issues. R also lacks many of the features and qualities necessary for an enterprise replacement for SAS. The overwhelming trend is to move away from niche DSLs and to leverage the power and generality of Python along with purpose-built libraries such as PySpark, scikit-learn, SciPy, and statsmodels.
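
As a small, hedged illustration of that ecosystem, here is roughly how a linear regression that a SAS user might have run with PROC REG looks in statsmodels (the data and column names are made up for the example):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Illustrative data standing in for a SAS dataset.
    df = pd.DataFrame({
        "y": [2.1, 3.9, 6.2, 7.8, 10.1],
        "x": [1, 2, 3, 4, 5],
    })

    # The R-style formula interface is similar in spirit to
    # "MODEL y = x;" in PROC REG.
    model = smf.ols("y ~ x", data=df).fit()
    print(model.summary())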

We are looking to migrate over a long period of time. Is your service flexible?

Yes. In addition to our SPROCKET migration packages, we also offer an on-demand service that can be used to convert code at any time. Please contact us to find out more.

I have a few hundred lines of complex SAS code. Can you convert this for me?

As much as we would love to assist everyone on their modernization journey, this isn't always feasible. Our solution is designed for enterprise organizations that need to convert large amounts of code quickly and efficiently. Our entry package starts at 5,000 lines of code delivered in 5-10 business days, and we are comfortable working with millions of lines of code. We offer both services that supplement your brute force approach and full conversion services. If you need to convert a few thousand lines of code or more, we're happy to work with you to scope and price your project.

What is a brute force migration?

We define brute force as the manual process of translating proprietary SAS code into PySpark code: humans performing the work by hand. This approach is typically taken by large enterprises hiring or assigning the work to corporate resources, or by large integrators spreading the process across many hired individuals.
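
To make that concrete, here is a sketch of the kind of translation a brute force effort performs by hand, line by line (the dataset and column names are hypothetical):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Hypothetical input standing in for the SAS dataset work.orders.
    orders = spark.createDataFrame(
        [(1, 500.0), (2, 1500.0), (3, 2500.0)],
        ["order_id", "amount"],
    )

    # Original SAS a human translator would start from:
    #   data work.high_value;
    #       set work.orders;
    #       where amount > 1000;
    #       discount = amount * 0.05;
    #   run;
    high_value = (
        orders
        .where(F.col("amount") > 1000)
        .withColumn("discount", F.col("amount") * 0.05)
    )
    high_value.show()

Multiply that effort across thousands of DATA steps, PROCs, and macros, and the cost of the manual approach becomes clear.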

How much faster is automation?

Our automation is easily 90% faster than a brute force approach: machines run conversions 24/7, with the ability to scale to meet demand.

How does automation reduce risk and exposure?

Because automation involves very little human intervention, there is significantly less risk of code and IP being exposed or openly shared. Further, the output is consistent; unlike a manual approach, where every person has a unique way of coding, the end result is uniform, accurate code. Code consistency also benefits those working with the code daily: they learn the format and calls much more quickly, reducing the learning curve and increasing adoption while giving them an opportunity to grow their skills.

We also hear feedback that staff turnover in manual conversion projects runs as high as 30%. Needing to switch gears, bring new people up to speed, and resolve issues left behind by those moving on elongates the migration process. People become bored with manually converting code day after day for months at a time, leading them to seek alternative opportunities or shift focus to other projects.

I've looked at other solutions claiming to be automated, but was disappointed. What differentiates the WWD approach?

We hear that from our customers too. The results of "bake-offs" performed by our customers show us that the code we generate has far better coverage, is more accurate, leverages industry best practices, and works without the need for manual intervention.

With years of experience converting millions of lines of code, we have an unmatched depth of expertise, which is incorporated into our automation. Of course, there is always very complex code; for those cases, our SAS and PySpark experts enhance the final output, improve performance, and integrate it within the customer's unique environment.

We've been told some SAS processes cannot run outside of SAS. What processes do you support?

SAS is a 50-year-old technology with many different components, many of which have been acquired over the years. Of course, we can't convert every conceivable process built in a SAS product into PySpark, but we do have broad support for and experience converting processes from the most popular components, including those listed below (a simple Base SAS example follows the list):

  • Base and Macro
  • Stat
  • ETS
  • Graph
  • Database Access Engines
  • EGuide
  • EMiner Scoring models
  • DI Studio
  • SPDE & SPDS
  • Grid, Share & Connect
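
As a hedged illustration of one common Base SAS pattern, a PROC SQL step maps quite directly onto Spark SQL (the table and column names here are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Hypothetical stand-in for a SAS table referenced by PROC SQL.
    sales = spark.createDataFrame(
        [("east", 100.0), ("west", 250.0), ("east", 75.0)],
        ["region", "revenue"],
    )
    sales.createOrReplaceTempView("sales")

    # SAS:
    #   proc sql;
    #       create table totals as
    #       select region, sum(revenue) as total
    #       from sales
    #       group by region;
    #   quit;
    totals = spark.sql(
        "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region"
    )
    totals.show()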

Out-tasking code conversion has cost us significantly more due to change fees and scope creep. Is this the case with your delivery model?

Often our competition jumps right into the code conversion process without knowledge of the specific environment or the complexities of legacy SAS code. Our business is built on an understanding of both SAS and PySpark, which our customers tell us is a unique value-add. Without a large and diverse human presence in the code conversion process, automation allows us to focus on delivering the output. It also allows us to focus on any uniqueness or challenges that arise and to put our expertise to use.

Our service is priced according to the scope and documented understanding of the task. We are SLA driven, and our goal is to ensure accuracy and deliver identical outcomes on the new platform. Our commitment to our customers is that we do not charge change fees, and we work to ensure you are satisfied with the outcomes.

You can count on WiseWithData to deliver within the committed statement of work, without going beyond the time frame or surprising you with extra costs.

What versions of PySpark do you support? Are there other requirements?

We currently support all PySpark platforms that are based on Apache Spark 3.0+. For the Python component, we support Python 3.7+. Some niche functions require additional Python libraries, all of which are contained in the Anaconda Python distribution.
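
A quick way to confirm an environment meets those requirements (a minimal sketch):

    import sys
    import pyspark

    # Requirements noted above: Python 3.7+ and PySpark built on Spark 3.0+.
    assert sys.version_info >= (3, 7), "Python 3.7 or newer is required"
    assert int(pyspark.__version__.split(".")[0]) >= 3, "Spark 3.0+ is required"
    print("Python", sys.version.split()[0], "| PySpark", pyspark.__version__)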

Still have questions?