What is Koalas?

Koalas is an implementation of the pandas DataFrame API on top of Apache Spark.

Pandas is the go-to Python library for data analysis, while Apache Spark is becoming the go-to tool for big data processing. Koalas lets you combine the simplicity of pandas code with the power of Apache Spark.

Why do we need Koalas?

Koalas makes learning PySpark easier by allowing you to use pandas-like functions within Apache Spark.

Some advantages of this are:

  • Easier to maintain: You only need to maintain a single codebase that works on both pandas and Spark.
  • Simpler code: Many people find pandas code less bulky than PySpark code.
  • Smaller learning curve: Most people are already familiar with pandas code.
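To make the single-codebase point concrete, here is a minimal sketch. The function name `total_by_category`, the column names, and the sample data are all made up for illustration. The function only uses pandas-style calls, so it runs on a plain pandas DataFrame and, assuming Koalas and Spark are set up, should accept a Koalas DataFrame without changes.

```python
import pandas as pd


def total_by_category(df, top_n=2):
    """Sum `amount` per `category` and return the largest totals.

    Only pandas-style calls are used, so `df` can be a pandas DataFrame or,
    assuming Koalas is installed, a Koalas DataFrame.
    """
    totals = df.groupby("category")["amount"].sum()
    return totals.sort_values(ascending=False).head(top_n)


# Hypothetical sample data, for illustration only.
pdf = pd.DataFrame(
    {
        "category": ["a", "b", "a", "c", "b", "a"],
        "amount": [10, 5, 20, 1, 7, 3],
    }
)

# Runs on plain pandas.
result = total_by_category(pdf)
print(result)

# With Koalas and a Spark session available, the same function should work
# on a Koalas DataFrame (not executed here):
#   import databricks.koalas as ks
#   total_by_category(ks.from_pandas(pdf))
```

The equivalent PySpark code would typically need its own groupBy, agg, orderBy and limit calls, which is the kind of duplication a shared pandas-style function avoids.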

How does Koalas work?

Koalas is not part of the Python standard library, so it has to be installed before it can be used. This can be done with conda from the command line:

conda install -c conda-forge koalas

Alternatively, Koalas can be installed with pip:

pip install koalas

Using the Koalas interface is simple. The following code snippet shows how to create a Koalas DataFrame in Python from an existing PySpark DataFrame (sdf) or pandas DataFrame (pdf).

import databricks.koalas as ks

# A Koalas DataFrame (kdf) can be made from a PySpark DataFrame (sdf);
# importing koalas adds a to_koalas() method to PySpark DataFrames
kdf = sdf.to_koalas()

# Or, a Koalas DataFrame (kdf) can be made from a pandas DataFrame (pdf)
kdf = ks.from_pandas(pdf)

The Koalas DataFrame acts as the bridge between pandas and Spark. The link below shows a good example of how to leverage Koalas.

https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
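The bridging role can also be sketched as a round trip between the three DataFrame types. This is only an illustration: the helper name `koalas_round_trip` is made up, and it assumes the koalas package and a working Spark installation are available. As far as I know, the index is not carried over to Spark by default and row order is not guaranteed to be preserved.

```python
def koalas_round_trip(pdf):
    """Send a pandas DataFrame through Koalas and Spark and back to pandas."""
    # Imported lazily: this needs the koalas package and a working Spark setup.
    import databricks.koalas as ks

    kdf = ks.from_pandas(pdf)     # pandas -> Koalas
    sdf = kdf.to_spark()          # Koalas -> PySpark DataFrame
    kdf_again = sdf.to_koalas()   # PySpark -> Koalas (method added by importing koalas)
    return kdf_again.to_pandas()  # Koalas -> pandas (collects the data to the driver)
```

Calling `koalas_round_trip(pdf)` on a small pandas DataFrame should return a pandas DataFrame with the same columns and values. Note that `to_pandas()` pulls all of the data onto a single machine, so it is only sensible for results that fit in memory.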

Future of Koalas

The project is still in development (beta), and about 70% of the pandas API is implemented in Koalas. Given the productivity boost it can bring to a developer's work, I believe it is definitely something to keep an eye on. Below is a link to the official documentation.

https://koalas.readthedocs.io/en/latest/index.html