Why PySpark is so Awesome!

Hey!

If you have heard about Apache Spark, you may also know that it supports several programming languages, such as R, Python, and Scala. Each of these languages, when combined with Spark, provides a way to perform distributed processing across the nodes of a cluster.

But let's get to our intended topic of discussion! Mirror, mirror, on the wall, who is the fairest of them all? Well, if I were the mirror, I would say that it is PySpark! (Just kidding)

So what makes PySpark stand out from the rest (SparkR and Scala)?

1. Strong buildings (PySpark) have strong foundations (Python)


Better readability, a gentle learning curve, and wide popularity have made Python a programmer's paradise in the eyes of new software developers. With strong machine learning libraries and general-purpose functionality, it is the ultimate toolbox for exploring and analyzing data.

2. Ease of reading and writing Blob files


Large datasets and CSV files are often stored as blobs in Azure Blob storage. PySpark makes it convenient to read these files and write results back to Blob storage with a few simple commands (in contrast to SparkR, which doesn't yet provide a direct way to read Blob files).
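To make this a bit more concrete, here is a minimal sketch of reading and writing a CSV file on Blob storage. It assumes the cluster already has the Azure storage connector available (as on Azure Databricks) and that you authenticate with a storage account key. The helper names (blob_url, read_csv_from_blob, write_csv_to_blob) and the container, account, and path values are hypothetical, made up for illustration.

```python
def blob_url(container, account, path):
    """Build a wasbs:// URL for a file in Azure Blob storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def read_csv_from_blob(spark, container, account, account_key, path):
    """Read a CSV blob into a Spark DataFrame (assumes account-key auth)."""
    # Register the storage account key with the Spark session.
    spark.conf.set(
        f"fs.azure.account.key.{account}.blob.core.windows.net", account_key
    )
    # header=True uses the first row as column names;
    # inferSchema=True lets Spark guess the column types.
    return spark.read.csv(
        blob_url(container, account, path), header=True, inferSchema=True
    )


def write_csv_to_blob(df, container, account, path):
    """Write a Spark DataFrame back to Blob storage as CSV."""
    # Spark writes a folder of part files at this location, not a single file.
    df.write.mode("overwrite").csv(blob_url(container, account, path), header=True)
```

In a notebook where a `spark` session already exists, something like `read_csv_from_blob(spark, "data", "myaccount", key, "raw/sales.csv")` would return a regular Spark DataFrame, which can later be passed to `write_csv_to_blob`.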

3. A relief for Python programmers: the toPandas() method


The toPandas() method, which converts a PySpark DataFrame to a pandas DataFrame, is indeed a blessing for Python programmers. Once the pandas DataFrame is obtained, regular Python methods and transformations can be applied to it. Keep in mind that toPandas() collects all the data onto the driver, so it works best on results that fit in memory.

I wouldn't deny that SparkR offers a similar way to convert a Spark DataFrame into an R data frame (for example, with collect()).
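Here is a small sketch of how the conversion might look in practice. The helper to_local_pandas and the demo function are hypothetical names of my own, and the row cap is just a cautious default I picked, not something PySpark requires. The demo assumes PySpark and pandas are installed and runs Spark in local mode.

```python
def to_local_pandas(spark_df, max_rows=10_000):
    """Collect at most max_rows rows of a Spark DataFrame into pandas."""
    # toPandas() pulls the rows onto the driver, so cap the size first.
    return spark_df.limit(max_rows).toPandas()


def demo():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("toPandas-demo")
        .getOrCreate()
    )
    sdf = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    pdf = to_local_pandas(sdf)
    # From here on, ordinary pandas operations apply.
    print(pdf.groupby("key")["value"].sum())

    spark.stop()
```

Calling demo() builds a tiny Spark DataFrame, brings it over to pandas, and runs a plain pandas groupby on it.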

4. Firm support for Data Science


The abundance of data science libraries that Python supports outweighs what is available for Scala (which has comparatively limited support for ML and data science). In notebook environments, even R code can be run alongside PySpark (using magic commands), giving it the power to leverage R's functionality.


So that's all for today! Hope you enjoyed it! 😁

Stay tuned for more articles!


