Posts

A Step-By-Step approach to analyze Data problems (Series 101 - Part 1)

Hello ! In this post, we shall see how to take a systematic approach towards solving data problems. The approach is intended to be as generic as possible, so as to cover the entire super-set of Machine Learning problems. Each of the steps will be covered in depth in upcoming posts. Without much ado, let us grab a cup of coffee and get into the details.

The Problem
Let's say that we wish to predict whether a certain (new) song will be liked by an individual or not.

Data Collection
The first step is to identify a source of data that holds the relevant information and is sufficiently large for us to do a fair analysis. The amount of data that can be termed "sufficient" varies with the use-case and the nature of the data itself. For our use-case, we consider data from Spotify. (I will explain how to retrieve data from Spotify in a later post.)
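Ahead of that later post, here is a minimal, purely illustrative sketch of what the data collection step could look like using the spotipy client library; the client ID, client secret and track ID are placeholders, not values from the actual analysis.

```python
# Minimal sketch: fetching audio features for a track from Spotify via spotipy.
# Credentials and the track ID are placeholders -- substitute your own.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

auth = SpotifyClientCredentials(
    client_id="<your-client-id>",
    client_secret="<your-client-secret>",
)
sp = spotipy.Spotify(auth_manager=auth)

# Audio features (danceability, energy, tempo, ...) for one or more track IDs
track_ids = ["<spotify-track-id>"]
features = sp.audio_features(track_ids)
print(features)
```

Features like these would form the columns of the dataset on which the rest of the analysis is built.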

Column Pivot in PySpark

Hey, welcome back ! This post is about pivoting a column in PySpark. To put it in simpler terms, it is about splitting a column with categorical values into multiple dummy-variable columns (similar to One-Hot Encoding): the column has its distinct values transposed into individual columns. The best part is that this can be done in just one line !

Let's say that we have a DataFrame Classroom with a column Gender, which contains the values Male / Female. We intend to split this column into 2 columns, Gender_Male and Gender_Female (each containing values 1 and 0). This can be achieved by a single pivot call, shown in the sketch below.

Combining with GroupBy
Pivoting also comes in handy when we wish to group by certain columns and then perform a pivot, for example to find the number of boys and girls per group. Let's say that we have a column "Age_Group", which has values 1, 2 and 3, and we wish to know the number of boys and girls in every age group. The sketch below covers this case as well.
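Both snippets are shown in one minimal sketch, assuming a hypothetical DataFrame classroom with Name, Gender and Age_Group columns (the sample rows are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-example").getOrCreate()

# Hypothetical Classroom data with Gender and Age_Group columns
classroom = spark.createDataFrame(
    [("Alice", "Female", 1), ("Bob", "Male", 1), ("Carol", "Female", 2)],
    ["Name", "Gender", "Age_Group"],
)

# Dummy-variable style pivot: one column per distinct Gender value, with 1/0 flags
dummies = classroom.groupBy("Name").pivot("Gender", ["Male", "Female"]).count().na.fill(0)
dummies.show()

# Combining with GroupBy: number of boys and girls per age group
classroom.groupBy("Age_Group").pivot("Gender").count().na.fill(0).show()
```

Renaming the pivoted columns to Gender_Male / Gender_Female, if desired, is a simple withColumnRenamed away.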

Running Linux Commands on Azure Databricks (Using PySpark)

Welcome back ! Today, we will explore the Databricks File Store and techniques for performing certain operations on it. By the end of this tutorial, you will learn how to:
- Run Linux commands on Azure Databricks
- Use Databricks Utilities
- Control Azure Blob storage from a Databricks notebook

Databricks Utilities (DBUtils) provide a way for us to access storage objects and work with them. A specific set of DBUtils commands, called the File system utilities, helps us deal with the Databricks File System, effectively letting us use Databricks as a file system.

Let us take an example: say we want to move a file from one location to another. The command in the sketch below does just that, and the same sketch shows how to explore the other available commands. But, caution ! These commands are no less risky than Linux commands run without fully understanding what they do. Hence, please be careful and review them before firing them off.
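The sketch is meant for an Azure Databricks notebook, where dbutils and display are predefined; the DBFS paths are hypothetical examples.

```python
# Runs inside an Azure Databricks notebook; `dbutils` and `display` are
# provided by the notebook environment. Paths are hypothetical examples.

# Move a file from one DBFS location to another
dbutils.fs.mv("dbfs:/FileStore/raw/songs.csv", "dbfs:/FileStore/processed/songs.csv")

# List the contents of the target directory to confirm the move
display(dbutils.fs.ls("dbfs:/FileStore/processed/"))

# Explore the other file-system utilities that are available
dbutils.fs.help()

# Plain Linux commands can be run in a separate cell with the %sh magic, e.g.
#   %sh
#   ls /dbfs/FileStore/processed/
```

As warned above, double-check source and destination paths before running move or remove commands, since they act on shared storage.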

Reading & Writing a file from / to Azure Blob Storage (Using PySpark)

Welcome back ! This is another PySpark tutorial, wherein we will see how to access files from Azure Blob Storage. Basically, we store large datasets or CSV files in Blob storage so that they can be retrieved in a Spark notebook and processed on the cluster.

First of all, we need to set up the Spark session with the appropriate credentials. The storage account name and key can be found in the Azure portal, under the storage account's Access keys. Then, we load the Blob over WASB (Windows Azure Storage Blob); the multiLine = true option takes care of newlines embedded within a record and returns the correct number of rows. Once loaded, you can display the Blob contents, and you have "blob_data" available as a PySpark DataFrame on which you can go ahead and do any kind of processing. Just as we read a file, we can also write the data back to Blob storage.
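A minimal sketch of the full round trip is given below, with placeholder names for the storage account, container, key and paths; it assumes a Databricks notebook where the spark session is predefined.

```python
# Minimal sketch of reading from and writing to Azure Blob Storage over WASB.
# Assumes a Databricks notebook where `spark` (the SparkSession) is predefined.
# Storage account, container, key and paths below are placeholders.
storage_account = "<storage-account-name>"
container = "<container-name>"
storage_key = "<storage-account-access-key>"

# Point the Spark session at the storage account
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    storage_key,
)

blob_path = f"wasbs://{container}@{storage_account}.blob.core.windows.net/data/input.csv"

# Read the CSV; multiLine handles newlines embedded inside quoted fields
blob_data = (
    spark.read
    .option("header", "true")
    .option("multiLine", "true")
    .csv(blob_path)
)

# Inspect the Blob contents
blob_data.show(5)

# Write the (possibly processed) data back to Blob storage
(
    blob_data.write
    .mode("overwrite")
    .option("header", "true")
    .csv(f"wasbs://{container}@{storage_account}.blob.core.windows.net/output/")
)
```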

Why PySpark is so Awesome !

Hey! If you have heard about Apache Spark, you may also know that it supports a number of programming languages, such as R, Python and Scala. Each of these languages, when combined with Spark, provides a way to perform distributed processing across nodes. But let's get to our intended topic of discussion ! Mirror, mirror, on the wall, who is the fairest among all ? Well, if I were the mirror, I would say that it is PySpark ! (Just kidding.) So what makes PySpark stand out from the rest (SparkR and Scala) ?

1. Strong buildings (PySpark) have strong foundations (Python)
Better readability, a gentle learning curve and greater popularity have led new software developers to view Python as a programmer's paradise. With strong machine learning libraries and general-purpose functionality, it is the ultimate toolbox for exploring and analyzing data.

2. Ease of reading and writing Blob files
Large datasets / CSV files are stored as Blobs on Azure Blob storage, and PySpark makes reading and writing them straightforward.