Posts

A Step-By-Step approach to analyze Data problems (Series 101 - Part 1)

Hello ! In this post, we shall see how to take a systematic approach towards solving data problems. The approach is intended to be as generic as possible, so as to cover the entire super-set of Machine Learning problems. Each of the steps will be covered in depth in upcoming posts. Without much ado, let us grab a cup of coffee and get into the details.

The Problem
Let's say that we wish to predict whether a certain (new) song will be liked by an individual or not.

Data Collection
The first step is to identify a source of data that holds the relevant information and is sufficiently large for us to do a fair analysis. The amount of data that can be termed "sufficient" varies with the use-case and the nature of the data itself. For our use-case, we consider data from Spotify. (I will explain how to retrieve data from Spotify in a later post.)
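Ahead of that later post, here is a minimal, purely illustrative sketch of what the data collection step could look like using the spotipy client library; the client ID, client secret and track ID are placeholders, not values from the actual analysis.

```python
# Minimal sketch: fetching audio features for a track from Spotify via spotipy.
# Credentials and the track ID are placeholders -- substitute your own.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

auth = SpotifyClientCredentials(
    client_id="<your-client-id>",
    client_secret="<your-client-secret>",
)
sp = spotipy.Spotify(auth_manager=auth)

# Audio features (danceability, energy, tempo, ...) for one or more track IDs
track_ids = ["<spotify-track-id>"]
features = sp.audio_features(track_ids)
print(features)
```

Features like these would form the columns of the dataset on which the rest of the analysis is built.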

Column Pivot in PySpark

Hey, welcome back ! This post is about pivoting a column in PySpark. To put it in simpler terms, it is about splitting a column with categorical values into multiple dummy-variable columns (similar to One-Hot Encoding): the column has its distinct values transposed into individual columns. The best part is that this can be done in just one line !

Let's say that we have a DataFrame Classroom with a column Gender, which contains the values Male / Female. We intend to split this column into 2 columns, Gender_Male and Gender_Female (each containing values 1 and 0). This can be achieved by a single pivot call, shown in the sketch below.

Combining with GroupBy
Pivoting also comes in handy when we wish to group by certain columns and then perform a pivot, for example to find the number of boys and girls per group. Let's say that we have a column "Age_Group", which has values 1, 2 and 3, and we wish to know the number of boys and girls in every age group. The sketch below covers this case as well.
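Both snippets are shown in one minimal sketch, assuming a hypothetical DataFrame classroom with Name, Gender and Age_Group columns (the sample rows are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-example").getOrCreate()

# Hypothetical Classroom data with Gender and Age_Group columns
classroom = spark.createDataFrame(
    [("Alice", "Female", 1), ("Bob", "Male", 1), ("Carol", "Female", 2)],
    ["Name", "Gender", "Age_Group"],
)

# Dummy-variable style pivot: one column per distinct Gender value, with 1/0 flags
dummies = classroom.groupBy("Name").pivot("Gender", ["Male", "Female"]).count().na.fill(0)
dummies.show()

# Combining with GroupBy: number of boys and girls per age group
classroom.groupBy("Age_Group").pivot("Gender").count().na.fill(0).show()
```

Renaming the pivoted columns to Gender_Male / Gender_Female, if desired, is a simple withColumnRenamed away.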

Running Linux Commands on Azure Databricks (Using PySpark)

Welcome back ! Today, we will explore the Databricks File Store and techniques for performing certain operations on it. By the end of this tutorial, you will learn how to:
- Run Linux commands on Azure Databricks
- Use Databricks Utilities
- Control Azure Blob storage from a Databricks notebook

Databricks Utilities (DBUtils) provide a way for us to access storage objects and work with them. A specific set of DBUtils commands, called the File system utilities, helps us deal with the Databricks File System, effectively letting us use Databricks as a file system.

Let us take an example: say we want to move a file from one location to another. The command in the sketch below does just that, and the same sketch shows how to explore the other available commands. But, caution ! These commands are no less risky than Linux commands run without fully understanding what they do. Hence, please be careful and review them before firing them off.
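The sketch is meant for an Azure Databricks notebook, where dbutils and display are predefined; the DBFS paths are hypothetical examples.

```python
# Runs inside an Azure Databricks notebook; `dbutils` and `display` are
# provided by the notebook environment. Paths are hypothetical examples.

# Move a file from one DBFS location to another
dbutils.fs.mv("dbfs:/FileStore/raw/songs.csv", "dbfs:/FileStore/processed/songs.csv")

# List the contents of the target directory to confirm the move
display(dbutils.fs.ls("dbfs:/FileStore/processed/"))

# Explore the other file-system utilities that are available
dbutils.fs.help()

# Plain Linux commands can be run in a separate cell with the %sh magic, e.g.
#   %sh
#   ls /dbfs/FileStore/processed/
```

As warned above, double-check source and destination paths before running move or remove commands, since they act on shared storage.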

Reading & Writing a file from / to Azure Blob Storage (Using PySpark)

Welcome back ! This is another PySpark tutorial, wherein we will see how to access files from Azure Blob Storage. Basically, we store large datasets or CSV files in Blob storage so that they can be retrieved in a Spark notebook and processed on the cluster.

First of all, we need to set up the Spark session with the appropriate credentials. The storage account name and key can be found in the Azure portal, under the storage account's Access keys. Then, we load the Blob over WASB (Windows Azure Storage Blob); the multiLine = true option takes care of newlines embedded within a record and returns the correct number of rows. Once loaded, you can display the Blob contents, and you have "blob_data" available as a PySpark DataFrame on which you can go ahead and do any kind of processing. Just as we read a file, we can also write the data back to Blob storage.
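A minimal sketch of the full round trip is given below, with placeholder names for the storage account, container, key and paths; it assumes a Databricks notebook where the spark session is predefined.

```python
# Minimal sketch of reading from and writing to Azure Blob Storage over WASB.
# Assumes a Databricks notebook where `spark` (the SparkSession) is predefined.
# Storage account, container, key and paths below are placeholders.
storage_account = "<storage-account-name>"
container = "<container-name>"
storage_key = "<storage-account-access-key>"

# Point the Spark session at the storage account
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    storage_key,
)

blob_path = f"wasbs://{container}@{storage_account}.blob.core.windows.net/data/input.csv"

# Read the CSV; multiLine handles newlines embedded inside quoted fields
blob_data = (
    spark.read
    .option("header", "true")
    .option("multiLine", "true")
    .csv(blob_path)
)

# Inspect the Blob contents
blob_data.show(5)

# Write the (possibly processed) data back to Blob storage
(
    blob_data.write
    .mode("overwrite")
    .option("header", "true")
    .csv(f"wasbs://{container}@{storage_account}.blob.core.windows.net/output/")
)
```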

Why PySpark is so Awesome !

Hey! If you have heard about Apache Spark, you may also know that it supports a number of programming languages, such as R, Python and Scala. Each of these languages, when combined with Spark, provides a way to perform distributed processing across nodes. But let's get to our intended topic of discussion ! Mirror, mirror, on the wall, who is the fairest among all ? Well, if I were the mirror, I would say that it is PySpark ! (Just kidding.) So what makes PySpark stand out from the rest (SparkR and Scala) ?

1. Strong buildings (PySpark) have strong foundations (Python)
Better readability, a gentle learning curve and greater popularity have led new software developers to view Python as a programmer's paradise. With strong machine learning libraries and general-purpose functionality, it is the ultimate toolbox for exploring and analyzing data.

2. Ease of reading and writing Blob files
Large datasets / CSV files are stored as Blobs on Azure Blob storage, and PySpark makes reading and writing them straightforward.