A Step-by-Step Approach to Analyzing Data Problems (Series 101 - Part 1)

Hello!

In this post, we shall see how to take a systematic approach to solving data problems. The approach is intended to be generic enough to cover most Machine Learning problems. Each of these steps will be covered in depth in upcoming posts. Without further ado, let us grab a cup of coffee and get into the details, using a real-world problem as our running example.

The Problem

Let's say that we wish to predict whether an individual will like a certain new song or not.

Data Collection

The first step is to identify a data source which has relevant information and is large enough for us to do a fair analysis. The amount of data which counts as "sufficient" varies with the use-case and the nature of the data itself. For our use-case, we consider data from Spotify. (I will explain in a later post how to retrieve data from Spotify programmatically.)

Now, there could be a lot of data. Here, "a lot" refers to "dimensions" (columns) and not rows (the more rows, the better). Therefore, our current task is to do a preliminary scan and select which dimensions to extract from our data source. For our problem, we extract the sound features (acoustic, tone, danceability etc.) of each song and dump them right into an Excel file. And thus, we now have a "dataset" with us.
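
To make the "select the dimensions and dump them" step concrete, here is a minimal sketch using only Python's standard library. The records and field names below are made up for illustration (they are not Spotify's actual schema), and the sketch writes CSV, which Excel can open, rather than a native Excel file.

```python
import csv
import io

# Hypothetical raw records; the field names are illustrative only.
raw_tracks = [
    {"title": "Song A", "acousticness": 0.12, "danceability": 0.81, "internal_id": "x1"},
    {"title": "Song B", "acousticness": 0.67, "danceability": 0.35, "internal_id": "x2"},
]

# Dimensions we decided to keep after the preliminary scan.
selected_columns = ["title", "acousticness", "danceability"]


def write_dataset(records, columns, handle):
    """Write only the selected columns of each record as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# StringIO stands in for a real file, e.g. open("songs.csv", "w", newline="").
buffer = io.StringIO()
write_dataset(raw_tracks, selected_columns, buffer)
csv_text = buffer.getvalue()
```

The `extrasaction="ignore"` argument is what drops the dimensions we did not select.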

Data Exploration / Visualization


This may seem like a minor task, but in reality it is the foundation of solving any kind of machine learning problem. This step is also known as "EDA" (Exploratory Data Analysis). It is hard to suggest one specific approach for performing data exploration; however, the steps below are an attempt to make the process as generic as possible.
  1. Data Availability Check - Often, we come across variables which are sparsely populated. In other words, these variables contain a lot of NAs / Nulls, which need to be dealt with. In some cases, we may even need to discard the variable. Hence, it always helps to know the percentage availability of every variable in the dataset.
  2. Range of data - This applies to columns containing continuous variables (numeric data). Obtaining the minimum, maximum, mean and median helps us understand the data better.
  3. Variable Distribution - This applies to columns containing categorical variables. These variables contain a small number of distinct values. Knowing how many times each distinct value occurs helps us identify imbalance (skewness) in the dataset.
  4. Visualization - This means generating graphs and dashboards to educate ourselves about the "trends" / "patterns" in the data. (Tableau is a great tool for creating visualizations and dashboards.)
  5. Outlier Detection - Certain values within a column can be extremely large or extremely small compared to the other values in the column. These values can distort summary statistics and models, which could lead to problems later. Outliers are commonly detected using a diagram called a box plot. (Will be covered in an upcoming post.)
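
The checks above (apart from visualization, which needs a plotting tool) can be sketched in a few lines of plain Python. This is only an illustration on a made-up toy dataset: the column names are hypothetical, and the outlier check uses the 1.5 * IQR rule that a box plot's whiskers are usually based on. In practice, a data-frame library would do most of this for us.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"danceability": 0.81, "tempo": 120.0, "genre": "pop"},
    {"danceability": 0.35, "tempo": 95.0, "genre": "rock"},
    {"danceability": None, "tempo": 128.0, "genre": "pop"},
    {"danceability": 0.64, "tempo": 110.0, "genre": "pop"},
    {"danceability": 0.58, "tempo": 400.0, "genre": None},
]


def availability(rows, column):
    """Percentage of rows in which the column is populated."""
    present = sum(1 for row in rows if row[column] is not None)
    return 100.0 * present / len(rows)


def numeric_summary(rows, column):
    """Minimum, maximum, mean and median of the non-missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def value_counts(rows, column):
    """How often each distinct value of a categorical column occurs."""
    return Counter(row[column] for row in rows if row[column] is not None)


def iqr_outliers(rows, column):
    """Values lying more than 1.5 * IQR outside the first or third quartile."""
    values = [row[column] for row in rows if row[column] is not None]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [value for value in values if value < low or value > high]
```

For example, `availability(rows, "danceability")` tells us how well populated that column is, and `iqr_outliers(rows, "tempo")` flags the suspicious tempo of 400. Note that quartile conventions differ between tools, so borderline values may be flagged differently elsewhere.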

Feature Engineering

The term sounds fancy, but the idea behind it is simple: garbage in, garbage out. Let us take an example. Let's say that you have a Biology exam tomorrow. However, for some reason, you get confused about the timetable and end up believing that tomorrow's exam is History. So, you end up cramming all the dates, events and everything else. The next day, when the question paper arrives, you are in for a shock. You remember everything you studied the previous night and very little about the paper in front of you. This is exactly what could happen if you feed inappropriate information to your Machine Learning algorithm. Hence, it is essential to study our variables carefully, and perform engineering on them, before feeding them into our actual model.

There are two ways to do this: one, through variable reduction; and two, through the creation of new variables from two or more existing variables (known as derived variables). Both these methods will be covered in a later post.
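
As a small taste of both ideas, here is a hedged sketch in plain Python. The columns are hypothetical, the derived variable is just one arbitrary example, and dropping constant columns is only one of the simplest forms of variable reduction.

```python
import statistics

# Hypothetical toy records; the column names are illustrative only.
songs = [
    {"tempo_bpm": 120.0, "duration_min": 3.0, "sample_rate_khz": 44.1},
    {"tempo_bpm": 90.0, "duration_min": 4.0, "sample_rate_khz": 44.1},
    {"tempo_bpm": 150.0, "duration_min": 2.0, "sample_rate_khz": 44.1},
]


def add_derived_variable(rows):
    """Derived variable: combine two columns into a new one (total beats in the song)."""
    return [{**row, "total_beats": row["tempo_bpm"] * row["duration_min"]} for row in rows]


def drop_constant_columns(rows, tolerance=1e-12):
    """Variable reduction: remove columns whose variance is (almost) zero."""
    keep = [
        column
        for column in rows[0]
        if statistics.pvariance([row[column] for row in rows]) > tolerance
    ]
    return [{column: row[column] for column in keep} for row in rows]


engineered = drop_constant_columns(add_derived_variable(songs))
```

Here `sample_rate_khz` is identical for every song, so it carries no information for the model and is dropped, while the new `total_beats` column is kept.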

Algorithm / Model Selection

Selecting the right algorithm for a problem out of the entire arsenal of Machine Learning algorithms can be a daunting task. But how do we know which one to choose? The first step is to understand the nature of our problem statement, and the kind of output we desire. In our case, we wish to know whether a user will like a particular song or not. This can be denoted in terms of 2 values: 0 (for dislike) and 1 (for like). Our output would be either of these 2 values. We call this a typical classification problem (with 2 classes: 0 and 1).

The next step is to lay out all the possible algorithms which can give us this binary output. Although the details are not in the scope of this introductory article, some of the algorithms which can be used to solve a binary classification problem are Logistic Regression and Decision Tree.
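
To show what "an algorithm that gives a 0 / 1 output" looks like, here is a from-scratch sketch of Logistic Regression trained with plain gradient descent. It is a teaching sketch and not how a production library implements it; the data, the learning rate and the number of epochs are all assumptions chosen for this toy example.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Difference between the predicted probability and the true label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (like) if the predicted probability is at least 0.5, else 0 (dislike)."""
    probability = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if probability >= 0.5 else 0


# Hypothetical data: [danceability, acousticness] and whether the listener liked the song.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.9], [0.3, 0.8], [0.1, 0.7]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

The model outputs a probability between 0 and 1, and the 0.5 cut-off in `predict` turns that probability into one of our two classes.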

Model Tuning




Just as we need to tune the radio to the appropriate frequency to be able to hear the channel audio properly, we also need to tweak certain settings of our selected algorithm to achieve the best results possible. This is an advanced topic which will be discussed in a later post.
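
One common way of doing such tweaks is a "grid search": try a handful of candidate settings and keep the one that scores best on data the model has not been trained on. The sketch below is only an illustration. It uses a k-nearest-neighbours classifier (my choice for brevity, not one of the algorithms named above), where the setting being tuned is k, and the data is made up, with one deliberately noisy label.

```python
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def validation_accuracy(train_x, train_y, valid_x, valid_y, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(valid_x, valid_y))
    return hits / len(valid_y)


def grid_search(train_x, train_y, valid_x, valid_y, candidates):
    """Try every candidate k and keep the one with the best validation accuracy."""
    scores = {
        k: validation_accuracy(train_x, train_y, valid_x, valid_y, k) for k in candidates
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical data: [danceability, acousticness]; the fourth label is deliberately noisy.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.2, 0.9], [0.3, 0.8], [0.1, 0.7]]
train_y = [1, 1, 1, 0, 0, 0, 0]
valid_x = [[0.62, 0.38], [0.85, 0.2], [0.25, 0.85], [0.15, 0.75]]
valid_y = [1, 1, 0, 0]

best_k, scores = grid_search(train_x, train_y, valid_x, valid_y, candidates=[1, 3, 5])
```

With k = 1 the model is misled by the single noisy training label, while a larger k smooths it out. Scoring on a separate validation set is what lets us notice that difference.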

Result Benchmarking

Once we have run the model and got the output, the next step is to find out how good our results are. This requires the use of statistical measures to compare our results with a standard benchmark. These measures are commonly called "evaluation metrics". Some of the metrics which we can use for our current problem statement are accuracy, precision and F1-score, to mention a few.
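
These metrics are simple enough to compute by hand, as the sketch below shows. The labels are made up for illustration. Recall is included as well, because the F1-score is defined as the harmonic mean of precision and recall.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(actual, predicted):
    """Accuracy, precision, recall and F1-score for a binary (0 / 1) problem."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / len(actual)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = liked the song, 0 = did not like it.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

metrics = classification_metrics(actual, predicted)
```

Accuracy tells us how often we were right overall, while precision tells us how many of the songs we predicted as "liked" were actually liked.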


Thus, these are the basic steps which, carried out in a systematic manner, can help us arrive at a good solution for our problem. Hope you enjoyed it.

So, that's it for today, folks! Stay tuned for more articles! Your feedback is most welcome.
