Reading & Writing a File from/to Azure Blob Storage (Using PySpark)

Welcome back! This is another PySpark tutorial, in which we will see how to access files in Azure Blob Storage. Typically, we store large datasets or CSV files in Blob Storage so that they can be read into a Spark notebook and processed on the cluster.

First of all, we need to configure the Spark session with the appropriate credentials:
# Set Spark configuration
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<storage-account-access-key>"
)
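If you prefer not to hard-code the access key in the notebook, it can also be pulled from a Databricks secret scope. A minimal sketch, assuming a secret scope named "blob-secrets" holding a key named "storage-key" (both names are hypothetical):

# Fetch the access key from a secret scope instead of hard-coding it
# (scope "blob-secrets" and key "storage-key" are assumed to exist already)
storage_key = dbutils.secrets.get(scope="blob-secrets", key="storage-key")
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    storage_key
)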

The storage account name and access key can be found in the Azure portal, under your storage account's "Access keys" section.


Then, we use the following command to load the blob via WASB (Windows Azure Storage Blob):
# Load the CSV file from WASB
blob_data = spark.read.option("multiLine", "true").format("csv") \
    .load("wasbs://container-name@karthik.blob.core.windows.net/user_data_with_segment_with_digital_data_45_K.csv",
          sep=",", header="true", encoding="UTF-8")

The multiLine = true option handles newline characters embedded inside a quoted field, so the read returns the correct number of rows.
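Here is a small, self-contained sketch of what multiLine changes; the sample path and file contents below are made up purely for illustration:

# A tiny CSV with a quoted field that spans two lines (illustrative only)
sample_path = "dbfs:/tmp/multiline_demo.csv"
dbutils.fs.put(sample_path, 'id,comment\n1,"line one\nline two"\n2,"single line"\n', True)

# Without multiLine, the embedded newline splits the record across rows
print(spark.read.option("header", "true").csv(sample_path).count())

# With multiLine, the quoted newline stays inside a single field (2 rows)
print(spark.read.option("header", "true").option("multiLine", "true").csv(sample_path).count())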

In order to see the Blob contents, you can issue the following command:
display(blob_data)

You now have the "blob_data" PySpark DataFrame loaded with the blob's contents, and you can go ahead and do any kind of processing on it.
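For example, a couple of typical DataFrame operations; the column name "segment" below is only an assumption about the CSV's schema:

# Inspect the schema and row count of the loaded data
blob_data.printSchema()
print(blob_data.count())

# Hypothetical aggregation on an assumed "segment" column
(blob_data
    .filter(blob_data["segment"].isNotNull())
    .groupBy("segment")
    .count()
    .show())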

Just as we have seen how to read a file, we will now look at how to write a DataFrame back to Blob Storage. Below are the lines to do so:
'''
Write the DataFrame out to Blob Storage
'''
output_container_path = "wasbs://%s@%s.blob.core.windows.net" % ("container-name", "account-name")
output_blob_folder = "%s/RDD_Save_Test" % output_container_path

# RDD_data is the DataFrame we want to persist
(RDD_data
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .save(output_blob_folder))
The DataFrame is written as one or more files whose names start with "part-", inside the output_blob_folder path (the "RDD_Save_Test" folder here). If we want to write the data into one single CSV file instead, we can use coalesce(1). This approach works in two steps:

1. Writes a single part file with its name starting with "part-"
2. Then moves this part file to a CSV file with our desired name

'''
Write the DataFrame out as a single CSV file
'''
output_container_path = "wasbs://%s@%s.blob.core.windows.net" % ("container-name", "account-name")
output_blob_folder = "%s/RDD_Save_Test" % output_container_path

# Coalesce to a single partition so only one part file is produced
(RDD_data
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .save(output_blob_folder))

# Locate the part file and move it to a CSV with the desired name
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]
dbutils.fs.mv(output_file[0].path, "%s/RDD_data.csv" % output_container_path)
(Caution: this is not recommended for very large datasets, as coalesce(1) forces all the data through a single task, which is time consuming and does not take advantage of Spark's distributed processing.)

Cheers! We can now move data between Blob Storage and the notebook. That's it for today; hope to see you soon!
