Reading & Writing a file from / to Azure Blob Storage (Using PySpark)
Welcome back! This is another PySpark tutorial, in which we will see how to access files in Azure Blob Storage. Typically, we store large datasets or CSV files in Blob storage so that they can be retrieved in a Spark notebook and processed on the cluster.
First of all, we need to set up the Spark session with the appropriate credentials for the storage account.
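Here is a minimal sketch of this step, assuming account-key authentication and a Databricks-style notebook where "spark.conf.set" is picked up by the Blob driver. The placeholder values and the helper names "account_key_conf" and "get_session" are ours, purely for illustration; a SAS token or a different environment may need other settings.

```python
# Placeholder values (assumptions): replace with your own storage account details.
STORAGE_ACCOUNT = "<storage-account-name>"
ACCOUNT_KEY = "<storage-account-access-key>"


def account_key_conf(storage_account):
    """Return the config key under which the account access key is registered."""
    return f"fs.azure.account.key.{storage_account}.blob.core.windows.net"


def get_session(storage_account, account_key):
    """Create (or reuse) a Spark session and register the Blob credentials."""
    # Imported here so that the helper above works even without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob-storage-demo").getOrCreate()
    spark.conf.set(account_key_conf(storage_account), account_key)
    return spark


# Usage in a notebook:
# spark = get_session(STORAGE_ACCOUNT, ACCOUNT_KEY)
```

Keeping the access key out of the notebook itself (for example, in a secret store) is generally the safer choice.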
Then, we load the Blob through the WASB (Windows Azure Storage Blob) driver by pointing the Spark reader at the Blob's path.
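The load can be sketched as below, assuming a CSV file with a header row and a session that already carries the credentials. The helper names "wasbs_path" and "read_blob_csv", and the container, account and file placeholders, are hypothetical.

```python
def wasbs_path(container, storage_account, relative_path):
    """Build a wasbs:// URL for a file or folder inside a Blob container."""
    return (
        f"wasbs://{container}@{storage_account}.blob.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


def read_blob_csv(spark, path):
    """Read a CSV from Blob storage using an already-configured Spark session."""
    return (
        spark.read.format("csv")
        .option("header", "true")     # assumption: the first row holds column names
        .option("multiline", "true")  # records may contain embedded newlines
        .load(path)
    )


# Usage in a notebook (placeholders are assumptions):
# path = wasbs_path("<container-name>", "<storage-account-name>", "<path/to/file.csv>")
# blob_data = read_blob_csv(spark, path)
```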
Setting the "multiline" option to "true" takes care of newlines embedded within a record (for example, inside a quoted field), so that the correct number of rows is returned.
To see the Blob contents, you can call "show()" on the loaded data.
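A small sketch of such a preview, assuming "blob_data" was loaded through the Spark reader. The helper name "preview" and the default of 5 rows are our own choices.

```python
def preview(blob_data, rows=5):
    """Print the schema and the first few rows of the loaded data."""
    blob_data.printSchema()
    # truncate=False prints full cell values instead of cutting long strings short
    blob_data.show(rows, truncate=False)


# Usage in a notebook:
# preview(blob_data)
```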
You now have the Blob contents loaded into "blob_data" (strictly speaking a PySpark DataFrame rather than a raw RDD, since it comes from the Spark reader). Now, you can go ahead and do any kind of processing on it.
Just as we have seen how to read a file, we will now look at how to write the data back to the Blob storage. This is done with the writer and a target folder path.
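The write can be sketched along these lines, assuming CSV output and that overwriting an existing output folder is acceptable. The helper name "write_blob_csv" and the path placeholder are hypothetical.

```python
def write_blob_csv(blob_data, output_path):
    """Write the data as CSV part files under the given Blob folder."""
    (
        blob_data.write.format("csv")
        .option("header", "true")
        .mode("overwrite")  # assumption: replacing any existing output is fine
        .save(output_path)
    )


# Usage in a notebook (placeholders are assumptions):
# write_blob_csv(
#     blob_data,
#     "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<output-folder>",
# )
```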
The data is written as "part" files within the folder named "". In case we wish to write it into one single CSV file, we can use the "coalesce(1)" option. This approach works in 2 steps:
1. Writes a single part file with its name starting with "part-"
2. Then moves this part file to a CSV file with our desired name
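The two steps above can be sketched as follows. This assumes a Databricks notebook, where the "dbutils" object is available for file operations; the helper names "find_part_file" and "write_single_csv", the temporary folder and the final path are all hypothetical.

```python
def find_part_file(file_names):
    """Return the name of the single CSV part file in a folder listing."""
    parts = [n for n in file_names if n.startswith("part-") and n.endswith(".csv")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


def write_single_csv(blob_data, dbutils, temp_folder, final_path):
    """Write blob_data as one CSV file at final_path via a temporary folder."""
    # Step 1: collapse to one partition so that Spark emits a single part file.
    (
        blob_data.coalesce(1)
        .write.format("csv")
        .option("header", "true")
        .mode("overwrite")
        .save(temp_folder)
    )

    # Step 2: locate the part file and move it to the desired name.
    names = [f.name for f in dbutils.fs.ls(temp_folder)]
    part = find_part_file(names)
    dbutils.fs.mv(temp_folder.rstrip("/") + "/" + part, final_path)

    # Remove the leftover folder, which still holds marker files such as _SUCCESS.
    dbutils.fs.rm(temp_folder, True)


# Usage in a Databricks notebook (paths are placeholders):
# write_single_csv(blob_data, dbutils, "<temp-folder-path>", "<final-file-path>.csv")
```

Outside Databricks, the move in step 2 would need another file-system API, so treat the "dbutils" calls as one possible option.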
(Caution: This is not usually recommended for very large files, since "coalesce(1)" funnels all the data through a single partition, which is time consuming and gives up Spark's distributed processing.)
Cheers! Now we can communicate between the Blob storage and the notebook. That's it for today, and hope to see you soon!