Sunday, May 27, 2018

Convert CSV files to Parquet using Azure HDInsight

A recent project I worked on used CSV files as part of an ETL process from on-premises to Azure. To improve performance further down the stream we wanted to convert the files to Parquet format (with the intent that eventually they would be generated in that format). I couldn't find a current guide for stepping through that process using Azure HDInsight, so this post provides one.

The scripts and sample data used in this guide are available at https://github.com/Matticusau/SQLDemos/tree/master/HDInsightConvertCSVtoParquet

To follow this blog post, make sure you have:

  1. Created a resource group in your Azure subscription
  2. Created a Storage Account within the resource group
  3. Created an Azure HDInsight resource in the same resource group (you can use that Storage Account for HDInsight)
  4. Uploaded the sample GZip compressed CSV files from the SampleData folder to the Storage Account using Azure Storage Explorer (or see the optional upload sketch after this list). In my case I uploaded to a container named "dataload"
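
If you would rather script step 4 than use Azure Storage Explorer, a minimal sketch using the legacy azure-storage-blob SDK (BlockBlobService) could look like the following; the account key and local file path are placeholders you will need to replace with your own values:

# Optional: upload the sample file with the azure-storage-blob SDK (legacy BlockBlobService API)
# The account key and local file path below are placeholders - replace with your own values
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='mlbigdatastoracc', account_key='<storage-account-key>')
blob_service.create_blob_from_path(
    container_name='dataload',
    blob_name='SalesSample_big.csv.gz',
    file_path='SampleData/SalesSample_big.csv.gz'
)
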
The work that we will perform is all within the Jupyter Notebook.

From your Azure Portal, locate the HDInsight resource and click the Cluster dashboard quick link


Now select the Jupyter Notebook


This will open a new tab/window.

Authenticate as the cluster administrator.

Create a new PySpark Notebook.


Paste the following lines and press Shift+Enter to run the cell.


from pyspark.sql import *
from pyspark.sql.types import *
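
In an HDInsight PySpark notebook the SparkSession is already created for you and exposed as spark, so no extra setup is needed. If you adapt this code outside of the notebook you would create the session yourself; a minimal sketch (the application name is arbitrary):

# Only required outside the HDInsight notebook, where `spark` already exists
spark = SparkSession.builder.appName("CsvToParquet").getOrCreate()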

Now we can import the CSV into a table. You will need to adjust the path to represent your storage account, container and file. The syntax of the storage path is wasb://mycontainer@myaccount.blob.core.windows.net/foldername/filename
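
If it helps, you can build that path from its parts so the cell is easier to adapt; the values below are placeholders to substitute with your own:

# Hypothetical placeholders - substitute your own storage account, container and blob name
storage_account = 'myaccount'
container = 'mycontainer'
blob_name = 'foldername/filename'
wasb_path = 'wasb://{0}@{1}.blob.core.windows.net/{2}'.format(container, storage_account, blob_name)
print(wasb_path)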


# import the COMPRESSED data
csvFile = spark.read.csv('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample_big.csv.gz', header=True, inferSchema=True)
csvFile.write.saveAsTable("salessample_big")


Press Shift+Enter to run the cell.
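
Note that inferSchema=True makes Spark scan the data to work out the column types, which can be slow on large compressed files. If you already know the layout you can pass an explicit schema instead; the column names and types below are illustrative only and should match your CSV header:

# Optional: supply an explicit schema instead of inferSchema=True
# Column names and types here are illustrative - match them to your CSV header
salesSchema = StructType([
    StructField("OrderId", IntegerType(), True),
    StructField("OrderDate", StringType(), True),
    StructField("Amount", DoubleType(), True)
])
csvFile = spark.read.csv('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample_big.csv.gz',
                         header=True, schema=salesSchema)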

Once complete, you can use SQL to query the table the data was imported into. This creates a DataFrame holding the output, which we will use to write the Parquet file.

dftable = spark.sql("SELECT * FROM salessample_big")
dftable.show()
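
Before writing the output it is worth a quick sanity check that the import looks right:

# Optional sanity checks on the imported data
print(dftable.count())   # number of rows loaded from the CSV
dftable.printSchema()    # confirm the inferred column types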

The final step is to export the DataFrame to a Parquet file. We will also use gzip compression.

dftable.write.parquet('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample2.parquet', compression='gzip')
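
To confirm the export worked you can read the Parquet output straight back into a DataFrame (the path matches the one used above):

# Optional: read the Parquet output back to verify the round trip
parquetFile = spark.read.parquet('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample2.parquet')
parquetFile.show(5)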

The complete Jupyter Notebook should look like this:

In your storage account you should have a Parquet export of the data (note that this format is not a single file, as shown by the file, folder and child files in the following screenshots).

In this example you may notice that the compressed file sizes are not much different, yet the Parquet file is slightly more efficient. Your experience may vary as it depends on the content within the CSV file.


Some reference material worth checking out if this is something you are working on:

Legal Stuff: The contents of this blog are provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility for the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.