Sunday, May 27, 2018

Convert CSV files to Parquet using Azure HDInsight

A recent project I worked on used CSV files as part of an ETL process from on-premises to Azure. To improve performance further downstream, we wanted to convert the files to Parquet format (with the intent that they would eventually be generated in that format). I couldn't find a current guide for stepping through that process using Azure HDInsight, so this post provides one.

The scripts and samples used in this guide are available

To follow this blog post, make sure you complete these steps first:

  1. Create a resource group in your Azure subscription.
  2. Create a Storage Account within the resource group.
  3. Create an Azure HDInsight resource in the same resource group (you can use that storage account for HDInsight).
  4. Upload the sample GZip-compressed CSV files from the SampleData folder to the Storage Account using Azure Storage Explorer. In my case I uploaded to a container named "DataLoad".

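If you do not have the sample files to hand, a small GZip-compressed CSV of your own is enough to try the steps. The following is a minimal sketch using only the Python standard library. The function name `write_sample_csv_gz`, the file name and the column names are all hypothetical and not taken from the original samples.

```python
import csv
import gzip


def write_sample_csv_gz(path, header, rows):
    """Write a header line and data rows to a GZip-compressed CSV file."""
    # "wt" opens the gzip stream in text mode; newline="" lets the csv
    # module control the line endings itself.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical columns and values, for illustration only.
    write_sample_csv_gz(
        "salessample_small.csv.gz",
        header=["OrderId", "Product", "Amount"],
        rows=[(1, "Widget", 9.99), (2, "Gadget", 24.50)],
    )
```

The resulting file can then be uploaded to the container in the same way as the supplied samples.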
The work we perform will all be done within a Jupyter Notebook.

From your Azure Portal, locate the HDInsight resource and click the Cluster dashboard quick link.

Now select the Jupyter Notebook.

This will open a new tab/window.

Authenticate as the cluster administrator.

Create a new PySpark Notebook.

Paste the following lines and press Shift+Enter to run the cell.

from pyspark.sql import *
from pyspark.sql.types import *

Now we can import the CSV into a table. You will need to adjust the path to represent your storage account, container and file. The syntax of the storage path is wasb://

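# import the COMPRESSED data; replace 'wasb://' with the full path to your file
csvFile = spark.read.csv('wasb://', header=True, inferSchema=True)
# register the DataFrame under the table name that the SQL query uses
csvFile.createOrReplaceTempView("salessample_big")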

Press Shift+Enter to run the cell
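For reference, a WASB path generally takes the form `wasb://<container>@<account>.blob.core.windows.net/<path>`, with `wasbs://` as the TLS-encrypted variant. The helper below is a small sketch for assembling such a path. The function name `wasb_path` and the container, account and file names are hypothetical placeholders, so substitute your own.

```python
def wasb_path(container, account, blob_path, secure=False):
    """Build a WASB URI for a blob in an Azure Storage account."""
    # wasbs:// is the encrypted variant of the wasb:// scheme.
    scheme = "wasbs" if secure else "wasb"
    return "{}://{}@{}.blob.core.windows.net/{}".format(
        scheme, container, account, blob_path.lstrip("/")
    )


# Placeholder names, for illustration only.
source = wasb_path("mycontainer", "mystorageaccount", "sales/sample.csv.gz")
print(source)
```

The resulting string can be passed as the first argument to `spark.read.csv`.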

Once complete, you can use SQL to query the table you imported the data into. The query returns a DataFrame that holds the output, which we will use to write the Parquet file.

dftable = spark.sql("SELECT * FROM salessample_big")

The final step is to export the DataFrame to a Parquet file. We will again use gzip compression.

dftable.write.parquet('wasb://', mode=None, partitionBy=None, compression="gzip")
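If you need to convert several files, the read and write steps can be wrapped in one function. This is only a sketch under a few assumptions: `csv_to_parquet` is a hypothetical name, `spark` is the SparkSession that the PySpark notebook already provides, and the intermediate SQL table is skipped because the DataFrame is written out unchanged.

```python
def csv_to_parquet(spark, source_path, target_path, compression="gzip"):
    """Read a CSV file (optionally GZip-compressed) and write it out as Parquet."""
    # Spark chooses the decompression codec from the file extension (e.g. .gz).
    df = spark.read.csv(source_path, header=True, inferSchema=True)
    # The default save mode applies, so the write fails if target_path exists.
    df.write.parquet(target_path, compression=compression)
    return df
```

In a notebook cell you would call it as `csv_to_parquet(spark, source, target)`, where `source` and `target` are your own wasb:// paths.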

The complete Jupyter Notebook should look like:

In your storage account you should have a Parquet export of the data (note that this format is not a single file, as shown by the file, folder and child files in the following screenshots).

In this example you may notice that the compressed file sizes are not much different, yet the Parquet output is slightly more efficient. Your experience may vary, as it depends on the content of the CSV file.
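Because the Parquet export is a folder of files, comparing it with the single CSV file means adding up the folder's contents. The sketch below assumes you have downloaded both to your local machine (for example with Azure Storage Explorer). The function name `total_size_bytes` and any paths you pass to it are hypothetical.

```python
import os


def total_size_bytes(path):
    """Return the size of a file, or the combined size of all files under a folder."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # Walk the folder so that every child file of the export is counted.
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling it once on the downloaded CSV file and once on the Parquet folder gives two byte counts you can compare directly.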

Some reference material worth checking out if this is something you are working on:

Legal Stuff: The contents of this blog are provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.

