Thursday, May 12, 2022

Validate the row counts of PBI tables in Data Flows

Recently I had to validate the row counts of some existing tables in PBI Data Flows that I was converting over to a new source, so I wanted to make sure the row counts matched between the current and new source systems.

A simple way to do this is to add a blank query and use the Table.FromRecords and Table.RowCount functions:

let
  // Build one record per query, pairing a label with the row count of the referenced table
  Source = Table.FromRecords({
        [Table = "Table 1 Original", RowCount = Table.RowCount(#"Table 1 Original")],
        [Table = "Table 1 New", RowCount = Table.RowCount(#"Table 1 New")],
        [Table = "Table 2 Original", RowCount = Table.RowCount(#"Table 2 Original")],
        [Table = "Table 2 New", RowCount = Table.RowCount(#"Table 2 New")]
    })
in
  Source
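
The result is a small two-column table (Table, RowCount) that can be eyeballed directly, or extended with a custom column to flag any mismatches between the original and new versions of each table.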


Friday, March 4, 2022

Azure Synapse Notebook variable & parameter language context

While designing a data platform using Azure Synapse (or Azure Data Factory), it is quite common to want to use variables and/or parameters to control various aspects of your code execution. One common example of this is parameterizing your pipelines for environment configuration control, that is, to distinguish between Dev, Test, and Prod.

There are many articles already about how to parameterize an Azure Synapse Notebook, so I won't rehash that in detail. Here is my favourite for reference:


What isn't well documented is that the Notebook's default language is the language context in which the parameter will be set, regardless of whether you use a language magic command in your Parameterized Cell.

It makes sense if you stop and think about it, but it can catch you out if your Notebooks require multiple languages to complete their tasks. For example, let's assume this scenario.

Scenario: To interact with ADLS and other sources you use PySpark, but for creating data views etc. you prefer Spark SQL due to language familiarity (TSQL), code formatting, and maybe code migration. Because the bulk of your code cells are Spark SQL, that is the default language you have set for the Notebook, so your PySpark cells use the %%pyspark magic command.
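
As a rough sketch of that setup (the view name and storage path below are made up for illustration):

-- The default language is Spark SQL, so this cell needs no magic command
CREATE OR REPLACE TEMPORARY VIEW vw_sales AS
SELECT * FROM parquet.`abfss://raw@mydatalake.dfs.core.windows.net/sales/`

%%pyspark
# Any cell in another language needs an explicit magic command
df = spark.read.parquet('abfss://raw@mydatalake.dfs.core.windows.net/sales/')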

My pipeline has a parameter



The Notebook activity is passing that parameter through via dynamic values
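
For reference, the Notebook activity's base parameter uses a dynamic content expression along the lines of the following (assuming the pipeline parameter is also named env_selection):

@pipeline().parameters.env_selection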


The notebook is set to Spark SQL as the default language


The following screenshots use the pipeline execution and Notebook snapshot to trace the output/state of each step and highlight what is happening:

We can see the Parameterized Cell defined in PySpark code, and then the Runtime Parameter being passed in and set in Spark SQL.
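
Conceptually the two cells end up looking something like this; the second cell is generated by the pipeline run, and its exact form here is an assumption:

%%pyspark
# Parameterized Cell - the default value used when the Notebook runs interactively
env_selection = 'dev'

-- Injected runtime parameters cell, generated in the Notebook's default language (Spark SQL)
SET env_selection = 'prod'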

Using some simple output commands in those respective languages, we can trace the current values.
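
A minimal sketch of those trace commands, with the outputs the screenshots illustrated shown as comments:

%%pyspark
print(env_selection)    # prints: dev - the PySpark variable from the Parameterized Cell

-- In a default-language (Spark SQL) cell
SELECT ${env_selection} -- returns: prod - the runtime value set in the Spark SQL context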



This shows that the PySpark variable defined in our Parameterized Cell was not replaced and still carries the default value of 'dev'. The parameter passed into the Notebook was set in Spark SQL, and the output in that language correctly has the value 'prod'.

That's not necessarily a problem, except when we need to reference the Parameter value from PySpark (or another language). We can recast the value from Spark SQL into PySpark with some simple code.

%%pyspark
# Substitute the Spark SQL variable into a query; the literal value becomes the column name
param_df = spark.sql('SELECT ${env_selection}')
# Grab that column name back out as a PySpark variable
env_selection = param_df.columns[0]

Now we can see the Print function in our PySpark code returns the correct value from the Parameter.
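
That is, something along the lines of:

%%pyspark
print(env_selection)    # now prints: prod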



This behavior also applies to any variables you define within a specific language via the magic command.
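
For example, a value defined in a %%pyspark cell lives only in the Python session; one way to make it available to Spark SQL cells is to push it into the shared Spark configuration so it can be picked up via the same ${...} substitution. A sketch, assuming spark.sql.variable.substitute is left at its default of true, and with a made-up variable name:

%%pyspark
# Hypothetical value defined in a PySpark cell - not visible to Spark SQL on its own
run_date = '2022-03-04'
# Push it into the session's Spark configuration so other language contexts can read it
spark.conf.set('run_date', run_date)

-- Back in a default-language (Spark SQL) cell, the shared value can now be substituted
SELECT '${run_date}' AS run_date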


Obviously, the right outcome here is to think carefully about what your Notebook's default language should be, based on its requirements and these behaviors. As I stated above, though, there are times you may need to use multiple languages and pass variable/parameter values between them to achieve some functionality.