Thursday, May 12, 2022

Validate the row counts of PBI tables in Data Flows

Recently I had to validate the row counts of some existing tables in Power BI Data Flows that I was converting over to a new source, and I wanted to make sure the row counts matched between the current and new source systems.

A simple way to do this is to add a blank query and use the Table.FromRecords and Table.RowCount functions:

let
  Source = Table.FromRecords({
        [Table = "Table 1 Original", RowCount = Table.RowCount(#"Table 1 Original")],
        [Table = "Table 1 New", RowCount = Table.RowCount(#"Table 1 New")],
        [Table = "Table 2 Original", RowCount = Table.RowCount(#"Table 2 Original")],
        [Table = "Table 2 New", RowCount = Table.RowCount(#"Table 2 New")]
    }) 
in
  Source


Friday, March 4, 2022

Azure Synapse Notebook variable & parameter language context

While designing a data platform using Azure Synapse (or Azure Data Factory) it is quite common to want to use variables and/or parameters to control various aspects of your code execution. One common example is parameterizing your pipelines for environment configuration control, that is, to distinguish between Dev, Test and Prod.

There are many articles already about how to parameterize an Azure Notebook, so I won't rehash that in detail. Here is my favourite for reference:


What isn't well documented is that the Notebook's default language is the language context in which the parameter is set, regardless of whether you use a language magic command in your Parameterized Cell.

It makes sense if you stop and think about it, but it can catch you out if your Notebooks require multiple languages to complete their tasks. For example, consider this scenario.

Scenario: To interact with ADLS and other sources you use PySpark, but for creating data views etc. you prefer Spark SQL due to language familiarity (T-SQL), code formatting, and maybe code migration. Because the bulk of your code cells are Spark SQL, that is the language the Notebook is set to, and therefore your PySpark cells use the %%pyspark magic command.

My pipeline has a parameter



The Notebook activity is passing that parameter through via dynamic values


The notebook is set to Spark SQL as the default language


The following screenshots use the pipeline execution and Notebook snapshot to trace the output/state of each step and highlight what is happening:

We can see the Parameterized Cell defined in PySpark code, and then the Runtime Parameter being passed in and set in Spark SQL.

Using some simple output commands in those respective languages we trace the current values.



This shows that the PySpark variable defined in our Parameterized Cell was not replaced and still carries the value 'dev'. The parameter passed into the Notebook was set in Spark SQL, and the output in that language correctly shows the value 'prod'.

That's not necessarily a problem, unless we need to reference the parameter value from PySpark (or another language). We can recast the value from Spark SQL to PySpark with some simple code.

%%pyspark
# Read the Spark SQL parameter back into a PySpark variable. The substituted value
# becomes the single column name of the result, so columns[0] returns it.
param_df = spark.sql('SELECT ${env_selection}')
env_selection = param_df.columns[0]

Now we can see the Print function in our PySpark code return the correct value from the Parameter.
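As a rough re-creation of that check (assuming the same variable name used in the cells above):

%%pyspark
# After the recast above, the PySpark variable now reflects the pipeline parameter.
print(env_selection)   # -> 'prod' in this scenario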



This behavior also applies to any variables you define within a specific language via the magic command.


Obviously, the right outcome here is to think carefully about what your Notebook's default language should be, based on its requirements and behaviors. As I said above, though, there are times you may need to use multiple languages and pass variable/parameter values between them to achieve some functionality.
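For completeness, here is one way to pass a value in the other direction, from PySpark into Spark SQL. This is a minimal sketch of my own (the view name run_params is arbitrary), not something taken from the screenshots above:

%%pyspark
# Expose a PySpark value to Spark SQL by registering it as a single-row temporary view.
spark.createDataFrame([(env_selection,)], ["env_selection"]).createOrReplaceTempView("run_params")
# A %%sql cell can then read it with: SELECT env_selection FROM run_params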


Thursday, November 15, 2018

Power BI Resources

The following is a list of resources useful for Power BI training and product news:

Product News


Power BI Blog
https://powerbi.microsoft.com/en-us/blog/


Syntax and Language Reference

DAX Reference

Social Media Resources

Follow these resources on Twitter:
Power BI @MSPowerBI 
Will Thompson @Will_MI77

Training and Other Resources

YouTube channels (for training and learning):


Legal Stuff: The contents of this blog is provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.

Saturday, July 7, 2018

PowerBI calling a REST API for each row in a table.

Take this scenario.
You're writing a PowerBI report to pull data from a REST API. The API doesn't provide a search route which returns all the rows and data you require in one call. You can retrieve a list of IDs for the object type you require, but a separate call is required to get the individual details per row. Or you may have an existing table with the IDs stored, and require additional details which are provided by an API call per row.

I was asked about this scenario by a colleague, and it can definitely be solved with PowerBI. I will urge caution though: if you have a large record set in the base table and you plan to do an API call per row, it will be slow and result in a large number of calls to the API in a short duration, which could be flagged as an attack by some monitoring systems.

Demo API

For this blog I have provided a demo NodeJS API that you can download and run in VSCode, but you could also use any public API, such as my SQL Versions API (details on that will follow at a later time).

To run the API, follow the instructions in README.md.
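Once the API is running, a quick request outside of PowerBI is an easy way to confirm the routes respond before building the query. Here is a minimal sketch using Python's requests library, assuming the demo routes used later in this post and that each route returns JSON shaped like the sample data below:

import requests

# Smoke test the demo API routes used throughout this walk-through.
base = "http://localhost:3000/api"
print(requests.get(base + "/courseinstructor").json()[:2])  # first two CourseID/PersonID mappings
print(requests.get(base + "/course/1045").json())           # details for a single course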


Demo Data

The demo API has some static sample data: a collection of Course Instructors, which maps to a Course collection and also to a Person collection. The complete sample data was taken from an old .NET app sample (https://msdn.microsoft.com/library/bb399731(v=vs.100).aspx).

Here is a snippet of the structure and data:

CourseInstructor = [
    {
      "CourseID": "1045",
      "PersonID": "5"
    },
    ...
];
Course = [
    {
        "CourseID": "1045",
        "Title": "Calculus",
        "Credits": "4",
        "DepartmentID": "7"
    },
    ...
];
Person = [
    {
        "PersonID": "1",
        "LastName": "Abercrombie",
        "FirstName": "Kim",
        "HireDate": "1995-03-11 00:00:00.000",
        "EnrollmentDate": null
    },
    ...
];



Let's get started

1. Get the base source
For this walk-through let's start with the CourseInstructor data. This is a list of CourseIDs and PersonIDs, but no other details. We will then use the individual API calls to build the completed table.
In the demo API the route is http://localhost:3000/api/courseinstructor.

In PowerBI select Get Data

In the form, type "web" to narrow the results and locate the Web connector.


In the next screen enter the API route (e.g. http://localhost:3000/api/courseinstructor). This should be an HTTPS URI for security.



If your API needs authorization, or a token, then you should set the appropriate headers. I have an example in the completed solution at the end.

2. Convert to table
The results will first be shown as a list, as PowerBI uses the Json.Document() function to parse the retrieved data.



Click the "To Table" button in the menu ribbon.



In the To Table dialog you should be able to leave the defaults. However, if the API JSON data isn't very structured, in that each document has different elements and sizes, then you may need to experiment with the error handling options.




3. Expand the columns
The results will now be a typical table with a record column that you need to expand. Click on the expand columns button.


Select the required columns. This will depend on your API data; in the demo data's case we want both columns, as we will add additional API routes to get that data.

My tip is to untick the "Use original column name as prefix" option. Or, if you do want to use it, make sure you have renamed the column to a meaningful name prior to this step. You can always set a meaningful name by editing the step settings afterwards, but do so before you proceed with other steps.


The results should now be expanded as required.





4. Expand the data with additional API calls per row
Now for the fun part. We now have rows of data which contain our ID(s). To expand this data we need to make an additional API call for each row. To achieve this we add a custom column and dynamically build the API URI.

Click the Add Column tab in the ribbon, and click Custom Column.



Give the column a meaningful name (e.g. CourseCollection).

Here is the tricky part: we have to dynamically build the API URI. To do this we can use the Text.Combine() function to concatenate the URI and add the column value where the parameter should be. For our demo data the completed URI will be http://localhost:3000/api/course/[CourseID]

Adjust the following expression as required:

= Json.Document(
  Web.Contents(
    Text.Combine({"http://localhost:3000/api/course/",[CourseID]})
  )
)

If you need to provide headers for authorization then use the optional parameters of the Web.Contents() function, for example:

= Json.Document(
  Web.Contents(
    Text.Combine({"http://localhost:3000/api/course/",[CourseID]}),
    [Headers=[#"Z-API-Token"=""]]
  )
)


NOTE: If prompted regarding the data privacy, click continue and set as appropriate.

IMPORTANT: The process of updating the table may take some time depending on the number of rows and speed of the API.

The results should now include the custom column (e.g. CourseCollection). 



This column needs to be expanded, the same as earlier, by clicking on the expand columns button and then selecting the desired columns to include. In my demo data's case, I am only interested in the Title column.



Your table should now include the data from the additional API call.


Continue to model the data as you require, or add additional API calls. In the case of the demo data we will repeat these steps for the Person column using the API route http://localhost:3000/api/person/[PersonID]

The completed table will look like:


The query steps will be similar to the following:



The complete M query for creating this table will be similar to the following. It is available from the Advanced Editor of the query, and is a working query for the demo API.

let
    Source = Json.Document(Web.Contents("http://localhost:3000/api/courseinstructor")),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"CourseID", "PersonID"}, {"CourseID", "PersonID"}),
    #"Added CourseCollection" = Table.AddColumn(#"Expanded Column1", "CourseCollection", each Json.Document(
        Web.Contents(
            Text.Combine({"http://localhost:3000/api/course/",[CourseID]})
        )
    )),
    #"Expanded CourseCollection" = Table.ExpandRecordColumn(#"Added CourseCollection", "CourseCollection", {"Title"}, {"Title"}),
    #"Added PersonCollection" = Table.AddColumn(#"Expanded CourseCollection", "PersonCollection", each Json.Document(
        Web.Contents(
            Text.Combine({"http://localhost:3000/api/person/",[PersonID]})
        )
    )),
    #"Expanded PersonCollection" = Table.ExpandRecordColumn(#"Added PersonCollection", "PersonCollection", {"LastName", "FirstName"}, {"LastName", "FirstName"})
in
    #"Expanded PersonCollection"



Demo Files and Source

I have provided a completed PBIX in the github repository. 







Legal Stuff: The contents of this blog is provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.


Saturday, June 9, 2018

New SQL Version Life Cycle tool

Introducing my latest SQL Server Life Cycle and Versions lookup tool.

https://sqlversions.azurewebsites.net

Back in 2016 I released the first version of this app, and the goals were the same: provide an interface to easily look up a SQL version number and find out the support status of that release, all backed with an API that could be queried from scripts and other tools.

This new release is a complete re-write and includes a modern responsive site written in Angular (because I felt the need to learn).

With this initial release I am providing the following capabilities in the UI:

Version Search

The search feature allows you to look up a specific release of SQL Server or a version number and see all the releases that match. In the results you can see the release date, and the mainstream and extended support dates.



Clicking on a row in the results will open the details for that release.



Life Cycle

The life cycle search page is simply a page to quickly look up when certain branches of a product release will end mainstream or extended support. This was created because, while consulting onsite, I sometimes need to quickly look up life cycle information rather than a specific version. It was also a "wish list feature" from a customer I was working with at the time.




Health Check [beta]

The health check is a beta release of my end goal, which is to allow you to enter your version number and get basic recommendations about how to upgrade. The initial release simply recommends either starting to plan an upgrade or urgently upgrading, depending on that version's support status. My vision for this feature is to provide guidance about updates released on the same branch to stay current, and also life cycle plans for upgrades to service packs or the next release.




API

One of my main goals with this app was to ensure that it had an API which could be queried and then used to extend other tools and scripts. For example, you could incorporate it into a health check script to retrieve the version number from the SQL instance and then call the API to get information about the life cycle and support status of that release. Or you could use Power BI, taking data from your CMDB and performing a lookup against the API for each row, or a search against the API, and storing that as a query with a relationship link.

The following code provides an example of using the API in a Health Check script. The code is available at https://gist.github.com/Matticusau/5778b90507cb7274deebc12cf4360c1c

# Import the SQL Module
Import-Module SqlServer;

# Get the version number via appropriate method
$VersionNumber = Invoke-Sqlcmd -ServerInstance $SqlServerName -Query "SELECT SERVERPROPERTY('PRODUCTVERSION')";
$VersionNumber = $VersionNumber.Column1;

# Call the API to get the version information
$VersionData = Invoke-RestMethod -Uri "http://sqlserverbuildsapi.azurewebsites.net/api/builds?version=$($VersionNumber)";

# To improve your Health Check script, calculate the health of the support status
if ($VersionData.ExtendedSupportEnd -le (Get-Date)) {$SupportStatus = 'Critical'}
elseif ($VersionData.MainstreamSupportEnd -le (Get-Date)) {$SupportStatus = 'Warning'}
else {$SupportStatus = 'Ok'}

# format the output data string
$OutputData = @"
Instance = $($Instance.Name)
Version = $($VersionData.BuildVersion)
Product = $($VersionData.ProductName)
Branch = $($VersionData.BranchName)
Update = $($VersionData.BuildUpdate)
MainstreamSupportEnd = $($VersionData.SupportEndMainstream)
ExtendedSupportEnd = $($VersionData.SupportEndExtended)
SupportStatus = $($SupportStatus)
"@

# Return the formatted output string
$OutputData
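If PowerShell isn't your tool of choice, the same lookup can be done from any language that can issue an HTTP request. Here is a rough Python sketch of my own, assuming the same endpoint and query string as the script above (the property names returned may differ between versions of the API):

import requests

# Version number as returned by SELECT SERVERPROPERTY('PRODUCTVERSION')
version_number = "13.0.5026.0"

response = requests.get(
    "http://sqlserverbuildsapi.azurewebsites.net/api/builds",
    params={"version": version_number},
)
response.raise_for_status()
version_data = response.json()

# Inspect the returned life cycle fields and apply your own support status rules.
print(version_data)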

Soon, I will update the samples repository I provided for the previous version to reflect the new API syntax.




Lastly, a note on the data: there is no dynamic link from this data to any Microsoft or other site. The data is provided "as-is" and manually maintained by myself and a few trusted peers. We do our best to make sure it is up to date and accurate, but for any business-critical or commercial decision make sure you refer to the official sources.

If you like this tool, have some ideas for improvements, or notice inaccuracies in the data, please let me know.


Legal Stuff: The contents of this blog is provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.

Sunday, May 27, 2018

Convert CSV files to Parquet using Azure HDInsight

A recent project I worked on used CSV files as part of an ETL process from on-premises to Azure, and to improve performance further down the stream we wanted to convert the files to Parquet format (with the intent that eventually they would be generated in that format). I couldn't find a current guide for stepping through that process using Azure HDInsight, so this post will provide that.

Scripts and sample data used in this guide are available at https://github.com/Matticusau/SQLDemos/tree/master/HDInsightConvertCSVtoParquet

To follow this blog post make sure you have:

  1. Created a resource group in your Azure Subscription
  2. Created a Storage Account within the resource group
  3. Created an Azure HDInsight resource in the same resource group (you can use that storage account for HDInsight)
  4. Uploaded the sample GZip compressed CSV files from the SampleData folder to the Storage Account using Azure Storage Explorer. In my case I uploaded to a container named "dataload" (the name used in the code below).

The work that we will perform will be within the Jupyter Notebook.

From your Azure Portal locate the HDInsight resource, click the Cluster dashboard quick link


Now select the Jupyter Notebook


This will open a new tab/window.

Authenticate as the cluster administrator.

Create a new PySpark Notebook.


Paste the following lines and press Shift+Enter to run the cell.


from pyspark.sql import *
from pyspark.sql.types import *



Now we can import the CSV into a table. You will need to adjust the path to represent your storage account, container and file. The syntax of the storage path is wasb://mycontainer@myaccount.blob.core.windows.net/foldername/filename


# import the COMPRESSED data
csvFile = spark.read.csv('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample_big.csv.gz', header=True, inferSchema=True)
csvFile.write.saveAsTable("salessample_big")


Press Shift+Enter to run the cell

Once complete, you can use Spark SQL to query the table you imported the data into. This creates a dataframe to hold the output, which we will use to write the Parquet file.

dftable = spark.sql("SELECT * FROM salessample_big")
dftable.show()

The final step is to export the dataframe to a Parquet file. We will also use gzip compression.

dftable.write.parquet('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample2.parquet',None, None , "gzip")
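As an optional extra check (my own addition, not shown in the notebook screenshots below), you can read the Parquet output back and compare row counts with the source dataframe:

# Read the Parquet output back and confirm it matches the source row count.
dfcheck = spark.read.parquet('wasb://dataload@mlbigdatastoracc.blob.core.windows.net/SalesSample2.parquet')
print(dfcheck.count(), dftable.count())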

The complete Jupyter Notebook should look like:



In your storage account you should have a Parquet export of the data (note that this format is not a single file, as shown by the file, folder and child files in the following screenshots).





In this example you may notice that the compressed file sizes are not much different, yet the Parquet file is slightly more efficient. Your experience may vary, as it depends on the content within the CSV file.


Some reference material worth checking out if this is something you are working on:

Legal Stuff: The contents of this blog is provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.

Saturday, April 21, 2018

New SqlOps Extensions

Announcing my new SqlOps Extensions

AlwaysOn Insights
https://github.com/Matticusau/sqlops-alwayson-insights/releases

MSSQL Instance Insights
https://github.com/Matticusau/sqlops-mssql-instance-insights/releases

MSSQL Db Insights
https://github.com/Matticusau/sqlops-mssql-db-insights/releases


Previously I had released a single extension, which logically was actually three separate extensions but, due to the methods used at the time, was released as one. With the added support for the Extension Marketplace in the SqlOps March release it made sense to break these out. Going forward this will also help with the life cycle of the extensions, allowing them to be managed individually.

If you want to know how I wrote these extensions, check out the following posts by Kevin Cunnane:

Writing a SQL Operations Studio Extension in 15 minutes
https://medium.com/@kevcunnane/writing-a-sql-operations-studio-extension-in-15-minutes-7dfd24a74dfe

Publishing an extension for SQL Operations Studio
https://medium.com/@kevcunnane/publishing-an-extension-for-sql-operations-studio-f5a5b323c13b