Ingest and Explore LEGO Datasets using Pipelines and Serverless SQL in Azure Synapse Analytics

Woman standing next to a projector showing the Azure Synapse Analytics logo.

In the April session of the Azure Synapse Analytics and Microsoft MVP series, I got to show one of my favorite demos: ingesting and exploring LEGO datasets! 🤓 In just a few minutes, I grab a compressed file from the web, decompress it, store it as a Parquet file in my data lake, and run SQL queries on it using Serverless SQL. It really is that simple!

If you want to try this demo yourself, you can follow the detailed steps below. Have fun! 😃

Prerequisites

Azure Subscription: If you don’t already have an Azure Subscription, you can create a free account on azure.microsoft.com/free. Some of the Azure services will always be free, while some are free for the first 12 months. You get $200 worth of credits that last 30 days so you can test and learn the paid Azure services.

Azure Synapse Analytics workspace: You can follow Microsoft’s Quickstart or Microsoft’s other Quickstart 😅 While creating the Azure Synapse Analytics workspace, you will also create an Azure Data Lake Storage Gen2 Account. In my video, I had created a separate Data Lake for my demo data. However, in the steps below, I will keep it simple and just use the primary storage account.

LEGO Datasets from Rebrickable

LEGO! Yay! I love LEGO. Rebrickable will show you which LEGO sets you can build from the sets and parts you already own.

They also have a database of all official LEGO sets and parts (including themes and colors) that you can download for free from rebrickable.com/downloads. The database consists of multiple compressed CSV files which are updated daily:

  • colors.csv.gz
  • elements.csv.gz
  • inventories.csv.gz
  • inventory_parts.csv.gz
  • inventory_sets.csv.gz
  • inventory_minifigs.csv.gz
  • minifigs.csv.gz
  • part_categories.csv.gz
  • part_relationships.csv.gz
  • parts.csv.gz
  • sets.csv.gz
  • themes.csv.gz

Ingesting LEGO Datasets using Pipelines in Azure Synapse Analytics

In this example, we are going to ingest the sets.csv.gz file from Rebrickable into a container called raw in our Azure Data Lake Storage Account.

From the Azure Synapse Analytics Home page, click Ingest:

Screenshot of the Home page in Azure Synapse Analytics with the Ingest task highlighted.

This opens the Copy Data Tool. Let’s walk through each step!

1. Properties

On the Properties page, choose the built-in copy task. This will create a single pipeline. (The other option, metadata-driven copy task, was released in July 2021. It will create a full, dynamic, metadata-driven framework.) Keep the default run once now option:

Screenshot of the Copy Data Tool step 1, the Properties page.

Click next to move on to the Source properties.

2. Source

On the Source Dataset page, choose HTTP from the source type dropdown, then click new connection:

Screenshot of the Copy Data Tool step 2a, the Source Dataset page.

Give the linked service a name and description, and use the base URL https://cdn.rebrickable.com/media/downloads/. (You can find this URL by inspecting the links on rebrickable.com/downloads. Keep the last slash.) Change authentication type to anonymous. Click create:

Screenshot of the New Connection pane with the properties filled out.

Since we specified the base URL in the linked service, we only have to specify the file name sets.csv.gz in the relative URL. Keep the other default options. Click next:

Screenshot of the Copy Data Tool step 2a, the Source Dataset page with the properties filled out.

On the Source Configuration page, we have to specify the file format settings. If we immediately click on preview data, nothing really makes sense…

Screenshot of the Copy Data Tool step 2b, the Source Configuration page with the default file format settings.

Screenshot of the Copy Data Tool step 2b, the Source Configuration page, previewing the data.

Let’s fix that! Change the compression type to gzip (.gz), and set the compression level to optimal. Click detect text format:

Screenshot of the Copy Data Tool step 2b, the Source Configuration page with the correct file format settings configured.

Verify that the first row as header option gets selected automatically. Then click preview data again. Tadaaa! Magic! We can now preview the data inside the gzipped file without having to download or unzip it manually! Amazing 🤩

Screenshot of the Copy Data Tool step 2b, the Source Configuration page with the correct file format settings configured.

Screenshot of the Copy Data Tool step 2b, the Source Configuration page, previewing the data with the correct settings configured.

Click next to move on to the Target properties.

3. Target

On the Target / Destination Dataset page, choose your WorkspaceDefaultStorage connection:

Screenshot of the Copy Data Tool step 3a, the Target Dataset page.

Specify raw/rebrickable as the folder path, and sets.parquet as the file name. Keep the other default options. Click next:

Screenshot of the Copy Data Tool step 3b, the target configuration page.

Verify that the file format was automatically set to Parquet with snappy compression after you specified .parquet as the file name in the previous step:

Screenshot of the Copy Data Tool step 3b, the target configuration page showing the file format settings.

Click next to move on to the Settings.

4. Settings

On the Settings page, give the task a better name, for example Copy_Rebrickable_Sets:

Screenshot of the Copy Data Tool step 4, the settings page, configured to skip and log incompatible rows.

Click next to move on to Review and Finish.

5. Review and Finish

On the Review page, you will see a pretty graphic illustrating that you are copying data from an HTTP source to an Azure Data Lake Storage Gen2 target:

Screenshot of the Copy Data Tool step 5, the Review page.

The datasets have been given generic names with random suffixes. Click on edit next to each of them and give them more appropriate names. Remember to click save:

Screenshot of the Copy Data Tool step 5, the Review page while editing the dataset names.

⚠️ Make sure that everything looks ok before you click next, because once you move on to Deployment, things will happen!

The final step, Deployment, will create the datasets and pipeline. Since we chose the “run once now” setting in the first Properties step, the pipeline will be executed immediately after deployment:

Screenshot of the Copy Data Tool step 5, the Deployment page, with the deployment in progress.

Once the deployment is complete, click finish:

Screenshot of the Copy Data Tool step 5, the Deployment page, with the deployment completed.

Let’s explore our newly ingested data! 😃

Exploring LEGO Datasets using Serverless SQL in Azure Synapse Analytics

From the Home page, click on the database icon to open the Data hub:

Screenshot of the Home page in Azure Synapse Analytics with the Data hub menu option highlighted.

In the Data hub, click on linked, and then navigate down to the new raw container inside your primary storage account:

Screenshot of the Data hub in Azure Synapse Analytics showing the expanded Data Lake hierarchy.

When you click on a container, you will open the storage explorer view. Double-click on the new rebrickable folder:

Screenshot of the Azure Synapse Analytics storage explorer showing the contents of the raw container.

Right-click on the new sets.parquet file, choose new SQL script and click on select top 100 rows:

Screenshot of the file menu in the Azure Synapse Analytics storage explorer.

Querying the LEGO Datasets

This will open a new Serverless SQL tab with the OPENROWSET code auto-generated, so you don’t have to remember the syntax yourself:

Screenshot of the SELECT TOP 100 * FROM OPENROWSET query.
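The generated query will look something like this (a sketch — the storage account name is a placeholder, and your container and folder path will match what you chose in the Copy Data Tool):

```sql
-- Auto-generated by "New SQL script > Select TOP 100 rows".
-- Replace <storageaccount> with your primary Data Lake Storage Gen2 account.
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/raw/rebrickable/sets.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
```

Serverless SQL infers the schema directly from the Parquet file, so there is no table to create first.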

Execute the query by clicking on run, or by pressing Shift+Enter or F5. You can also click on the properties button in the top right corner to close the properties pane:

Screenshot of the Serverless SQL tab with the Run and Properties buttons highlighted.

You are now viewing the content of the sets.parquet file as if it were any other table in a database! 🤩

Screenshot of the Serverless SQL results.

And like with any other table in a database, you can query it exactly the way you want, for example by selecting specific columns and adding WHERE clauses:

Screenshot of the modified Serverless SQL query and results.
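As a hypothetical example (assuming the same placeholder storage account path, and using columns from the Rebrickable sets dataset), you could list the largest sets from a single year:

```sql
-- Only return selected columns, filter on year, and sort by set size.
SELECT
    set_num,
    name,
    year,
    num_parts
FROM
    OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/raw/rebrickable/sets.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
WHERE
    year = 2020
ORDER BY
    num_parts DESC
```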

Summary

In this post, we walked through one of my favorite demos: ingesting and exploring LEGO datasets. It really only takes minutes to grab a compressed file from the web, decompress it, store it as a Parquet file in my data lake, and run SQL queries on it using Serverless SQL. I love it! 🤓

About the Author

Professional headshot of Cathrine Wilhelmsen.

Cathrine Wilhelmsen is a Microsoft Data Platform MVP, BimlHero Certified Expert, international speaker, author, blogger, organizer, and chronic volunteer. She loves data and coding, as well as teaching and sharing knowledge - oh, and sci-fi, coffee, chocolate, and cats 🤓