In the previous post, we looked at the different Azure Data Factory components. In this post, we’re going to tie everything together and start making things happen. Woohoo! First, we will get familiar with our demo datasets. Then, we will create our Azure Storage Accounts that we will copy data into. Finally, we will start copying data using the Copy Data Wizard.
First, let’s get familiar with the demo datasets we will be using. I don’t know about you, but I’m a teeny tiny bit tired of the AdventureWorks demos. (I don’t even own a bike…) WideWorldImporters is at least a little more interesting. (Yay, IT joke mugs and chocolate frogs!) But! Let’s use something that’s not already in relational database format.
Let me present… *drumroll* 🥁
LEGO Data from Rebrickable
LEGO! Yay! I love LEGO. Rebrickable is an online service that will show you which LEGO sets you can build from the sets and parts you already own. Fun! :)
They also have a database of all official LEGO sets and parts (including themes and colors) that you can download for free as CSV files or JSON files.
The CSV files are automatically generated at the start of each month and can be found on rebrickable.com/downloads. The available CSV files are:
The JSON files can be downloaded through their API. This is free to use, but requires an account and an API key to use.
Alternative: Movie and TV Data from IMDb
Movies! Yay! I also love movies. If you prefer working with movie or TV data, you can get a subset of the IMDb data from datasets.imdbws.com. These files are refreshed daily, and you can find the descriptions and details on imdb.com/interfaces. Make sure you read the terms and conditions for usage. I will (most likely) not use these datasets in my posts, but they are fun to experiment with.
Create Azure Storage Accounts
Before we can start copying data, we need to create the Azure Storage accounts to copy data into. We’re going to create two accounts:
- Azure Blob Storage (General Purpose v2)
- Azure Data Lake Storage Gen2 (General Purpose v2 with hierarchical namespaces enabled)
The Azure Data Lake Storage Gen2 account will be used for data storage, while the Azure Blob Storage account will be used for logging errors.
To create both storage accounts, go to the Azure Portal and navigate to the Create Storage Account page. Choose the same resource group and location you used while creating your Azure Data Factory. Choose a name for each storage account, I chose cathrinewblob and cathrinewadls. Change replication to locally-redundant storage (LRS) to keep the cost down for these demos. When creating the Azure Data Lake Storage Gen2 account, make sure you click on the Advanced tab and enable hierarchical namespace.
The only difference between the two storage accounts is the hierarchical namespace:
Download and Install Azure Storage Explorer
I also recommend downloading and installing Azure Storage Explorer. It’s completely free! You can view the content of your storage accounts in the browser, but I much prefer using Azure Storage Explorer.
Oh, and, it looks nice :)
Ok. Finally! Are you ready to copy some data?
Copy Data Wizard
In this example, we are going to copy the themes.csv file from Rebrickable into a blob container called lego in our Azure Data Lake Storage Gen2 account.
From the Azure Data Factory Home page, click copy data:
This opens the Copy Data Wizard. Let’s walk through each step!
On the Properties page, give the pipeline a name and description. Keep the default “run once now” option:
Click next to move on to the Source properties.
On the Source page, we will first create a new linked service to Rebrickable, then create a new dataset to represent the themes.csv file.
Click create new connection:
Search and select the HTTP Linked Service:
Give the linked service a name and description, and use the base URL cdn.rebrickable.com/media/downloads/. (You can find this URL by inspecting the links on rebrickable.com/downloads. Keep the last slash.) Change authentication type to anonymous. Click create:
The linked service has now been created, yay! Make sure it’s selected and click next to move on to the dataset properties:
Since we specified the base URL in the Linked Service, we only have to specify the file name themes.csv.gz in the relative URL. Keep the other default options. Click next:
This next part feels kind of like magic, especially if you have been working with SQL Server Integration Services (SSIS) in the past. The Copy Data Wizard now inspects the file and tries to figure out the file format for us. But… since we are working with a gzipped file, it doesn’t make a whole lot of sense yet…
Let’s fix that! Change the compression type to gzip. Tadaaa! Magic! Without us doing anything else manually, the copy data wizard unzips the CSV file for us and shows us a preview of the content:
If you are working with a raw CSV file, the copy data wizard can detect the file format, the delimiter, and even that we have headers in the first row. But since we are working with a gzipped file, we have to configure these settings manually. Choose first row as header:
If the headers are not detected correctly on the first attempt, try clicking detect text format again:
You can now preview the schema inside the gzipped file. Beautiful! :D
Click next to move on to the Destination properties.
On the Destination page, we will first create a new linked service to our Azure Data Lake Storage Gen2 account, then create a new dataset to represent the themes.csv file in the destination.
Click create new connection:
Select the Azure Data Lake Storage Gen2 linked service:
Give the linked service a name and description. Select your storage account name from the dropdown list. Test the connection. Click create:
The second linked service has now been created, yay! Make sure it’s selected, and click next to move on to the dataset properties:
Specify lego as the folder path, and themes.csv as the file name. Keep the other default options. Click next:
Enable add header to file and keep the other default options:
Click next to move on to the Settings.
On the Settings page, we will configure the fault tolerance settings. This is another part that feels like magic. By changing a setting, we can enable automatic handling and logging of rows with errors. Whaaat! :D In SQL Server Integration Services (SSIS), this had to be handled manually. In Azure Data Factory, you literally just enable it and specify the settings. MAGIC! :D
Change the fault tolerance settings to skip and log incompatible rows:
At this time, error logging can only be done to Azure Blob Storage. Aha! So that’s why we created two storage accounts earlier ;) Click new:
The Copy Data Wizard is even smart enough to figure out that it needs to create an Azure Blob Storage connection. Good Copy Data Wizard :D Give the linked service a name and description. Select your storage account name from the dropdown list. Test the connection. Click create:
Specify lego/errors/themes as the folder path:
Click next to move on to the Summary.
On the Summary page, you will see a pretty graphic illustrating that you are copying data from an HTTP source to an Azure Data Lake Storage Gen2 destination:
Click next to move on to Deployment.
The final step, Deployment, will create the datasets and pipeline. Since we chose the “run once now” setting in the Properties step, the pipeline will be executed immediately after deployment:
Once the deployment is complete, we can open the pipeline on the Author page, or view the execution on the Monitor page. Click monitor:
Success! ✔🥳 Our pipeline executed successfully.
We can now open Azure Storage Explorer and verify that the file has been copied from Rebrickable:
In this post, we first explored the demo datasets that we used as our source. Then, we created our Azure Storage accounts for storing data and logging errors. Finally, we used the Copy Data Wizard to download a gzipped CSV file from our demo datasets, unzip it, and load the CSV file into our storage account.
The Copy Data Wizard created all the factory resources for us: one pipeline with a copy data activity, two datasets, and two linked services. This guided experience is a great way to get started with Azure Data Factory.
Next, we will go through each of these factory resources in more detail, and look at how to create them from the Author page instead of through the Copy Data Wizard. First, let’s look at pipelines!