
Copy Data Activity in Azure Data Factory

This post is part 6 of 25 in the series Beginner's Guide to Azure Data Factory

In the previous post, we went through Azure Data Factory pipelines in more detail. In this post, we will dig into the copy data activity. How does it work? How do you configure the settings? And how can you optimize performance while keeping costs down?

Copy Data Activity

The copy data activity is the core (*) activity in Azure Data Factory.

(* Cathrine’s opinion 🤓)

You can copy data to and from more than 80 Software-as-a-Service (SaaS) applications (such as Dynamics 365 and Salesforce), on-premises data stores (such as SQL Server and Oracle), and cloud data stores (such as Azure SQL Database and Amazon S3). During copying, you can define and map columns implicitly or explicitly, convert file formats, and even zip and unzip files – all in one task.

Yeah. It’s powerful :) But how does it really work?

Copy Data Activity Overview

The copy data activity properties are divided into six parts:

Screenshot of the copy data activity with the properties pane, highlighting the different property pages / tabs

General

First things first! Let’s change the name to something more descriptive than the random “Copy_blc” that was auto-generated in the copy data wizard:

Screenshot of the general properties of the copy data activity

Aaaaah, that’s better! Now I can focus 😂 Alright. Where were we?

In addition to the name and description, you can change the copy data activity policies for activity timeout, retry attempts, retry interval, and whether or not to log input and output details. (You probably don’t want to log anything containing sensitive data.) You can hover over the little information icons for more details:

Screenshot of hovering over the timeout policy of the copy data activity, explaining the format of the value

These policies are available for the execution activities, meaning most activities except the control activities.

…wait. Policies?

Yeah :) Let me show you! Click the view source code button on the copy data activity:

Screenshot of the copy data activity, highlighting the view source code button

This opens up the JSON code view for the activity. Here, you can see that these properties are grouped under “policy”:

Screenshot of the copy data activity policies

The fact that they are called policies isn’t really important. However, I wanted to show you an example of things you can find and discover in the JSON code that might not be obvious from the graphical user interface. It may be worth peeking at the code when learning Azure Data Factory to understand things even better.

(Or not. Maybe I’m just weird :D)
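For reference, here’s roughly what that policy section looks like in the JSON. The values below are just examples, but the property names are what you’ll see: the timeout (in the d.hh:mm:ss format), the number of retries, the retry interval in seconds, and the secure input and output flags:

"policy": {
    "timeout": "7.00:00:00",
    "retry": 0,
    "retryIntervalInSeconds": 30,
    "secureInput": false,
    "secureOutput": false
}

Flipping secureInput or secureOutput to true is how you keep sensitive values out of the logged input and output details.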

Source and Sink

It’s easy to explain the concept of the source and sink properties: they’re where you specify the datasets you want to copy data from and to. But it’s nearly impossible for me to go through the properties in detail. Why? Because the properties completely depend on the type of dataset and data store you are using.

Here’s an example of four different datasets used as a source:

Screenshot of the copy data activity source properties for a CSV file
Screenshot of the copy data activity source properties for a JSON file
Screenshot of the copy data activity source properties for an Azure SQL Database table
Screenshot of the copy data activity source properties for an Amazon Redshift table

And here’s an example of four different datasets used as a sink:

Screenshot of the copy data activity sink properties for a CSV file
Screenshot of the copy data activity sink properties for a JSON file
Screenshot of the copy data activity sink properties for an Azure SQL Database table
Screenshot of the copy data activity sink properties for Common Data Service for Apps

And with over 80 connectors… phew! I don’t even know the details of all the connectors. More importantly, you don’t want me trying to explain ALL THE THINGS. I’m already rambling enough as it is :D

So! Here’s what I do, and what I recommend you do: Bookmark the Azure Data Factory connector overview. When you use a new dataset or data store, reference that overview, and read up on the details. Then go click and type things in the copy data activity until you configure it the way you need to for your solution. Easy! ;)
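Just to give you a feel for it, here’s a simplified sketch of what the source and sink can look like in the JSON when copying from a CSV file in blob storage into an Azure SQL Database table. Treat it as an illustration only, since the exact properties depend entirely on your datasets and data stores:

"typeProperties": {
    "source": {
        "type": "DelimitedTextSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true
        },
        "formatSettings": {
            "type": "DelimitedTextReadSettings"
        }
    },
    "sink": {
        "type": "AzureSqlSink",
        "writeBatchSize": 10000
    }
}

Swap either dataset for a different connector, and this section changes completely. That’s exactly why the connector overview is worth bookmarking :)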

Mapping

In the copy data activity, you can map columns from the source to the sink implicitly or explicitly.

Implicit mapping is the default. If you leave the mappings empty, Azure Data Factory will do its best to map columns by column names:

Screenshot of implicit mapping in the copy data activity

Explicit mapping is when you decide how to map columns from the source to the sink. You have to do this if the column names in the source and sink don’t match:

Screenshot of explicit mapping in the copy data activity

You can also remove mappings, for example if you are copying data to a text file and only want to keep a subset of the columns from the source. Hover over the mapping, and click the delete button that appears next to it:

Screenshot of deleting an explicit mapping in the copy data activity

You can also preview the source dataset:

Screenshot of explicit mapping in the copy data activity, highlighting the preview button

If you have deleted a mapping, the preview will only show you the remaining columns:

Screenshot of the previewing source data

You have full control and flexibility when mapping columns from the source to the sink. I still recommend using implicit mapping whenever possible, because you will have one less thing to configure and maintain.

(Saving clicks and time, yay! :D)
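If you do use explicit mapping and then peek at the JSON code, you’ll find it as a translator section on the copy data activity. The column names below are made up, but the structure looks roughly like this:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "customer_id" },
            "sink": { "name": "CustomerID" }
        },
        {
            "source": { "name": "customer_name" },
            "sink": { "name": "CustomerName" }
        }
    ]
}

Delete the whole translator section (or leave the mapping tab empty), and you’re back to implicit mapping.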

Settings

You can adjust the copy data activity settings for data integration units, degree of copy parallelism, fault tolerance, and staging:

Screenshot of the copy data activity settings

Data Integration Units (DIUs)

Azure Data Factory runs on hardware managed by Microsoft. You can’t configure this hardware directly, but you can specify the number of Data Integration Units (DIUs) you want the copy data activity to use:

Screenshot of the data integration units setting

One Data Integration Unit (DIU) represents some combination of CPU, memory, and network resource allocation. I have absolutely no idea what 1 DIU actually is, but it doesn’t really matter. What matters is that the more DIUs you specify, the more power you throw at the copy data activity.

And the more power you throw at the copy data activity, the more you pay for it. (Or like my 9-year old niece would say: 💪 = 💰)

But what number of DIUs is Auto, the default value? Maybe the information button can help:

Screenshot of the data integration unit setting, highlighting the description

“Value can be 2-256. When you choose ‘Auto’, this Data Factory dynamically applies the optimal DIU setting based on your source-sink pair and data pattern.”

So… if the DIU value is set to Auto, the number of DIUs used starts at 2? Nope. If we click the information button, we are redirected to the official documentation:

Screenshot of the documentation, highlighting that the default number of DIUs start at 4

If you leave the Data Integration Unit value set to Auto, the copy data activity uses a minimum of 4 DIUs.

For larger datasets, this might be optimal! But for smaller datasets, you might be wasting money by leaving the setting at Auto. This is why I recommend starting with 2 DIUs, and then scaling up if and when necessary :)

Degree of Copy Parallelism

The degree of copy parallelism value specifies the maximum number of connections that can read from your source or write to your sink in parallel:

Screenshot of the degree of copy parallelism setting

In most cases, I tweak the DIUs, but leave this setting at Auto and let Azure Data Factory decide how to chunk up and copy my data :) If I change this value, I usually set it explicitly to 1 to reduce the load on my sources. This could have a huge effect on your solution, though, so make sure you read the documentation and do some testing!
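Both of these settings end up in the typeProperties section of the copy data activity JSON. When you leave them at Auto, the properties are simply left out of the JSON; set them explicitly, and you’ll see something like this sketch (the source and sink here are just placeholders):

"typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" },
    "dataIntegrationUnits": 2,
    "parallelCopies": 1
}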

Fault Tolerance

In our example, we set the fault tolerance to “skip and log incompatible rows” in the copy data wizard:

Screenshot of the fault tolerance settings

You can also set this to “abort activity on first incompatible row” or “skip incompatible row”. Choose the one that is most appropriate for your use case.
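In the JSON, fault tolerance boils down to a couple of properties on the copy data activity. A sketch of the “skip and log incompatible rows” option looks roughly like this (the linked service name and path are made up):

"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
    "linkedServiceName": {
        "referenceName": "MyBlobStorage",
        "type": "LinkedServiceReference"
    },
    "path": "errorlogs"
}

“Skip incompatible row” is the same thing without the redirect settings, and “abort activity on first incompatible row” is what you get when skipping isn’t enabled at all.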

Staging

Alright, this is another interesting feature. One-click staging? Whaaaaat!

Screenshot of the staging settings

You can enable staging to temporarily store data in blob storage before loading it into the final destination. This is mainly used for optimal loading into Azure Synapse Analytics (formerly Azure SQL Data Warehouse) via PolyBase.

Now, there are two things you need to be aware of here. One, you only store the staging data temporarily. All the staging data is deleted once the copy data activity finishes. If you want to keep the staging data, you need to build your own solution. And two, behind the scenes, this works like using two copy data activities.

You may save time not having to configure a second copy data activity yourself, but you will still pay for that second execution :)
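Staging is also just a pair of properties in the JSON. Roughly like this, with a made-up linked service name and path:

"enableStaging": true,
"stagingSettings": {
    "linkedServiceName": {
        "referenceName": "MyStagingBlobStorage",
        "type": "LinkedServiceReference"
    },
    "path": "staging",
    "enableCompression": false
}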

Copy Data Process

When you copy binary files and other files as-is, the copy data process is fast and simple, like copying files on your computer. You take one file, and copy it into a different location:

Illustration of copying a simple file from source to sink

However, the copy data activity is powerful. You are not limited to copying files as-is :) The copy data activity can do some pretty cool things during copying:

Illustration of copying a complex file from source to sink

Serialization and deserialization can be explained as converting file formats during copying. For example, we can use a CSV file as a source and copy it into an Azure SQL Database destination. To do that, we need to read the source data in CSV format and transfer it as bits and bytes over the network (serialization), then convert that stream of bits and bytes to a SQL format so we can load it into our database (deserialization).

Similarly, compression and decompression can be explained as zipping and unzipping files during copying. You don’t need to first copy the file and then zip or unzip it. You just specify that the source is zipped and the destination is unzipped, or vice versa.
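We’ll dig into datasets properly in the next post, but as a sneak peek, here’s a simplified (and made-up) example of a DelimitedText dataset describing a gzipped CSV file in blob storage. Point the copy data activity source at a dataset like this, and the decompression happens for you during the copy:

{
    "name": "SourceCsvGzipped",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobStorage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "orders.csv.gz"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}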

The cool thing is, you don’t even need to know that these things happen! (Well, I mean, except when you’re reading my post and I call it out because I’m a total geek and I find this interesting…) You don’t need to know what these processes are called or how they work. All you need to do is tell the copy data activity what you want to achieve, and it does the heavy lifting for you. Yay!

Summary

In this post, we looked at the copy data activity, how it works, and how to configure its settings. We also called out a few things that affect the cost, like the number of Data Integration Units (DIUs) and enabling staging.

In the next post, we’re going to look more at our source and sink datasets!

🤓
