
Source Control in Azure Data Factory

This post is part 18 of 25 in the series Beginner's Guide to Azure Data Factory

Raise your hand if you have wondered why you can only publish and not save anything in Azure Data Factory 🙋🏼‍♀️ Wouldn’t it be nice if you could save work in progress? Well, you can. You just need to set up source control first! In this post, we will look at why you should use source control, how to set it up, and how to use it inside Azure Data Factory.

And yeah, I usually recommend that you set up source control early in your project, and not on day 18… However, it does require some external configuration, and in this series I wanted to get through the Azure Data Factory basics first. But by now, you should know enough to decide whether or not to commit to Azure Data Factory as your data integration tool of choice.

Get it? Commit to Azure Data Factory? Source Control? Commit? 🤓

Ok, that was terrible, I know. But hey, I’ve been writing these posts for 18 days straight now, let me have a few minutes of fun with Wil Wheaton 😂

Aaaaanyway!

Authoring Modes in Azure Data Factory

So far, we’ve been working in the Azure Data Factory mode:

Screenshot of the Azure Data Factory interface, highlighting the Azure Data Factory authoring mode

If we haven’t set up source control yet, we can do that from the authoring mode menu:

Screenshot of the Azure Data Factory interface, highlighting the menu option for setting up a source control code repository

But once we have set up source control, we can switch between the Azure Data Factory mode and the Source Control mode:

Screenshot of the Azure Data Factory interface, highlighting the menu for switching between authoring modes

But what’s the difference between these two modes?

Azure Data Factory Mode

When I compare the two authoring modes, I usually refer to the Azure Data Factory mode as the “production mode”. In this mode, you have to publish to save, and that requires everything to validate first. That’s because when you publish, you deploy your solution from the user interface to the Azure Data Factory service. Or the way I think about it, you deploy “into production”.

Source Control Mode

Just like I refer to the Azure Data Factory mode as “production mode”, I refer to the source control mode as “development mode”. In this mode, you add a step to your development process. First, you save your changes in the source control repository, and then you publish from the source control repository to the Azure Data Factory service.

Saving vs. Publishing

We can illustrate saving and publishing using the Azure Data Factory mode and the source control mode like this:

Illustration showing that you publish directly when using the Azure Data Factory mode, while you first save and then publish using the source control mode

By using source control in Azure Data Factory, you get the option to save your work in progress. This is because all you’re really doing is saving the JSON files behind the scenes to the code repository :)
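To make the "JSON files behind the scenes" a little more concrete, here is a minimal Python sketch of what saving a dataset could amount to. Everything specific in it is an assumption for illustration: the folder layout (one folder per resource type, one file per resource), the dataset name Sets, the linked service name MyStorageAccount, and the simplified JSON properties. What Azure Data Factory actually writes to your code repository depends on your resources.

```python
import json
import tempfile
from pathlib import Path


def save_resource(repo_root: Path, resource_type: str, resource: dict) -> Path:
    """Write one resource definition as <repo_root>/<resource_type>/<name>.json."""
    folder = repo_root / resource_type
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{resource['name']}.json"
    path.write_text(json.dumps(resource, indent=4), encoding="utf-8")
    return path


# Hypothetical, simplified dataset definition (not copied from a real factory).
sets_dataset = {
    "name": "Sets",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyStorageAccount",  # hypothetical name
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"fileName": "sets.csv"},  # simplified
    },
}

# A temporary folder stands in for a local clone of the code repository.
repo_root = Path(tempfile.mkdtemp())
saved_path = save_resource(repo_root, "dataset", sets_dataset)
print(saved_path.relative_to(repo_root).as_posix())  # dataset/Sets.json
```

Nothing is validated or deployed here. A file is written, and that is the whole point: saving is cheap, so work in progress can be saved as often as you like.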

Source Control Options

If you click the set up code repository button, the repository settings pane will open, and you can choose the repository type:

Screenshot of the Azure Data Factory interface, highlighting the repository type in the repository settings pane

You can choose either Azure DevOps Git or GitHub. From here, I will assume that you already have one of these accounts and have the rights to create new projects and repositories :)

Azure DevOps

First, let’s go through how to set up an Azure DevOps code repository, connect our Azure Data Factory to it, and then create, save, and publish a new dataset. I’m assuming that your user has access to both Azure Data Factory and Azure DevOps.

Warning! There be screenshots. Many, many, many screenshots 🤓

Creating an Azure DevOps Code Repository

First, log into Azure DevOps and choose the organization. I have one called cathrinew-devops. Create a new project:

Screenshot of creating a new Azure DevOps project

Go to repos -> files:

Screenshot of navigating to the Azure DevOps code repository

Before we can use the code repository, it needs to be initialized, which means it has to contain at least one file. Create (initialize) the code repository by adding a README file to it:

Screenshot of initializing the Azure DevOps code repository

We now have our empty code repository, ready to go! (I’m going to ignore the friendly TODO instructions on what to add to my README file for now. But in a real project, I would totally listen to the smart advice and add helpful explanations and descriptions 😉)

Screenshot of the empty code repository in Azure DevOps

Connecting to an Azure DevOps Code Repository

Back in Azure Data Factory, click through the settings and specify the Azure DevOps account, project name, and git repository name. I always use master as the collaboration branch, and keep / as the root folder. Then, add the existing pipelines, datasets, and so on to the code repository by checking import existing Azure Data Factory resources to the collaboration branch:
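If it helps to see the settings from the pane written down in one place, here is a small Python sketch that collects them in a data class. The class and its field names are hypothetical (they are not an Azure Data Factory API), and the project and repository names are made up. Only the account name, the master collaboration branch, and the / root folder come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositorySettings:
    """Values entered in the repository settings pane (field names are hypothetical)."""

    repository_type: str  # "Azure DevOps Git" or "GitHub"
    account: str
    repository: str
    project: str = ""  # only used for Azure DevOps in this sketch
    collaboration_branch: str = "master"
    root_folder: str = "/"
    import_existing_resources: bool = True

    def __post_init__(self) -> None:
        # Simple sanity checks, assumed for illustration only.
        if self.repository_type not in ("Azure DevOps Git", "GitHub"):
            raise ValueError(f"unknown repository type: {self.repository_type}")
        if self.repository_type == "Azure DevOps Git" and not self.project:
            raise ValueError("Azure DevOps needs a project name")
        if not self.root_folder.startswith("/"):
            raise ValueError("root folder should start with '/'")


settings = RepositorySettings(
    repository_type="Azure DevOps Git",
    account="cathrinew-devops",
    project="my-project",  # hypothetical
    repository="my-repository",  # hypothetical
)
print(settings.collaboration_branch, settings.root_folder)  # master /
```

In this sketch, a GitHub connection would use the same fields, just without a project name.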

Screenshot of connecting to an Azure DevOps code repository in Azure Data Factory

From now on, whenever you open Azure Data Factory, you will have to choose a branch to work in. Notice the new save all button, YAY! 🥳

Screenshot of the new source control buttons on the toolbar

If we switch back to Azure DevOps and refresh the code repository, we will see all the imported Azure Data Factory resources:

Screenshot of the imported Azure Data Factory resources in the Azure DevOps code repository

We don’t want to work directly in the master branch, though. Let’s create a new branch:

Screenshot of the create new branch button

I like to name my branches after the feature I’m working on. In this example, I want to create a dataset for the sets.csv file:

Screenshot of the new branch pane

After we have created our new dataset, we can save all or save the dataset:

Screenshot of the new save all and save buttons on the toolbar

Woop woop! Saved!

Screenshot of a saved dataset

But if we try to publish, we will be told that we can only publish from master, the collaboration branch. This is a good thing! This ensures that everything has to be working in master before we can publish to the Azure Data Factory service:

Screenshot of warning that you can only publish from the master branch
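The rule can be sketched as a toy model in Python. This is not how Azure Data Factory is implemented. The ToyFactory class, its methods, and the "every resource needs a type" validation rule are all made up, just to show the idea: saving works in any branch without validation, while publishing only works from the collaboration branch and only when everything validates.

```python
class ToyFactory:
    """Toy model of saving to branches and publishing from the collaboration branch."""

    def __init__(self, collaboration_branch: str = "master") -> None:
        self.collaboration_branch = collaboration_branch
        self.branches = {collaboration_branch: {}}
        self.published = {}

    def create_branch(self, name: str) -> None:
        # A new branch starts as a copy of the collaboration branch.
        self.branches[name] = dict(self.branches[self.collaboration_branch])

    def save(self, branch: str, name: str, definition: dict) -> None:
        # Saving stores work in progress as-is; nothing is validated here.
        self.branches[branch][name] = definition

    def merge(self, source: str) -> None:
        # Stand-in for completing a pull request into the collaboration branch.
        self.branches[self.collaboration_branch].update(self.branches[source])

    def publish(self, branch: str) -> None:
        if branch != self.collaboration_branch:
            raise PermissionError(f"can only publish from '{self.collaboration_branch}'")
        resources = self.branches[branch]
        # Hypothetical validation rule: every resource needs a "type".
        invalid = [name for name, definition in resources.items() if "type" not in definition]
        if invalid:
            raise ValueError(f"cannot publish, invalid resources: {invalid}")
        self.published = dict(resources)


factory = ToyFactory()
factory.create_branch("sets")
factory.save("sets", "Sets", {"type": "DelimitedText", "fileName": "sets.csv"})

try:
    factory.publish("sets")  # not the collaboration branch
    blocked = False
except PermissionError:
    blocked = True

factory.merge("sets")  # the pull request
factory.publish("master")
print(blocked, sorted(factory.published))  # True ['Sets']
```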

Creating an Azure DevOps Pull Request

When you click on merge the changes to master, you will be taken back to Azure DevOps, where you can create a new pull request. This will merge the changes from the sets branch into master:

Screenshot of a pull request in Azure DevOps

Once the pull request has been created, you can complete it. Ideally, you want someone else to review and complete it, but let’s just pretend you’re a coworker for now :)

Screenshot of the pull request in Azure DevOps, highlighting the complete button

If you are done with developing the feature, you can also choose to delete the branch:

Screenshot of the complete pull request screen in Azure DevOps, highlighting the delete branch option

Tadaaa! We have completed our first pull request:

Screenshot of a completed pull request in Azure DevOps

When we switch back to Azure Data Factory, we will be asked to choose a working branch, since the sets branch was deleted. Let’s choose master:

Screenshot of choosing a new branch in Azure Data Factory

Publishing from Master

Now, we can publish:

Screenshot of publishing from the collaboration branch in Azure Data Factory

But! And this is cool :) Instead of just publishing, we can now see what is getting published, and whether it’s new, edited, or deleted:

Screenshot of the pending changes pane in Azure Data Factory
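Conceptually, the pending changes are a comparison between what is currently published and what is in the collaboration branch. Here is a minimal Python sketch of such a comparison. It is an illustration under that assumption, not a description of what Azure Data Factory does internally, and the resource names and definitions are made up.

```python
def pending_changes(published: dict, collaboration: dict) -> dict:
    """Classify resources as new, edited, or deleted relative to what is published."""
    new = sorted(collaboration.keys() - published.keys())
    deleted = sorted(published.keys() - collaboration.keys())
    edited = sorted(
        name
        for name in collaboration.keys() & published.keys()
        if collaboration[name] != published[name]
    )
    return {"new": new, "edited": edited, "deleted": deleted}


# Hypothetical resource definitions, keyed by resource name.
published = {
    "Pipeline1": {"version": 1},
    "Themes": {"fileName": "themes.csv"},
    "OldDataset": {"fileName": "old.csv"},
}
collaboration = {
    "Pipeline1": {"version": 2},           # changed since the last publish
    "Themes": {"fileName": "themes.csv"},  # unchanged
    "Sets": {"fileName": "sets.csv"},      # added
}

print(pending_changes(published, collaboration))
# {'new': ['Sets'], 'edited': ['Pipeline1'], 'deleted': ['OldDataset']}
```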

Tadaaa! We have published from the collaboration branch:

Screenshot of successfully publishing from the collaboration branch in Azure Data Factory

But… what if we prefer working with GitHub? Or what if we want to change the code repository? We can easily do that :)

GitHub

Next, let’s go through how to set up a GitHub code repository, connect our Azure Data Factory to it, and then create, save, and publish another dataset. I’m assuming that your user already has a GitHub account.

Creating a GitHub Code Repository

First, log into GitHub and create a new code repository:

We now have our empty code repository, ready to go!

Disconnecting from an Existing Code Repository

Next, we need to disconnect from the Azure DevOps code repository. If you are starting from scratch with GitHub, you can skip this part :) Go to the Home page and click on Git repo settings:

Screenshot of the Azure Data Factory home page, highlighting the Git repo settings button

Click remove Git:

Screenshot of the existing repository settings page, highlighting the remove Git button

Always read the warnings! :) Your Azure DevOps code repository will not be deleted, but you should publish all changes from it before disconnecting. Type the name of your Azure Data Factory and click Confirm:

Screenshot of the existing repository settings, highlighting the warning message shown when removing an existing repository

Connecting to a GitHub Code Repository

Click set up code repository:

Screenshot of the Azure Data Factory home page, highlighting the set up code repository button

Choose GitHub and then log into your GitHub account:

Screenshot of the repository settings asking to log into GitHub

Specify the GitHub account and git repository name. I always use master as the collaboration branch, and keep / as the root folder. Then, add the existing pipelines, datasets, and so on to the code repository by checking import existing Azure Data Factory resources to the collaboration branch:

Screenshot of the repository settings, connecting to GitHub

We are now connected to the GitHub code repository, woohoo!

Screenshot of the Azure Data Factory interface, highlighting the GitHub toolbar

If we switch back to GitHub and refresh the code repository, we will see all the imported Azure Data Factory resources:

Screenshot of GitHub, showing the imported Azure Data Factory resources

Let’s create another new branch:

Screenshot of creating a new branch

After we have created the new dataset, we can create a pull request:

Screenshot of the create new pull request button

Creating a GitHub Pull Request

When you create a new pull request, you will be taken back to GitHub. This will merge the changes from the colors branch into master. Compare the changes, and click create pull request:

Screenshot of GitHub, highlighting the create pull request button

Review the pull request, and click create pull request:

Screenshot of GitHub, highlighting the create pull request button

Once the pull request has been created, you can merge it. Ideally, you want someone else to review it first, but let’s just pretend you’re a coworker for now :)

Screenshot of GitHub, highlighting the merge pull request button

You can delete the branch as well:

Screenshot of GitHub, highlighting the delete branch button

Back in Azure Data Factory, you can do the whole publishing loop again :)

Summary

In this post, we looked at why you should use source control, how to set up source control using Azure DevOps and GitHub, and how to use it inside Azure Data Factory.

When we set up source control, you may have noticed another new thing pop up in the interface…

Screenshot of the Azure Data Factory interface, highlighting the templates menu under factory resources

Guess what we will look at in the next post? Yep! Templates!

🤓

About the Author

Cathrine Wilhelmsen is a Microsoft Data Platform MVP, BimlHero Certified Expert, Microsoft Certified Solutions Expert, international speaker, author, blogger, and chronic volunteer who loves teaching and sharing knowledge. She works as a Senior Business Intelligence Consultant at Inmeta, focusing on Azure Data and the Microsoft Data Platform. She loves sci-fi, chocolate, coffee, craft beers, ciders, cat gifs and smilies :)