Whether you are changing naming conventions, migrating to a new tenant or handing over development resources to a customer, there are several scenarios where you may want or need to create a new data factory to develop in, or change the source of your data factory content to a new repo. Azure Data Factory’s git integration makes it easy for you to manage all the content without the headache.
You’re using git integrated mode, right?#
If you’re already set up with git integration, jump to the next section.
Azure Data Factory operates in one of two modes: Data Factory mode and git-integrated mode, which can use either GitHub or Azure DevOps.
If you haven’t set your data factory to git-integrated mode yet, do it now. Here’s why:
- The most significant benefit to using git-integrated mode in Azure Data Factory is the ability to save your work without publishing it (which requires it to be in a working, validated state).
- Git-integrated mode gives you the collaboration benefits of git source control. Multiple developers can create pipelines in different branches without impacting each other.
- You have clear gates for code review by having to do a pull request into the master branch before publishing.
- When you publish, ADF generates an ARM template of your data factory in git (in the adf_publish branch), which can be used to deploy to other data factories (in a CI/CD process, perhaps).
Microsoft summarise this well here: Advantages of Git integration
Setting up git integration in your data factory#
- On your data factory home page there’s a big “Set up code repository” button. Click that, or set up the git integration from the Manage page on the left-hand menu blade.
- You now need to provide the info for the GitHub or Azure DevOps account that you want to use. You’ll have the choice to create a new repository or use an existing one.
It’s usually permissions that will trip you up here if your Azure resources and git provider are quite locked down.
Alternatively, you can create a git-integrated data factory from scratch using PowerShell. Here’s an example using GitHub:
New-AzDataFactoryV2 -ResourceGroupName 'resourcegroupname' -Name 'datafactoryName' -Location 'yourregion' `
-HostName 'https://github.com' -AccountName 'githubaccountname' -RepositoryName 'chosenreponame' -CollaborationBranch 'main' -RootFolder '/optional subfolder'
Moving a git repo to another Data Factory#
Let’s run through the steps to take your git repo to a new Azure Data Factory.
One thing to consider here is that you could operate with more than one data factory connected to a single git repo (that doesn’t mean you should; see the Things that DON’T work section below for the caveats). The publish process creates a separate folder for each data factory in the adf_publish branch, so you should be able to publish from different data factories under the same branch, but I can’t see any distinct benefits or use cases for doing this. Also, running from the same collaboration branch is asking for merge conflicts! Please let me know in the comments if you have a scenario for this though.
- Similar to setting up a new data factory with git integration, attach your account for either GitHub or Azure DevOps and select Use existing under the Git repository name. I’m using Azure DevOps for this example.
- If you used a sub-folder in your repository, be sure to add it into the Root folder textbox and uncheck the Import existing Data Factory resources to repository checkbox. Standard process is to set master/main as your collaboration branch unless you’ve set it up differently with your original data factory.
You now have two data factories connected to the same git repo. It’s that easy.
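If you’d rather script this than click through the portal, the same connection can be sketched in PowerShell. This is a minimal sketch, assuming the Az.DataFactory module is installed, you’re signed in with Connect-AzAccount, and the repo lives in Azure DevOps. Every name below is a hypothetical placeholder. It uses Set-AzDataFactoryV2, which creates the factory or updates an existing one (and may ask you to confirm an overwrite).

```powershell
# Hypothetical placeholder values; replace them with your own.
$params = @{
    ResourceGroupName   = 'resourcegroupname'
    Name                = 'newdatafactoryname'
    Location            = 'yourregion'
    AccountName         = 'devopsorganisationname'  # Azure DevOps organisation
    ProjectName         = 'devopsprojectname'       # Azure DevOps project holding the repo
    RepositoryName      = 'existingreponame'        # the repo the original factory uses
    CollaborationBranch = 'main'
    RootFolder          = '/'                       # or the sub-folder used by the original factory
}

# Create (or update) the data factory with the existing repo attached.
Set-AzDataFactoryV2 @params
```

Keep the collaboration branch and root folder identical to the original data factory’s settings, otherwise the new factory won’t find the existing resources.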
The next step is to remove the git integration from the original data factory. That’s also the first step when changing the git provider, account or repo you use, so I’ll cover it in the Changing Git Repo on a Data Factory section below.
Things that DON’T work#
There are a few components that will need some attention in your new data factory before you complete the switch over.
- Self-hosted integration runtimes will need to be recreated. There is no option to re-point or even share from the original data factory (though that wouldn’t be a good idea anyway).
- If you are using Azure Key Vault for securing your data source credentials and connection strings, you’ll need to add the new data factory to your key vault’s Access Policy and test this out.
- Depending on the other linked services you’ve implemented, you should test them all to ensure no further config updates are needed.
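For the Key Vault item above, here is a minimal PowerShell sketch of granting the new data factory access. It assumes the vault uses the access policy permission model (not Azure RBAC), that the new factory has a system-assigned managed identity, and that the Az.DataFactory and Az.KeyVault modules are installed. The resource names are hypothetical placeholders.

```powershell
# Hypothetical placeholder names; replace them with your own.
$resourceGroup = 'resourcegroupname'
$factoryName   = 'newdatafactoryname'
$vaultName     = 'keyvaultname'

# Look up the new data factory so we can read its managed identity.
$factory = Get-AzDataFactoryV2 -ResourceGroupName $resourceGroup -Name $factoryName

# Allow that identity to read secrets from the vault.
Set-AzKeyVaultAccessPolicy -VaultName $vaultName `
    -ObjectId $factory.Identity.PrincipalId `
    -PermissionsToSecrets Get, List
```

Afterwards, use Test connection on each Key Vault-backed linked service in the new data factory to confirm the secrets resolve.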
Cleanup#
If you’ve published using your original data factory there will be a folder named for this data factory in the adf_publish branch of your git repo. When you publish from your new data factory, it will create a new folder so there will be no conflict here. You may want to remove the old folder to prevent any conflicts with CI/CD processes or any other automations.
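Removing the old folder is plain git work. Here is a minimal sketch, assuming you run it from a local clone of the repo with a remote named origin; the folder name is a hypothetical placeholder for your original data factory’s name.

```powershell
# Get the latest published state of the adf_publish branch.
git fetch origin
git checkout adf_publish
git pull origin adf_publish

# 'originaldatafactoryname' is a hypothetical placeholder for the old factory's folder.
git rm -r 'originaldatafactoryname'
git commit -m 'Remove publish folder for the original data factory'
git push origin adf_publish
```

Check that no release pipeline still points at the old folder before you push.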
If publishing fails because Data Factory tries to update resources that don’t exist in the adf_publish branch, follow these troubleshooting steps to essentially re-attach the repo and merge changes safely.
Changing Git Repo on a Data Factory#
As I mentioned in the previous section, the last step there is the first step in changing the repo where your data factory content is stored.
- Start by hitting Disconnect from the Manage page on the left-hand menu blade under the Git configuration header.
- You’ll get a menu blade pop out from the right with a big angry message warning you about this. It’s not entirely justified as everything is in source control. If you “accidentally” removed the git integration, you can just add it back and nothing is lost.
It’s warning you that anything that isn’t published won’t be hanging around.
- Once you type the data factory name in, it removes the integration and you are left with only the published resources.
- There’s no waiting about. You can click right into Set up code repository from the same screen. Again, provide the repository type, your account info, and whether you want a new or existing repo. Crucially, leave the Import existing resources checkbox ticked.
- We can now bring the existing resources into the collaboration branch (usually master or main), create a new branch or use another, existing one.
Microsoft has detailed these same steps on the page I linked to earlier: Switch to a different Git repository
It’s interesting to note that this method could be used to merge the pipelines and content from two different data factories together. You can do that by using separate branches and merging them together in git, though you may have to work through some conflicts, so bear that in mind.
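As a rough sketch of that merge, assuming the second data factory’s resources were imported into their own branch of the same repo: the branch names below are hypothetical, with main as the collaboration branch.

```powershell
# Start from an up-to-date collaboration branch.
git fetch origin
git checkout main
git pull origin main

# Merge on a working branch so conflicts can be resolved before a pull request.
# 'combine-factories' and 'factory-b-import' are hypothetical branch names.
git checkout -b combine-factories
git merge origin/factory-b-import

# If git reports conflicts, this lists the files that need resolving.
git status
```

Resources that share a name across the two factories (a linked service called the same thing in both, for example) are the most likely source of conflicts, so review those files by hand before raising the pull request.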
I hope this has shown that what could be a very stressful process is actually very straightforward. I’d love to hear in the comments about any interesting scenarios where you’ve needed to do this.
Thanks, Craig