Design an Azure Data Platform that InfoSec will love

Reference architectures are great! You’ve got all of the key components in there, nice and clear, with colourful lines showing how data moves through each stage, product, or service. Great for a slide deck or a proposal to move off that creaking old data warehouse and into a shiny new Data Lakehouse.

Not so great, however, for the finer details demanded by security operations teams.

An example Data Lakehouse Platform

Diagrams like the one above are perfect for demonstrating a technology stack and summarising a platform, but what about network security? What about all the public endpoints and standard ports sitting there in the PUBLIC cloud?

I’m not aiming for clickbait here. In reality, the vast majority of Azure-based resources and services are protected through Azure Active Directory (AAD) role-based access control (RBAC), on top of the ability to set up IP whitelisting. That is just fine for many organisations and use cases, BUT it won’t put every security team at ease.

For the rest, there are many more security layers and features we can apply. I will walk through how Data Lake Storage and Azure SQL differ from Databricks and how Data Factory should be secured, and also pick out the features that are just going to cause you pain. Let’s start with the foundations for all of those services.

Networking

Thinking about the architecture above, one of the most common ways to secure these resources is by wrapping them all up in a virtual network (VNet). This gives us the ability to control what traffic comes into our network and what traffic (if any) goes out of it. That VNet can then be peered to other VNets and your internal network to facilitate connectivity, with subnets inside for specific resource types.
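To make the subnet idea a bit more concrete, here is a minimal sketch that carves a VNet address space into one subnet per resource type and checks whether a second VNet could be peered to it. It uses only Python’s standard `ipaddress` module. The address ranges and subnet names are hypothetical, made up for illustration, so substitute whatever your network team has allocated.

```python
import ipaddress

# Hypothetical address space and subnet names, for illustration only.
VNET_CIDR = "10.10.0.0/16"
SUBNET_NAMES = ["data-lake", "sql", "databricks", "build-agents"]


def plan_subnets(vnet_cidr, names, new_prefix=24):
    """Give each name the next free /new_prefix block inside the VNet range."""
    vnet = ipaddress.ip_network(vnet_cidr)
    return dict(zip(names, vnet.subnets(new_prefix=new_prefix)))


def peering_conflict(cidr_a, cidr_b):
    """Return True if two address spaces overlap.

    Azure does not allow peering between VNets with overlapping address spaces.
    """
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


plan = plan_subnets(VNET_CIDR, SUBNET_NAMES)
for name, block in plan.items():
    print(f"{name:15} {block}")

# A second, equally hypothetical VNet we might want to peer with.
print(peering_conflict(VNET_CIDR, "10.20.0.0/16"))
```

Planning the ranges up front like this is cheap, and it avoids finding out at peering time that two VNets were given the same address space.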

Now, I’m not a network engineer and exact configurations will always differ across organisations, so your implementation may vary. Microsoft’s Azure Networking Architectures is a good place to start if this is all new to you.

That’s going to look a little like this if we are building out a diagram for our secure Azure Data Platform:

A VNet and subnet example

With this baseline, we’ve now inadvertently started restricting the tools we can work with and how we can deploy some resources, and this is the biggest risk to that initial reference architecture.

Once you start digging deeper and securing the platform to meet your organisation’s security policies, some features start to go away and others need much more complex implementations.

Difficult DevOps

The simplicity of using Azure DevOps for deployments is one of the reasons I’ve rarely ventured away from it. Sure, it’s frustrating getting YAML files right, but just hitting run and letting Microsoft worry about the build agent makes up for that.

That happy path isn’t possible when working with deployments inside VNets, as described in the Networking section on this Microsoft Docs page. It’s possible to stick with a Microsoft-hosted build agent, but you’re left opening and updating IP ranges every week! A self-hosted build agent becomes a much simpler approach, which means we’d need to provision one or more virtual machines within our VNet to use as build agents. That’s additional cost, resources, and administration. It’s not an insurmountable task, but it’s an oft-overlooked one.
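To give a feel for that weekly chore, here is a rough sketch that compares two snapshots of published IP ranges and reports which prefixes would need adding to, or removing from, a firewall allow-list. I’m assuming the snapshots follow the layout of Microsoft’s weekly Azure IP Ranges and Service Tags download as I understand it (a `values` list whose entries carry a `name` and `properties.addressPrefixes`). The tag name and every prefix below are made up for illustration, so check the real file and the documentation for your region before relying on any of this.

```python
def prefixes_for(service_tags, tag_name):
    """Return the address prefixes listed under one service tag as a set."""
    for entry in service_tags["values"]:
        if entry["name"] == tag_name:
            return set(entry["properties"]["addressPrefixes"])
    return set()


def diff_prefixes(previous, current, tag_name):
    """Compare two snapshots and return (prefixes to add, prefixes to remove)."""
    old = prefixes_for(previous, tag_name)
    new = prefixes_for(current, tag_name)
    return sorted(new - old), sorted(old - new)


# Made-up snapshots using documentation-only IP ranges. In practice these
# would be loaded from last week's and this week's downloaded JSON files.
last_week = {
    "values": [
        {
            "name": "AzureCloud.uksouth",
            "properties": {"addressPrefixes": ["192.0.2.0/24", "203.0.113.0/24"]},
        }
    ]
}
this_week = {
    "values": [
        {
            "name": "AzureCloud.uksouth",
            "properties": {"addressPrefixes": ["198.51.100.0/24", "203.0.113.0/24"]},
        }
    ]
}

to_add, to_remove = diff_prefixes(last_week, this_week, "AzureCloud.uksouth")
print("Open:", to_add)
print("Close:", to_remove)
```

Even with something like this automated, somebody still has to own the job and push the changes through every week, which is a big part of why the self-hosted agent ends up being the simpler option.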

Platform Resources

In the rest of this series, I’ll look at all of the resources common to our shiny new Data Lakehouse platform architecture and what you need to think about to get each one past your security team.
