39.1 AKS Cluster Creation with az CLI and Terraform

Alright, let’s get our hands dirty. You’re about to create an AKS cluster, which is essentially you renting a fully-managed Kubernetes control plane from Microsoft. The magic here is that they handle the API server, scheduler, etcd, and all those other finicky control plane components that you really don’t want to get a 3 AM page about. You just manage the worker nodes. It’s a fantastic division of labor.

Now, you’ve got two primary paths to make this happen: the quick-and-dirty az CLI for when you need to test something now, and the sober, responsible Terraform path for when you need something repeatable, version-controlled, and actually sane. We’ll do both. Strap in.

The az CLI Firestarter Method

This is your tactical tool. Perfect for a quick proof-of-concept, a temporary environment, or just reminding yourself how this whole thing works. The command is deceptively simple, which is both a blessing and a curse. Here’s the minimal version that will actually get you a running cluster.

az group create --name myResourceGroup --location eastus
az aks create \
  --resource-group myResourceGroup \
  --name myAwesomeCluster \
  --node-count 2 \
  --node-vm-size Standard_D2s_v3 \
  --generate-ssh-keys

Boom. A few minutes later, you have a cluster. But let’s be honest, this is the equivalent of building a house with just a hammer. It works, but you’re missing about a thousand important details. Why Standard_D2s_v3? Why only two nodes? What about network policy? Azure RBAC? You get the defaults, and frankly, some of Azure’s defaults are…questionable.
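
Before moving on, it’s worth confirming that the cluster actually answers. The sketch below wraps the two usual steps, fetching credentials and listing nodes, in a small function. The name connect_to_cluster is made up for illustration; az aks get-credentials and kubectl get nodes are the real commands doing the work, and the sketch assumes both az and kubectl are installed and you are logged in.

```shell
#!/usr/bin/env bash
# Sketch only: connect_to_cluster is a hypothetical helper name, not an az command.

connect_to_cluster() {
  local resource_group="$1" cluster_name="$2"

  # Merge the cluster's credentials into your local kubeconfig.
  az aks get-credentials \
    --resource-group "$resource_group" \
    --name "$cluster_name" || return 1

  # If the nodes come back listed, the control plane is reachable.
  kubectl get nodes -o wide
}

# Example:
#   connect_to_cluster myResourceGroup myAwesomeCluster
```

If the node list shows two nodes in the Ready state, the minimal command did its job.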

Let’s do it like a pro who has been burned before. This command is longer because sanity is verbose.

# Flag notes live up here because a comment after a trailing backslash breaks the command.
#   --kubernetes-version  Always pin a version; list valid ones with: az aks get-versions --location eastus
#   --node-count          Three is the minimum for a resilient node pool. Two is a trap.
#   --node-vm-size        D2s_v3 is anaemic. D4 is a decent starting point.
#   --node-osdisk-size    Set it explicitly. 30 GiB is the minimum allowed, and your containers will hate it.
#   --max-pods            Azure CNI defaults to 30. Every pod takes a VNet IP, so size the subnet to match.
#   --network-plugin      Azure CNI is the way to go for serious work. kubenet is...simpler.
#   --network-policy      You want network policies. Trust me.
#   --enable-aad          Authenticate through Azure AD. Non-negotiable for real security.
#   --enable-azure-rbac   Authorize through Azure role assignments instead of hand-built Kubernetes roles.
#   --attach-acr          Pull images from this Azure Container Registry seamlessly.
#   --tags                Tag everything. Always.
az aks create \
  --resource-group myResourceGroup \
  --name myLessRiskyCluster \
  --location eastus \
  --kubernetes-version 1.27.3 \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --node-osdisk-size 100 \
  --max-pods 50 \
  --network-plugin azure \
  --network-policy calico \
  --enable-aad \
  --enable-azure-rbac \
  --attach-acr myacr \
  --tags "Environment=Dev" "Owner=my.email@company.com"

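The --enable-aad and --enable-azure-rbac flags are critical, and they do two different jobs. --enable-aad moves authentication from static, boring kubeconfig credentials to Azure Active Directory (now branded Microsoft Entra ID). --enable-azure-rbac moves authorization there too, so you can assign Kubernetes permissions to Azure AD users and groups as ordinary role assignments in Azure’s IAM blade. Access gets granted and revoked in one place instead of in YAML scattered across clusters. Note that the static local admin credentials still exist unless you also pass --disable-local-accounts.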
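
To make that concrete, here is a hedged sketch of granting an Azure AD group read-only access to a single namespace on a cluster that has Azure RBAC enabled. The function name grant_namespace_reader and its arguments are invented for illustration. The commands it calls (az aks show, az role assignment create) and the built-in role name "Azure Kubernetes Service RBAC Reader" are real, but check the role list in your own tenant before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: grant_namespace_reader is a hypothetical helper name.

grant_namespace_reader() {
  local resource_group="$1" cluster_name="$2" group_object_id="$3" namespace="$4"
  local cluster_id

  # Look up the cluster's full Azure resource ID.
  cluster_id="$(az aks show \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --query id --output tsv)" || return 1

  # Scope the assignment to one namespace by appending it to the cluster ID.
  az role assignment create \
    --assignee "$group_object_id" \
    --role "Azure Kubernetes Service RBAC Reader" \
    --scope "${cluster_id}/namespaces/${namespace}"
}

# Example (the object ID is a placeholder):
#   grant_namespace_reader myResourceGroup myLessRiskyCluster <group-object-id> dev
```

Dropping the /namespaces/... suffix from the scope would apply the role to the whole cluster, which is rarely what you want for a read-only group.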

The Terraform “Grown-Up” Method

The az cli command is a one-night stand. Terraform is a marriage. It’s declarative, version-controlled, and ensures your infrastructure is reproducible. Here’s a basic but robust main.tf to get you started.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "aks_rg" {
  name     = "rg-aks-prod-001"
  location = "East US"
}

resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-prod-001"
  location            = azurerm_resource_group.aks_rg.location
  resource_group_name = azurerm_resource_group.aks_rg.name
  dns_prefix          = "aksprod001" # Becomes part of the API server's FQDN. Keep it short and recognizable.

  kubernetes_version  = "1.27.3"
  node_resource_group = "rg-aks-prod-nodes-001" # Explicitly name the node RG, or Azure gives it a terrible auto-name.

  default_node_pool {
    name            = "system" # Name it meaningfully.
    node_count      = 3
    vm_size         = "Standard_D4s_v3"
    os_disk_size_gb = 100
    max_pods        = 50
  }

  identity {
    type = "SystemAssigned" # This is the preferred, simplest identity model for the control plane.
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

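  azure_active_directory_role_based_access_control {
    managed                = true # AKS-managed Azure AD integration (the --enable-aad equivalent).
    azure_rbac_enabled     = true # This is the switch that actually enables Azure RBAC.
    admin_group_object_ids = var.aks_admin_group_object_ids # Put your AD group IDs here.
  }

  tags = {
    Environment = "Production"
  }
}

# Hypothetical variable name. Supply Azure AD group object IDs; the object ID of
# whoever runs Terraform is a user or service principal, not a group.
variable "aks_admin_group_object_ids" {
  type        = list(string)
  description = "Object IDs of the Azure AD groups that get cluster-admin access."
}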

# Output the kubeconfig and host for other modules to use
output "kube_config" {
  value     = azurerm_kubernetes_cluster.main.kube_config_raw
  sensitive = true
}

output "host" {
  value     = azurerm_kubernetes_cluster.main.kube_config[0].host
  sensitive = true # kube_config is marked sensitive in azurerm 3.x, so outputs derived from it must be too.
}

The sheer beauty of this is that you can run terraform apply, go get a coffee, and come back to a perfectly formed cluster. Even better, when you need to change the node count or VM size, you change the code and apply again. No more guessing what commands you ran six months ago.
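
The change-and-apply loop described above can be sketched as a three-step wrapper. The function name apply_cluster_change and the default plan file name aks.tfplan are assumptions for illustration; terraform init, plan, and apply are the real subcommands. Saving the plan to a file means the thing you reviewed is exactly the thing that gets applied.

```shell
#!/usr/bin/env bash
# Sketch only: apply_cluster_change is a hypothetical helper name.

apply_cluster_change() {
  local plan_file="${1:-aks.tfplan}"

  # Download providers and set up the working directory.
  terraform init -input=false || return 1

  # Write the proposed changes to a plan file so they can be reviewed first.
  terraform plan -input=false -out="$plan_file" || return 1

  # Apply exactly the saved plan, nothing more.
  terraform apply -input=false "$plan_file"
}

# Example: edit node_count in main.tf, then run
#   apply_cluster_change
```

In a real pipeline you would likely pause between plan and apply for a human review; the sketch runs straight through for brevity.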

The Pitfalls They Don’t Tell You About

First, the node resource group. When you create an AKS cluster, Azure creates a second resource group to hold all the node VMs, disks, VNets, and load balancers. By default it is named MC_<resource-group>_<cluster>_<location>, so the first az command above produces MC_myResourceGroup_myAwesomeCluster_eastus. It’s ugly and easy to overlook. With Terraform, use the node_resource_group argument to give it a sane, predictable name (az aks create has a matching --node-resource-group flag). This is a lifesaver for cost-tracking and cleanup.
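
As a small illustration of that naming scheme, the sketch below builds the default name from its three parts and, separately, asks Azure what the node resource group is really called. Both function names are invented for this example; the az aks show query on nodeResourceGroup is the real lookup and is the one to trust if you set a custom name.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers.

# Build the default node resource group name: MC_<resource-group>_<cluster>_<location>.
default_node_rg_name() {
  local resource_group="$1" cluster_name="$2" location="$3"
  printf 'MC_%s_%s_%s\n' "$resource_group" "$cluster_name" "$location"
}

# Ask Azure for the actual name (works for default and custom names alike).
actual_node_rg_name() {
  local resource_group="$1" cluster_name="$2"
  az aks show \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --query nodeResourceGroup --output tsv
}

# Example:
#   default_node_rg_name myResourceGroup myAwesomeCluster eastus
#   actual_node_rg_name  myResourceGroup myAwesomeCluster
```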

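Second, identity. The AKS control plane (the managed part) needs permissions to create resources like load balancers and managed disks in that node resource group. Older versions of az aks create handled this with a Service Principal, which is a pain to manage (password rotation, anyone?). Current versions default to a system-assigned Managed Identity, and passing --enable-managed-identity makes that choice explicit. (The --generate-ssh-keys flag has nothing to do with any of this; it only creates the SSH key pair used for node access.) The Terraform example uses a SystemAssigned Managed Identity as well, which is Microsoft’s modern, superior alternative. The platform handles the credentials automatically. Always choose Managed Identity.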
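
If you have inherited a cluster and don’t know which identity model it uses, a check along these lines may help. The function names are hypothetical. The sketch assumes that az aks show reports identity.type as SystemAssigned or UserAssigned for managed-identity clusters and as something else (empty or None) for service-principal clusters, so verify that against your CLI version before scripting around it.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, and the interpretation of
# identity.type for service-principal clusters is an assumption.

cluster_identity_type() {
  local resource_group="$1" cluster_name="$2"
  az aks show \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --query identity.type --output tsv
}

# Succeeds (exit status 0) only when the cluster reports a managed identity.
uses_managed_identity() {
  local identity_type
  identity_type="$(cluster_identity_type "$1" "$2")" || return 1
  case "$identity_type" in
    SystemAssigned|UserAssigned) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   uses_managed_identity myResourceGroup myAwesomeCluster && echo "managed identity"
```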

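Finally, cleanup. With the az CLI, deleting the cluster (or the main resource group that contains it) also removes the node resource group, but it’s still worth checking that nothing was left behind. With Terraform, it’s a single terraform destroy. But the real pro move? Adding a lifecycle block with prevent_destroy = true to your production cluster resource. It lives in the Terraform configuration, not in state, and it makes Terraform refuse any plan that would destroy the cluster until someone deliberately removes the setting, so some intern doesn’t accidentally nuke it with a stray command. Consider it the emergency brake for your most critical infrastructure.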
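
For the az side of cleanup, here is a rough sketch that deletes a cluster and then confirms the node resource group is really gone. The function name delete_cluster_and_verify is invented for illustration; az aks show, az aks delete, and az group exists are real commands. It assumes az group exists prints the literal string true or false, so confirm that on your CLI version.

```shell
#!/usr/bin/env bash
# Sketch only: delete_cluster_and_verify is a hypothetical helper name.

delete_cluster_and_verify() {
  local resource_group="$1" cluster_name="$2" node_resource_group

  # Record the node resource group name before the cluster disappears.
  node_resource_group="$(az aks show \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --query nodeResourceGroup --output tsv)" || return 1

  # Delete the cluster; --yes skips the confirmation prompt.
  az aks delete \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --yes || return 1

  # Fail loudly if the node resource group survived the deletion.
  if [ "$(az group exists --name "$node_resource_group")" = "true" ]; then
    echo "node resource group $node_resource_group still exists" >&2
    return 1
  fi
}

# Example:
#   delete_cluster_and_verify myResourceGroup myAwesomeCluster
```

This removes only the cluster. The main resource group and anything else in it (a registry, for instance) stay put until you delete them separately.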
