Curated documentation updates, feature announcements, community blogs, release highlights, and more.
May was a packed month for AKS — headlined by a critical kernel CVE, a major shift away from Ingress NGINX, and strong momentum in GPU/AI observability. The platform continues to mature in networking, upgrade safety, and multi-cluster operations. This edition also features a full KubeCon EU 2026 Azure Pre-Day recap.
Let's dive in.
Application routing add-on with Gateway API – now generally available: The application routing add-on using the Kubernetes Gateway API is now GA. This is the strategic replacement for the legacy Ingress NGINX-based routing in AKS and aligns with the upstream retirement of the Ingress NGINX project. Teams should plan migration paths from Ingress to Gateway API.
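For teams sizing up that migration, the mapping from an Ingress rule to a Gateway API HTTPRoute can be sketched as below. This is a minimal illustration, not the add-on's own tooling: the function name `ingress_to_httproutes` and the Gateway name `shared-gateway` are hypothetical, and the sketch only handles host rules with Service backends on numeric ports. TLS settings and controller-specific annotations need separate handling.

```python
# Sketch only: converts a minimal networking.k8s.io/v1 Ingress (as a dict)
# into Gateway API HTTPRoute dicts, one route per Ingress rule.
# Assumption: ImplementationSpecific path types are treated as PathPrefix.
PATH_TYPES = {"Prefix": "PathPrefix", "Exact": "Exact"}


def ingress_to_httproutes(ingress, gateway_name):
    """Return a list of HTTPRoute manifests attached to the named Gateway."""
    meta = ingress["metadata"]
    routes = []
    for index, rule in enumerate(ingress["spec"].get("rules", [])):
        route_rules = []
        for path in rule.get("http", {}).get("paths", []):
            service = path["backend"]["service"]
            route_rules.append({
                "matches": [{"path": {
                    "type": PATH_TYPES.get(path.get("pathType"), "PathPrefix"),
                    "value": path.get("path", "/"),
                }}],
                # HTTPRoute backendRefs take a numeric port; named Service
                # ports would have to be resolved first (not handled here).
                "backendRefs": [{
                    "name": service["name"],
                    "port": service["port"]["number"],
                }],
            })
        spec = {"parentRefs": [{"name": gateway_name}], "rules": route_rules}
        if "host" in rule:
            spec["hostnames"] = [rule["host"]]
        routes.append({
            "apiVersion": "gateway.networking.k8s.io/v1",
            "kind": "HTTPRoute",
            "metadata": {
                "name": f"{meta['name']}-{index}",
                "namespace": meta.get("namespace", "default"),
            },
            "spec": spec,
        })
    return routes


example_ingress = {
    "metadata": {"name": "web", "namespace": "default"},
    "spec": {"rules": [{
        "host": "app.example.com",
        "http": {"paths": [{
            "path": "/api",
            "pathType": "Prefix",
            "backend": {"service": {"name": "api-svc", "port": {"number": 8080}}},
        }]},
    }]},
}

# "shared-gateway" is a hypothetical Gateway name used for illustration.
routes = ingress_to_httproutes(example_ingress, gateway_name="shared-gateway")
```

Emitting one HTTPRoute per Ingress rule keeps each hostname paired with its own paths, which avoids accidentally exposing a path on a host it was never served from.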
AKS GitHub issue tracking – now generally available: The consolidated AKS issue tracking on GitHub is now the official channel for bug reports and feature requests. This provides a single, transparent backlog that platform teams can follow and upvote.
Application Gateway for Containers managed add-on + AKS Automatic: Application Gateway for Containers is now available as a managed add-on integrated with AKS Automatic clusters. This gives teams a turnkey L7 ingress solution with native Azure integration, without needing to manage the ALB controller lifecycle separately.
NAT Gateway V2: The next generation of NAT Gateway enters preview for AKS clusters using managed outbound. It improves scalability and reliability for SNAT-heavy workloads and addresses known limitations in the original NAT Gateway architecture.
AKS List Available VM SKUs API: A new API allows querying which VM SKUs are available for node pool creation in a given region. This is useful for platform automation that needs to validate capacity and VM family availability before provisioning.
AKS-managed GPU metrics: GPU metrics (utilization, memory, temperature) are now collected by default in Azure Managed Prometheus when GPU node pools are present. This removes a long-standing operational gap where teams had to deploy custom exporters to get GPU observability.
Capacity Based Surge for node pool upgrades: A new surge strategy adapts to available capacity during rolling upgrades. Instead of using a fixed surge count, AKS maximizes the use of available quota, which improves upgrade speed in constrained environments.
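The difference between a fixed surge count and a capacity-driven one can be illustrated with a toy model. This is not the AKS algorithm, which the announcement does not describe. The function names, the round-based upgrade model, and the idea of expressing spare quota as a node count are all assumptions made for the arithmetic.

```python
import math


def fixed_surge(pool_size, max_surge_percent):
    """Surge nodes under a fixed percentage setting.

    Assumption: the percentage is rounded up to a whole node, minimum one.
    """
    return max(1, math.ceil(pool_size * max_surge_percent / 100))


def capacity_based_surge(pool_size, spare_quota_nodes):
    """Toy model: surge with as many nodes as spare quota allows, capped at pool size."""
    return min(pool_size, spare_quota_nodes)


def upgrade_rounds(pool_size, surge_nodes):
    """Number of surge-and-drain rounds needed to replace every node in the pool."""
    if surge_nodes < 1:
        raise ValueError("at least one surge node is required to make progress")
    return math.ceil(pool_size / surge_nodes)


# Hypothetical 50-node pool with room for 20 extra nodes in the subscription quota.
rounds_fixed = upgrade_rounds(50, fixed_surge(50, 10))
rounds_capacity = upgrade_rounds(50, capacity_based_surge(50, 20))
```

Under these assumptions the fixed 10% setting needs ten rounds while the capacity-driven surge needs three. The same model also shows the constrained case: if only two spare nodes fit in quota, a capacity-driven surge can still proceed with two, whereas a fixed request for five would not fit.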
AKS Automatic defaults to Gateway API: AKS Automatic clusters will now be preconfigured with the Kubernetes Gateway API via the application routing add-on, replacing Managed NGINX Ingress as the default. This is a direct consequence of the upstream Ingress NGINX retirement and affects all new AKS Automatic clusters.
Azure Kubernetes Application Network (preview): A new unified networking layer is being introduced for Kubernetes workloads on Azure. This consolidates ingress, egress, and service-to-service networking under a single umbrella, signaling a longer-term convergence of networking primitives.
CVE-2026-31431 – kernel local privilege escalation: A critical kernel vulnerability ("Copy Fail") allows any pod — including non-root pods with no special capabilities — to escalate to root on the underlying node. All AKS Linux nodes are affected until a node image upgrade or DaemonSet mitigation is applied. Immediate action required.
Ingress NGINX upstream retirement: Kubernetes SIG Network and the Security Response Committee announced the retirement of the Ingress NGINX project, with maintenance ending in March 2026. AKS teams still using Ingress NGINX should migrate to Gateway API or Application Gateway for Containers.
Long Term Support scope updated: The LTS documentation was refreshed to clarify version coverage and support timelines. Teams relying on LTS versions should review the updated lifecycle boundaries.
Best Practices for Cost Optimization: This guide was refreshed with current recommendations for AKS cost optimization, including guidance on AKS Automatic's built-in cost controls, FinOps practices, spot node pools, and autoscaler tuning. Directly relevant for teams reviewing cluster spend.
Configure rolling upgrades for node pools: Updated with the new Capacity Based Surge option and clearer guidance on drain timeout, soak time, and max-unavailable settings. Critical reading for teams managing large node pools with strict SLA requirements.
Application routing add-on with Gateway API: Refreshed to reflect GA status and include migration guidance from legacy Ingress NGINX. This is the primary reference for teams adopting Gateway API on AKS.
GPU Node Partitioning Strategies: Updated to cover AKS-managed MIG (multi-instance GPU) and time-slicing with MPS using the NVIDIA GPU Operator. Important for teams running inference or training workloads that need to share expensive GPU hardware efficiently.
Confidential Containers overview: Refreshed with current deployment guidance and security model details for running workloads inside encrypted, attestation-backed enclaves on AKS. Relevant for regulated industries handling sensitive data.
Upgrade Options and Recommendations: A new consolidated guide covering in-place upgrades, blue-green strategies, and scenario-based recommendations for common upgrade challenges. Helps teams pick the right upgrade approach for their risk profile.
Node Auto-Provisioning (NAP): Updated with current CLI and ARM template examples for enabling and disabling NAP. Important for teams using Karpenter-based provisioning at scale.
Advanced Container Networking Services overview: Refreshed to reflect the GA of metrics filtering and log aggregation features. This is the umbrella documentation for ACNS capabilities including Container Network Observability and Security.
Migrate from in-tree storage to CSI drivers: Updated migration guidance for moving from deprecated in-tree storage drivers to CSI. In-tree drivers are slated for removal in upcoming Kubernetes versions, so clusters that still use them need to migrate.
List Available VM SKUs (Preview): New documentation for the VM SKU availability API, explaining how to query region-specific SKU constraints before provisioning node pools programmatically.
Istio service mesh add-on support policy: Refreshed to clarify supported Istio versions, upgrade expectations, and the boundary between managed and self-managed Istio components on AKS.
Node Images: Updated with current node OS image options including Ubuntu 24.04, Azure Linux, and Flatcar. Helps teams understand the tradeoffs between OS choices.
Confidential VMs (CVM): Updated guidance on creating CVM-backed node pools for workloads requiring hardware-level memory encryption and attestation.
Vulnerability Data API: New documentation for the AKS Vulnerability Data API, which provides programmatic access to CVE impact data for security reviews, compliance reporting, and upgrade planning.
Virtual Machines Node Pools: Updated to explain how to mix multiple VM types of the same family in a single node pool, improving flexibility for heterogeneous workloads.
Cluster Autoscaler: Refreshed with current best practices for autoscaler profiles, scale-down behavior, and integration with pod disruption budgets.
Upgrade the Control Plane: Updated with current guidance on control plane upgrade sequencing, API compatibility, and rollback options.
Supported Kubernetes Versions: Updated version table reflecting current GA and preview Kubernetes releases available in AKS, including LTS eligibility.
Long-Term Support Versions: Refreshed to reflect current LTS version coverage and extended support timelines.
Entra ID Authentication for the Control Plane: Updated integration guide for using Microsoft Entra ID as the identity provider for Kubernetes API server authentication. Important for teams enforcing centralized identity and conditional access policies.
Azure Key Vault Provider for Secrets Store CSI Driver: Refreshed with current configuration examples and best practices for injecting secrets from Key Vault into pods without application-level SDK changes.
System Node Pools: Updated with current sizing guidance and taint configuration to properly isolate system workloads from application pods.
Ingress Networking Concepts: Refreshed to reflect the Ingress NGINX deprecation and the shift toward Gateway API as the recommended ingress pattern.
Network Policies: Updated with current guidance on Azure Network Policy Manager and Cilium-based network policies for securing pod-to-pod traffic.
Update Azure CNI IPAM Mode and Data Plane: Updated guidance for migrating existing clusters to newer Azure CNI IPAM modes (overlay, dynamic IP allocation) and data plane technologies.
CNI Networking Concepts: Refreshed to reflect current CNI options, including Azure CNI Overlay and Azure CNI with dynamic IP, as well as the kubenet deprecation path.
AKS Core Concepts: Refreshed foundational documentation covering cluster architecture, node pools, networking, and identity fundamentals.
Apply Copy Fail and DirtyFrag CVE mitigations at-scale using Azure Kubernetes Fleet Manager: This post demonstrates how to use Azure Kubernetes Fleet Manager to safely roll out mitigations for CVE-2026-31431 ("Copy Fail") and CVE-2026-43284 / CVE-2026-43500 ("DirtyFrag") across multiple AKS clusters. The vulnerability allows container-to-root escalation and impacts all AKS Linux nodes. Practical guidance covers both node image upgrades and a DaemonSet-based mitigation for clusters that cannot upgrade immediately.
Announcing Public Preview of Argo CD extension in AKS Azure Portal Experience: Argo CD is now available as a managed extension directly in the AKS Azure Portal. This brings GitOps deployment visibility into the portal experience without needing separate Argo CD dashboards, reducing context-switching for teams operating through the Azure console.
Metrics Filtering and Log Aggregation Now GA for Advanced Container Networking Services: ACNS now offers GA-level metrics filtering and log aggregation for container networking. This allows operators to reduce noise in networking telemetry and focus on actionable signals, directly improving incident response for network-related issues.
Building a Controllable Inference Platform on Kubernetes with AI Runway: AI Runway provides a practical framework for teams operating inference workloads on AKS. It bridges the gap between "calling an external model API" and "operating an enterprise-grade inference platform" with guardrails, routing, and cost controls built on Kubernetes primitives.
Getting Started with OpenSearch on AKS with AKS AVM and Helm: A practical walkthrough for deploying OpenSearch on AKS using Azure Verified Modules (AVM) for infrastructure and Helm for the application layer. Useful for teams needing a self-hosted search and analytics platform with full control over data residency.
Six Coding Agents, One Production System: A Field Guide to AgenticOps with AKS-Lab-GitHubCopilot: This post explores running multiple AI coding agents against a shared production-like AKS environment. It introduces "AgenticOps" patterns for sandboxing, observability, and safe deployment when AI agents are generating and deploying code.
Decoupling Memory from Startup Time in AKS Sandbox Pods: A deep dive into Kata Containers memory management on AKS, showing how to decouple memory allocation from pod startup. This is particularly relevant for teams running sandbox pods where cold start latency and memory overhead are in tension.
NVIDIA Dynamo on AKS – Autoscaling LLM Inference: Demonstrates using NVIDIA Dynamo's disaggregated inference engine on AKS with GPU-aware autoscaling. Addresses the challenge of scaling LLM serving workloads where traditional HPA metrics don't capture GPU saturation or request queuing depth effectively.
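To see why utilization-style metrics can mislead an autoscaler for LLM serving, it helps to run the standard Horizontal Pod Autoscaler formula, desired = ceil(current replicas × current value / target value), against two different signals. The sketch below is illustrative only: the metric values are made up, and it does not reflect how NVIDIA Dynamo or the post's setup computes scaling decisions.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Standard HPA calculation for a single metric.

    No change is made when the ratio is within the tolerance band around 1.0
    (0.1 is the Kubernetes default).
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical readings for a 4-replica inference deployment.
by_cpu = desired_replicas(4, current_value=40, target_value=60)    # CPU % per pod
by_queue = desired_replicas(4, current_value=30, target_value=10)  # queued requests per pod

# When several metrics are configured, HPA uses the largest proposal.
final = max(by_cpu, by_queue)
```

With these numbers the CPU signal proposes scaling down to three replicas while the queue-depth signal proposes twelve, which is the gap that GPU-aware or queue-aware metrics are meant to close.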
Introducing cert-manager for Azure Arc-enabled Kubernetes: now in Public Preview: cert-manager is now available as a managed extension for Arc-enabled Kubernetes, including AKS Edge. It ships with security defaults — TLS enforcement, least-privilege RBAC, and restricted pod security — making certificate lifecycle management simpler and more secure for hybrid environments.
Giving the Copilot SDK Agent a "hardware-level helmet" using Kata microVM on AKS: Shows how to use Kata Containers (microVM isolation) on AKS to sandbox AI agents that execute untrusted code. Demonstrates that hardware-level isolation adds minimal overhead while preventing container escape scenarios in agent-driven workloads.
Azure Arc AKS Explained: Run Kubernetes Beyond Azure Cloud: An overview of running AKS on Azure Arc for hybrid and multi-cloud scenarios. Covers the deployment model, management plane integration, and when Arc-based AKS makes sense versus standard AKS or self-managed Kubernetes.
Distributing model weights to your AI cluster: a faster pre-flight on AKS and Slurm: Explains techniques for distributing large model weight files across GPU nodes on AKS and Slurm clusters. Addresses the cold-start problem where model loading time dominates inference pod startup, especially for 70B+ parameter models.
High Availability Testing for Azure Kubernetes Service in a Single Region with Availability Zones: Provides a methodology for validating AKS high availability behavior under simulated zone failures. Covers test scenarios, expected behaviors, and common pitfalls when designing single-region HA architectures with availability zones.
Troubleshooting AKS Scaling Issues with Jameson Hearn: A practical walkthrough of diagnosing scaling failures in AKS — covering cluster autoscaler misconfigurations, pending pods, and quota-related issues. Useful for on-call engineers troubleshooting node scaling.
Cross-Cluster Networking for Azure Kubernetes Fleet Manager: Explains how Fleet Manager enables cross-cluster networking, allowing services in different AKS clusters to communicate seamlessly. Relevant for teams operating multi-cluster architectures that need east-west traffic across cluster boundaries.
AKS Community Call – March 2026: Monthly community call with product team updates, roadmap signals, and live Q&A. Covers networking and observability topics from this period.
AKS Community Call – April 2026: Monthly community call featuring discussions on Gateway API GA, Ingress NGINX retirement timeline, and security updates.
Simplify AKS Monitoring with Datadog – Cloud Native Partner Showcase: A partner showcase demonstrating Datadog's AKS integration for monitoring, troubleshooting, and cost optimization. Shows how third-party observability platforms complement Azure Monitor for teams with multi-cloud tooling.
Node Auto Provisioning (NAP) with Wilson Darko: Deep dive into Node Auto-Provisioning best practices — covering node class selection, disruption budgets, and workload-aware scheduling. Essential viewing for teams migrating from cluster autoscaler to NAP.
Pod CIDR Expansion on Azure CNI Overlay: Demonstrates how to expand pod CIDR ranges on existing Azure CNI Overlay clusters without recreation. Directly addresses the scaling constraint where clusters run out of pod IPs.
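The IP constraint is easy to quantify: Azure CNI Overlay assigns each node a /24 block from the pod CIDR, so the CIDR size caps the node count. The sketch below uses Python's standard `ipaddress` module to estimate that cap and to check a proposed expansion. The helper names are hypothetical, and the rule that the new range must fully contain the old one is an assumption to verify against the current AKS documentation.

```python
import ipaddress

NODE_PREFIX = 24  # Azure CNI Overlay carves one /24 per node out of the pod CIDR


def max_nodes(pod_cidr, node_prefix=NODE_PREFIX):
    """Upper bound on node count for a given pod CIDR."""
    network = ipaddress.ip_network(pod_cidr)
    if network.prefixlen > node_prefix:
        return 0  # the CIDR is smaller than a single per-node block
    return 2 ** (node_prefix - network.prefixlen)


def is_valid_expansion(old_cidr, new_cidr):
    """Assumed rule: the new CIDR must be larger and fully contain the old one."""
    old = ipaddress.ip_network(old_cidr)
    new = ipaddress.ip_network(new_cidr)
    return old != new and old.subnet_of(new)


before = max_nodes("10.244.0.0/16")
after = max_nodes("10.244.0.0/15")
expansion_ok = is_valid_expansion("10.244.0.0/16", "10.244.0.0/15")
disjoint_ok = is_valid_expansion("10.244.0.0/16", "10.0.0.0/15")
```

Moving from a /16 to a /15 doubles the ceiling from 256 to 512 nodes in this model. Real clusters hit other limits first, such as the per-node pod maximum and any overlap between the new range and existing VNet or service CIDRs.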
AKS Day EU 2026: Daniel Sol – AKS Everywhere: KubeCon EU Azure Pre-Day session on running AKS across cloud, edge, and hybrid environments with Azure Arc integration.
AKS Day EU 2026: Mitch Connors – Networking: KubeCon EU Azure Pre-Day session covering the AKS networking roadmap, Gateway API adoption, and advanced networking features.
AKS Day EU 2026: Anson Qian – AI Infrastructure: KubeCon EU Azure Pre-Day session on GPU scheduling, KAITO, and AI workload patterns on AKS.
AKS Day EU 2026: Ralph Squillace – AKS AI Inference Platform: KubeCon EU Azure Pre-Day session on building production inference platforms with AKS, covering model serving, scaling, and cost optimization for LLM workloads.
AKS Day EU 2026: Jorge Palma – Keynote: KubeCon EU Azure Pre-Day keynote covering AKS platform strategy, investment areas, and roadmap highlights for 2026.
AKS Day EU 2026: Weinong Wang – Sovereign AKS: KubeCon EU Azure Pre-Day session on sovereign cloud requirements and how AKS addresses data residency, compliance, and regulatory constraints.
Upgrade Issues 101 with Mosbah Majed: A troubleshooting-focused session covering common AKS upgrade failures — from version skew issues to addon compatibility problems. Practical remediation steps included.
MVP Summit 2026: Dinant Paardenkooper on AKS Security: An MVP-led discussion on AKS security posture management, covering pod security standards, network policies, and secure platform operations for enterprise environments.
May 2026 was dominated by a security imperative, the CVE-2026-31431 kernel vulnerability, and a networking transition, the move from Ingress NGINX to Gateway API.
For platform teams, the immediate priorities are clear: patch CVE-2026-31431, plan the Gateway API migration, and evaluate the new GPU metrics if running inference workloads.