Kubernetes 1.25+ on Azure: Addressing Reported Memory Usage Increases


Microsoft Azure Kubernetes Service (AKS) clusters running Kubernetes version 1.25 or later may exhibit an increase in reported memory usage compared to earlier versions. Many users have observed this behavior, and understanding it requires looking at the underlying changes introduced in this Kubernetes version. While seemingly alarming, the increase in reported metrics doesn't necessarily signify a genuine surge in workload memory consumption; rather, it reflects a change in how that consumption is measured and reported by the operating system's kernel.

The core of this issue lies within the Linux control group (cgroup) API, a kernel feature used by container runtimes and Kubernetes to manage and isolate resources such as CPU, memory, and I/O for processes and process groups. Kubernetes leverages cgroups extensively to enforce resource requests and limits defined for pods and containers. Starting with Kubernetes 1.25, AKS clusters transitioned to using cgroup version 2 (cgroup v2) as the default, replacing the older cgroup v1. This fundamental shift in resource accounting mechanisms is the primary driver behind the observed discrepancies in reported memory figures.
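
Before digging further, it helps to confirm which cgroup version a given node is actually running. A minimal check, assuming you can run kubectl debug against the node (the node name is a placeholder), is to look at the filesystem type mounted at /sys/fs/cgroup:

    # Open a debug shell on the node; replace <node-name> with a name from "kubectl get nodes".
    kubectl debug node/<node-name> -it --image=ubuntu

    # Inside the debug pod, the node's root filesystem is mounted at /host.
    stat -fc %T /host/sys/fs/cgroup
    # "cgroup2fs" indicates cgroup v2; "tmpfs" indicates cgroup v1.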

Symptoms of Increased Reported Memory Usage

Users encountering this issue typically observe one or more distinct symptoms that point towards higher memory utilization on their Kubernetes nodes or within their pods. These symptoms can sometimes lead to operational challenges if not properly understood.

The most direct symptom is seeing pods report increased memory usage. This might be visible through monitoring tools that collect metrics from the Kubernetes API or the container runtime. While individual pod consumption might appear higher, the aggregate effect is often more noticeable at the node level.

Executing the kubectl top node command, a common tool for quickly assessing node resource usage, might show significantly higher memory percentages for nodes running Kubernetes 1.25+ compared to nodes of the same size running earlier versions. This higher reported node usage can trigger alerts in monitoring systems configured with static thresholds, even if the node’s actual workload hasn’t changed.
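
For example, comparing node-level and pod-level figures with kubectl top (which relies on metrics-server, deployed by default on AKS) makes the jump easy to spot after an upgrade. The namespace below is a placeholder:

    # Per-node CPU and memory usage as reported through metrics-server.
    kubectl top node

    # Per-pod usage in a namespace, sorted by memory to surface the largest consumers.
    kubectl top pod -n <namespace> --sort-by=memory

The node percentages are relative to allocatable memory, so the same workload can show a noticeably higher percentage on a 1.25+ node simply because cgroup v2 attributes more memory to it.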

A more critical symptom is experiencing increased pod evictions due to perceived memory pressure on a node. The Kubernetes kubelet agent, running on each node, monitors resource usage and evicts pods to reclaim resources when memory pressure is detected, ensuring node stability. If the underlying reporting mechanism (cgroup v2) indicates higher memory usage than before for the same workload, the kubelet might incorrectly perceive memory pressure and prematurely evict pods, leading to workload disruption and instability.
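
If you suspect this is happening, two quick checks (sketches requiring nothing beyond cluster access) show whether the kubelet has actually been evicting pods:

    # Evicted pods end up in the Failed phase; their status shows the eviction reason.
    kubectl get pods --all-namespaces --field-selector=status.phase=Failed

    # Recent eviction events emitted by the kubelet, newest last.
    kubectl get events --all-namespaces --field-selector=reason=Evicted --sort-by=.lastTimestamp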

Cause: The Shift to cgroup v2 Memory Accounting

As mentioned, the root cause is the adoption of cgroup v2 as the default in Kubernetes 1.25 on AKS. Cgroup v2 represents a significant redesign from cgroup v1, offering a unified hierarchy and improved resource control features. However, its method of accounting for memory usage differs from its predecessor.

In cgroup v1, memory usage metrics were reported in a way that often excluded certain types of memory, such as some forms of cached file data (page cache) that could potentially be reclaimed by the kernel. This provided a view of memory consumption that focused more on “actively used” or “anonymous” memory (memory not backed by a file, like heap or stack).

Cgroup v2, on the other hand, tends to include a broader range of memory categories in its primary usage metrics. Specifically, metrics like memory.stat in cgroup v2 provide a more comprehensive view that often includes significant portions of the page cache associated with the processes within the cgroup. This means that even if the total physical memory usage on the node (including all caches) hasn’t changed, the amount attributed to specific pods or the node as a whole by the cgroup v2 mechanism appears higher because it’s counting memory that cgroup v1 typically didn’t include in the main usage figures.

The kubelet and other monitoring tools read memory usage data from the cgroup filesystem (/sys/fs/cgroup). When they switch from reading v1 structures and metrics to v2 structures and metrics, they start reporting the higher values provided by v2. This change in the source of the memory usage data, rather than an actual increase in workload memory needs, is what causes the reported values to climb after the upgrade. This is a known characteristic of cgroup v2 and affects various Linux distributions and container orchestration platforms adopting it.
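
You can see the structural difference directly on a node. The sketch below assumes a kubectl debug session on the node (node name is a placeholder); the exact pod directory names underneath depend on the cgroup driver, so this only illustrates the layout:

    # Open a shell on the node and switch into the host filesystem.
    kubectl debug node/<node-name> -it --image=ubuntu
    chroot /host

    # cgroup v2: one unified hierarchy; memory interface files sit alongside other controllers.
    ls /sys/fs/cgroup/

    # cgroup v1 (for comparison): a separate hierarchy per controller, e.g. a memory/ directory.
    ls /sys/fs/cgroup/memory/ 2>/dev/null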

Deeper Understanding: cgroup v1 vs cgroup v2 Memory Metrics

To further illustrate the difference, let’s briefly look at how cgroup files typically report memory.

In cgroup v1, you might look at files like memory.usage_in_bytes within a cgroup directory. This file primarily reported anonymous memory and some file-backed memory, but often excluded significant amounts of reclaimable cache. Other files like memory.stat provided more detailed breakdowns, but the primary usage metric often reflected a narrower scope.

In cgroup v2, the detailed memory usage information lives in the memory.stat file within the unified hierarchy (alongside a memory.current file that reports total charged usage). memory.stat contains various counters, including anon (anonymous memory) and file (file-backed memory, including page cache). Crucially, the totals derived from these v2 metrics give more weight to file-backed memory than v1's primary usage figure typically did. Because page cache can be substantial, particularly for applications that perform heavy file I/O or load large libraries and executables, this directly contributes to the higher reported numbers.
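
To see this for yourself, you can read the relevant files from inside a container on each cgroup version. The paths below are the standard mount points; whether a container sees its own cgroup at the root of /sys/fs/cgroup depends on the runtime's cgroup namespace setup, so treat this as an illustrative sketch rather than a universal recipe:

    # On a cgroup v2 node (run inside the container):
    cat /sys/fs/cgroup/memory.current                       # total charged memory, in bytes
    grep -E '^(anon|file|inactive_file) ' /sys/fs/cgroup/memory.stat

    # On a cgroup v1 node (run inside the container):
    cat /sys/fs/cgroup/memory/memory.usage_in_bytes
    grep -E '^total_(cache|rss|inactive_file) ' /sys/fs/cgroup/memory/memory.stat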

Consider a scenario where a pod is running an application that reads large files or uses memory-mapped files. In cgroup v1, the memory used for caching these files might not be fully reflected in memory.usage_in_bytes. In cgroup v2, the same memory might be counted towards the pod’s total usage via the file metric within memory.stat, leading to a higher reported number for the pod, even if the actual memory pressure on the system hasn’t changed. This difference in accounting highlights why monitoring tools and eviction thresholds need to be aware of which cgroup version is in use.
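
The figure most tooling ultimately acts on is the "working set", which is roughly the charged usage minus reclaimable inactive file cache. A minimal sketch of that calculation for the current cgroup on a v2 node, assuming the files are readable at the standard paths, looks like this:

    #!/bin/sh
    # Approximate the working set for the current cgroup (cgroup v2 paths).
    usage=$(cat /sys/fs/cgroup/memory.current)
    inactive_file=$(awk '$1 == "inactive_file" {print $2}' /sys/fs/cgroup/memory.stat)
    working_set=$((usage - inactive_file))
    echo "usage=${usage} inactive_file=${inactive_file} working_set=${working_set}"

This roughly mirrors how cAdvisor and the kubelet derive working-set memory; the subtraction is the same on both cgroup versions, but the usage figure it starts from is what changed.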

This change impacts not only the raw memory usage numbers but also how metrics are aggregated and interpreted by tools like kubectl top, monitoring agents, and the kubelet’s eviction manager. A tool expecting v1 output might misinterpret v2 data, or thresholds set based on v1 values might become too aggressive when applied to v2 reporting.

Impact on Kubernetes Operations

The transition to cgroup v2 and its different memory accounting directly influences Kubernetes operations in several ways:

  1. Monitoring Discrepancies: Existing monitoring dashboards and alerts configured with thresholds based on cgroup v1 reported memory usage might show nodes or pods consistently exceeding thresholds after upgrading to 1.25+, leading to alert fatigue or false positives. Operators need to recalibrate their expectations and potentially adjust thresholds based on the new reporting mechanism.
  2. Eviction Sensitivity: The kubelet uses memory usage metrics to determine whether a node is under memory pressure and whether pods need to be evicted. If cgroup v2 reports higher usage, the kubelet might trigger evictions earlier than it would have with cgroup v1, assuming the default eviction thresholds (--eviction-hard or --eviction-soft) remain unchanged. This can lead to increased workload instability and churn. (You can inspect the thresholds a node's kubelet is actually using, as shown after this list.)
  3. Resource Planning Perception: While the actual memory needed by a workload might not have changed, the higher reported usage might influence future resource planning decisions, potentially leading to over-provisioning of node memory based on misinterpreted data.
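
To see the eviction thresholds a node's kubelet is actually running with, you can query the kubelet's effective configuration through the API server. The node name is a placeholder, and jq is used only to make the output readable:

    # Dump the node's effective kubelet configuration and pull out the eviction settings.
    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
      | jq '.kubeletconfig | {evictionHard, evictionSoft, evictionSoftGracePeriod}'

Comparing the memory.available threshold in that output with the node's genuinely available memory tells you how close the node really is to triggering evictions.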

It’s essential to differentiate between higher reported memory usage and genuinely increased memory pressure. If the increased reported usage is solely due to cgroup v2’s accounting of reclaimable cache and the node is not actually struggling (e.g., no excessive swapping, high kernel memory usage, or OOM kills unrelated to specific pod limits), then the higher number itself is less of a concern than the symptoms it might cause (like unwarranted evictions).
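
To separate the two cases, check the node's own view of memory pressure rather than the reported usage alone. Two hedged checks, with the node name as a placeholder:

    # The kubelet sets the MemoryPressure condition when it considers the node under pressure.
    kubectl describe node <node-name> | grep MemoryPressure

    # From a node debug pod, check how much memory is genuinely available to the node;
    # MemAvailable already accounts for reclaimable cache.
    kubectl debug node/<node-name> -it --image=ubuntu -- \
      grep -E 'MemTotal|MemAvailable' /proc/meminfo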

Solutions and Mitigations

Addressing the reported memory usage increase and its potential consequences requires a multi-faceted approach, ranging from workload-specific adjustments to system-level configurations and software updates.

One straightforward approach, if you are experiencing frequent memory pressure or evictions, is to move to a node pool VM size (SKU) with more memory. This raises the absolute amount of memory available on each node. With more physical memory available, the node can accommodate the same workloads, even with the higher reported usage from cgroup v2, before hitting critical memory pressure thresholds. This is a capacity-based solution and might not be the most cost-effective option if the issue is purely reporting-related.
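
For example, you could add a node pool backed by a larger, memory-heavier VM size and move workloads onto it. All names and the SKU below are placeholder values to adapt to your environment:

    # Add a node pool with a larger VM size (illustrative values).
    az aks nodepool add \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name biggermem \
      --node-vm-size Standard_E4s_v5 \
      --node-count 3

    # Once workloads have been rescheduled, cordon, drain, and delete the old node pool.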

If increased pod evictions are the primary symptom, particularly for specific workloads, review and potentially increase the memory limits and requests defined for those pods. Kubernetes uses these values for scheduling and resource management. Setting appropriate, and potentially higher, memory limits for pods ensures that the kubelet allows pods sufficient memory before considering throttling or eviction at the pod level (due to hitting their specific limits). However, this doesn’t directly address the node-level reported usage increase from cgroup v2 that causes node pressure evictions. Properly set requests and limits are a best practice regardless of the cgroup version, contributing to better scheduling and resource predictability.
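
A minimal sketch of setting explicit memory requests and limits on an existing Deployment; the Deployment name and the values are placeholders and should be derived from your workload's observed working set:

    # Set explicit requests and limits on a Deployment's containers (illustrative values).
    kubectl set resources deployment/<deployment-name> \
      --requests=memory=256Mi,cpu=250m \
      --limits=memory=512Mi,cpu=500m

    # Confirm the rollout picked up the new resource settings.
    kubectl rollout status deployment/<deployment-name>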

A critical aspect of migrating to cgroup v2 environments is ensuring that all software running on the nodes, especially those interacting directly with cgroups or relying on resource usage metrics, is compatible. Update third-party applications and agents to versions that explicitly support cgroup v2.

  • Third-party monitoring and security agents: Many agents deployed as DaemonSets or part of the node image directly read cgroup files to gather metrics or enforce policies. Agents designed for cgroup v1 might fail to collect data correctly or misinterpret cgroup v2 structures. Check vendor documentation for compatible agent versions for Kubernetes 1.25+ and cgroup v2. Deploying outdated agents can lead to broken monitoring or ineffective security policies.
  • Java applications: Older versions of the Java Virtual Machine (JVM), particularly the HotSpot JVM before certain updates, had limitations in detecting container resource limits when running inside cgroups, and cgroup v2 compounded these issues. Using JVM versions that fully support cgroup v2 is essential for Java applications to correctly understand their memory constraints within a container. Recommended versions include OpenJDK/HotSpot jdk8u372, 11.0.16, 15, and later; IBM Semeru Runtimes 8.0.382.0, 11.0.20.0, 17.0.8.0, and later; and IBM Java 8.0.8.6 and later. Running older JVMs can lead to the application exceeding its container memory limit and being killed with an OOMKilled error. (A quick detection check is shown after this list.)
  • uber-go/automaxprocs: This Go package automatically sets the Go runtime's GOMAXPROCS value to match the container's CPU quota, which it reads from the cgroup filesystem. If you use this package in your Go applications, ensure you are using version v1.5.1 or later, which adds support for reading CPU limits from cgroup v2. (A version check is shown after this list.)
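
Two quick compatibility checks related to the Java and Go items above, offered as sketches: the first asks the JVM which operating system and container limits it detects (supported by the JVM versions listed above), and the second reports which automaxprocs version a Go module resolves to.

    # Inside the Java container: print the operating system / container metrics the JVM detected.
    java -XshowSettings:system -version

    # In the Go module's source directory: confirm the resolved automaxprocs version is v1.5.1 or later.
    go list -m go.uber.org/automaxprocs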

As a temporary workaround while updating applications or assessing the impact, you can revert the cgroup version on your nodes back to cgroup v1. This can typically be achieved by deploying a DaemonSet that modifies the node’s boot parameters or kernel configuration to prefer cgroup v1. Azure provides an example DaemonSet for this purpose. However, this is explicitly a temporary solution. Cgroup v1 is considered deprecated upstream in Linux and Kubernetes and will eventually be removed. Relying on this workaround prevents you from leveraging improvements in cgroup v2 and may not be supported indefinitely. It should only be used to mitigate immediate issues (like excessive evictions) while you implement the necessary updates and adjustments for cgroup v2 compatibility.

It is important to note that if you observe only an increase in reported memory usage figures without accompanying symptoms like increased evictions, performance degradation, or OOM kills, you might not need to take immediate action. The higher number itself, if it reflects reclaimable cache and not actual memory pressure, is less of a concern than the operational issues it might inadvertently trigger (like alert storms or premature evictions based on old thresholds). Carefully evaluate the actual impact on your workloads and node stability before implementing solutions.

Status and Future Resolution

The issue of higher reported memory usage with cgroup v2 and its implications for Kubernetes, including potential impacts on metrics and eviction, is a known area of discussion and development within the upstream Kubernetes community. Efforts are underway to improve how Kubernetes handles resource accounting with cgroup v2 and potentially adjust default behaviors or recommendations.

Microsoft is actively engaged with the Kubernetes community to contribute to resolving the underlying concerns related to cgroup v2 memory reporting and its interaction with kubelet eviction policies. Progress on these community efforts can often be tracked through relevant issues in the Kubernetes GitHub repository or the Azure/AKS issue tracker.

Future resolutions from the Kubernetes project or AKS might include adjustments to the default eviction thresholds to better align with cgroup v2’s reporting characteristics, or updates to how resource reservations and allocatable capacity are calculated or interpreted. These changes aim to ensure that kubelet behavior remains consistent and nodes function reliably under cgroup v2.

For users, staying informed about updates to AKS node images and Kubernetes patch versions is crucial, as they may include fixes or adjustments related to this behavior. Consulting the AKS release notes for specific version updates is recommended.

We understand that changes in resource reporting can be confusing and potentially disruptive. By providing this information, we aim to help you understand the cause of the increased reported memory usage in AKS clusters on Kubernetes 1.25+ and provide actionable steps to address associated symptoms.

Do you have further questions or experiences related to this issue on your AKS clusters? Share them below!
