Kubernetes 1.25+ on Azure: Addressing Reported Memory Usage Increases
Microsoft Azure Kubernetes Service (AKS) clusters running Kubernetes version 1.25 or later may exhibit an increase in reported memory usage compared to earlier versions. This phenomenon has been observed by many users and requires understanding the underlying changes introduced in this Kubernetes version. While seemingly alarming, this increase in reported metrics doesn’t necessarily signify a genuine surge in workload memory consumption but rather a change in how that consumption is measured and reported by the operating system’s kernel.
The core of this issue lies within the Linux control group (cgroup) API, a kernel feature used by container runtimes and Kubernetes to manage and isolate resources such as CPU, memory, and I/O for processes and process groups. Kubernetes leverages cgroups extensively to enforce the resource requests and limits defined for pods and containers. Starting with Kubernetes 1.25, AKS clusters transitioned to cgroup version 2 (cgroup v2) as the default, replacing the older cgroup v1. This fundamental shift in resource accounting mechanisms is the primary driver behind the observed discrepancies in reported memory figures.
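If you want to confirm which cgroup version a given node is actually running, one quick check is to inspect the filesystem type mounted at /sys/fs/cgroup from a debug shell on the node. The node name below is a placeholder, and the ubuntu debug image is just one convenient choice:

```bash
# Open a debug shell on the node (node name is a placeholder).
kubectl debug node/aks-nodepool1-00000000-vmss000000 -it --image=ubuntu -- chroot /host sh

# Inside the node shell: report the filesystem type of the cgroup mount.
# "cgroup2fs" indicates cgroup v2; "tmpfs" indicates the cgroup v1 hierarchy.
stat -fc %T /sys/fs/cgroup/
```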
Symptoms of Increased Reported Memory Usage
Users encountering this issue typically observe one or more distinct symptoms that point towards higher memory utilization on their Kubernetes nodes or within their pods. These symptoms can sometimes lead to operational challenges if not properly understood.
The most direct symptom is seeing pods report increased memory usage. This might be visible through monitoring tools that collect metrics from the Kubernetes API or the container runtime. While individual pod consumption might appear higher, the aggregate effect is often more noticeable at the node level.
Executing the kubectl top node command, a common way to quickly assess node resource usage, might show significantly higher memory percentages for nodes running Kubernetes 1.25+ than for nodes of the same size running earlier versions. This higher reported node usage can trigger alerts in monitoring systems configured with static thresholds, even if the node’s actual workload hasn’t changed.
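As a quick sanity check, you can put the reported usage side by side with each node’s allocatable memory before reacting to an alert:

```bash
# Reported memory usage and percentage per node.
kubectl top nodes

# Allocatable memory per node, for comparison with the values reported above.
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory
```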
A more critical symptom is increased pod evictions due to perceived memory pressure on a node. The Kubernetes kubelet agent, running on each node, monitors resource usage and evicts pods to reclaim resources when memory pressure is detected, ensuring node stability. If the underlying reporting mechanism (cgroup v2) indicates higher memory usage than before for the same workload, the kubelet might incorrectly perceive memory pressure and prematurely evict pods, leading to workload disruption and instability.
Cause: The Shift to cgroup v2 Memory Accounting
As mentioned, the root cause is the adoption of cgroup v2 as the default in Kubernetes 1.25 on AKS. Cgroup v2 represents a significant redesign from cgroup v1, offering a unified hierarchy and improved resource control features. However, its method of accounting for memory usage differs from its predecessor.
In cgroup v1, memory usage metrics were reported in a way that often excluded certain types of memory, such as some forms of cached file data (page cache) that the kernel could reclaim. This provided a view of memory consumption that focused more on “actively used” or “anonymous” memory (memory not backed by a file, such as heap or stack).
Cgroup v2, on the other hand, tends to include a broader range of memory categories in its primary usage metrics. Specifically, metrics such as memory.stat in cgroup v2 provide a more comprehensive view that often includes significant portions of the page cache associated with the processes in the cgroup. This means that even if total physical memory usage on the node (including all caches) hasn’t changed, the amount attributed to specific pods, or to the node as a whole, by the cgroup v2 mechanism appears higher, because it counts memory that cgroup v1 typically didn’t include in its main usage figures.
The kubelet and other monitoring tools read memory usage data from the cgroup filesystem (/sys/fs/cgroup). When they switch from reading v1 structures and metrics to v2 structures and metrics, they start reporting the higher values provided by v2. This change in the source of the memory usage data, rather than an actual increase in workload memory needs, is what causes reported values to climb after the upgrade. This is a known characteristic of cgroup v2 and affects various Linux distributions and container orchestration platforms adopting it.
Deeper Understanding: cgroup v1 vs cgroup v2 Memory Metrics
To further illustrate the difference, let’s briefly look at how cgroup files typically report memory.
In cgroup v1, you might look at files such as memory.usage_in_bytes within a cgroup directory. This file primarily reported anonymous memory and some file-backed memory, but often excluded significant amounts of reclaimable cache. Other files, such as memory.stat, provided more detailed breakdowns, but the primary usage metric often reflected a narrower scope.
In cgroup v2, the primary detailed memory usage information is found in the memory.stat file within a unified hierarchy. This file contains various counters, including anon (anonymous memory) and file (file-backed memory, including page cache). Crucially, the total reported memory usage derived from these v2 metrics often implicitly includes, or gives more prominence to, the file memory that was less emphasized in v1’s main usage figures. This inclusion of page cache, which can be substantial, particularly for applications that perform a lot of file I/O or load large libraries and executables, directly contributes to the higher reported numbers.
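On a cgroup v2 node, the same kind of inspection uses the unified hierarchy. The sketch below assumes the kubepods.slice layout used by the systemd cgroup driver; adjust the path for your node:

```bash
# cgroup v2: single unified hierarchy; memory.current is the total memory charged
# to the cgroup (anonymous memory plus page cache and attributed kernel memory).
cat /sys/fs/cgroup/kubepods.slice/memory.current

# Per-category counters, including anon, file, and inactive_file.
grep -E '^(anon|file|inactive_file) ' /sys/fs/cgroup/kubepods.slice/memory.stat

# A rough "working set" in the style the kubelet/cAdvisor use: charged memory
# minus the inactive (easily reclaimable) file cache.
usage=$(cat /sys/fs/cgroup/kubepods.slice/memory.current)
inactive=$(grep '^inactive_file ' /sys/fs/cgroup/kubepods.slice/memory.stat | awk '{print $2}')
echo "approx working set bytes: $((usage - inactive))"
```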
Consider a scenario where a pod runs an application that reads large files or uses memory-mapped files. In cgroup v1, the memory used for caching those files might not be fully reflected in memory.usage_in_bytes. In cgroup v2, the same memory might be counted toward the pod’s total usage via the file metric in memory.stat, leading to a higher reported number for the pod even if the actual memory pressure on the system hasn’t changed. This difference in accounting highlights why monitoring tools and eviction thresholds need to be aware of which cgroup version is in use.
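One hedged way to see this effect for yourself is to generate some file I/O inside a container on a cgroup v2 node and watch the file counter grow, even though the application’s heap usage is unchanged. The pod name is a placeholder, and this assumes the image includes a shell and dd and that /tmp is writable:

```bash
# Inside a running container on a cgroup v2 node (pod name is a placeholder).
kubectl exec -it my-app-pod -- sh -c '
  # Snapshot the file-backed (page cache) counter before the I/O.
  grep "^file " /sys/fs/cgroup/memory.stat
  # Write and read roughly 256 MiB of file data to populate the page cache.
  dd if=/dev/zero of=/tmp/cachetest bs=1M count=256 && cat /tmp/cachetest > /dev/null
  # The file counter (and therefore the reported usage) should now be noticeably higher.
  grep "^file " /sys/fs/cgroup/memory.stat
'
```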
This change impacts not only the raw memory usage numbers but also how metrics are aggregated and interpreted by tools such as kubectl top, monitoring agents, and the kubelet’s eviction manager. A tool expecting v1 output might misinterpret v2 data, and thresholds set based on v1 values might become too aggressive when applied to v2 reporting.
Impact on Kubernetes Operations
The transition to cgroup v2 and its different memory accounting directly influences Kubernetes operations in several ways:
- Monitoring Discrepancies: Existing monitoring dashboards and alerts configured with thresholds based on cgroup v1 reported memory usage might show nodes or pods consistently exceeding those thresholds after upgrading to 1.25+, leading to alert fatigue or false positives. Operators need to recalibrate their expectations and potentially adjust thresholds based on the new reporting mechanism.
- Eviction Sensitivity: The kubelet uses memory usage metrics to determine whether a node is under memory pressure and whether pods need to be evicted. If cgroup v2 reports higher usage, the kubelet might trigger evictions earlier than it would have with cgroup v1, assuming the default eviction thresholds (--eviction-hard or --eviction-soft) remain unchanged. This can lead to increased workload instability and churn. You can inspect a node’s effective eviction thresholds as shown in the sketch after this list.
- Resource Planning Perception: While the actual memory needed by a workload might not have changed, the higher reported usage might influence future resource planning decisions, potentially leading to over-provisioning of node memory based on misinterpreted data.
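To see which eviction thresholds a given node’s kubelet is actually running with, you can query the kubelet’s configuration through the API server proxy. The node name is a placeholder, and jq is assumed to be available:

```bash
# Dump the effective kubelet configuration for one node (node name is a placeholder)
# and show only the eviction-related settings.
kubectl get --raw "/api/v1/nodes/aks-nodepool1-00000000-vmss000000/proxy/configz" \
  | jq '.kubeletconfig | {evictionHard, evictionSoft, evictionSoftGracePeriod}'
```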
It’s essential to differentiate between higher reported memory usage and genuinely increased memory pressure. If the increased reported usage is solely due to cgroup v2’s accounting of reclaimable cache, and the node is not actually struggling (e.g., no excessive swapping, high kernel memory usage, or OOM kills unrelated to specific pod limits), then the higher number itself is less of a concern than the symptoms it might cause (like unwarranted evictions).
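Two quick checks can help you tell the two apart: the node condition Kubernetes itself reports, and, on the node, the kernel’s pressure stall information, which reflects real reclaim stalls rather than raw usage. The node name is a placeholder:

```bash
# Does Kubernetes itself report memory pressure on the node? (node name is a placeholder)
kubectl describe node aks-nodepool1-00000000-vmss000000 | grep -i memorypressure

# From a shell on the node: pressure stall information (PSI) for memory.
# Sustained non-zero "some"/"full" averages indicate genuine reclaim pressure;
# a large page cache alone does not move these numbers.
cat /proc/pressure/memory
```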
Solutions and Mitigations
Addressing the reported memory usage increase and its potential consequences requires a multi-faceted approach, ranging from workload-specific adjustments to system-level configurations and software updates.
One straightforward approach, if you are experiencing frequent memory pressure or evictions, is to upgrade your node pool to a VM SKU with more memory. This effectively raises the absolute amount of memory available on the node. With more physical memory available, the node can accommodate the same workloads, even with the higher reported usage from cgroup v2, before hitting critical memory pressure thresholds. This is a capacity-based solution but might not be the most cost-effective if the issue is purely reporting-related.
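For example, with the Azure CLI you could add a node pool backed by a larger VM size and migrate workloads to it. The resource group, cluster name, pool names, and VM size below are all placeholders:

```bash
# Add a new node pool with a larger-memory VM size (all names and sizes are placeholders).
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks-cluster \
  --name bigmem \
  --node-vm-size Standard_E4s_v5 \
  --node-count 3

# Once workloads have moved, the smaller pool can be removed.
az aks nodepool delete \
  --resource-group my-rg \
  --cluster-name my-aks-cluster \
  --name nodepool1
```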
If increased pod evictions are the primary symptom, particularly for specific workloads, review and potentially increase the memory requests and limits defined for those pods. Kubernetes uses these values for scheduling and resource management. Setting appropriate, and potentially higher, memory limits ensures that the kubelet allows pods sufficient memory before considering throttling or eviction at the pod level (due to hitting their specific limits). However, this doesn’t directly address the node-level reported usage increase from cgroup v2 that causes node-pressure evictions. Properly set requests and limits are a best practice regardless of the cgroup version, contributing to better scheduling and resource predictability.
A critical aspect of migrating to cgroup v2 environments is ensuring that all software running on the nodes, especially software that interacts directly with cgroups or relies on resource usage metrics, is compatible. Update third-party applications and agents to versions that explicitly support cgroup v2.
- Third-party monitoring and security agents: Many agents deployed as DaemonSets or as part of the node image read cgroup files directly to gather metrics or enforce policies. Agents designed for cgroup v1 might fail to collect data correctly or misinterpret cgroup v2 structures. Check vendor documentation for agent versions compatible with Kubernetes 1.25+ and cgroup v2. Deploying outdated agents can lead to broken monitoring or ineffective security policies.
- Java applications: Older versions of Java Virtual Machines (JVMs), particularly the HotSpot JVM before certain updates, had limitations in detecting container resource limits when running inside cgroups, and with cgroup v2 these issues were compounded. Using JVM versions that fully support cgroup v2 is essential for Java applications to correctly understand their memory constraints within a container. Recommended versions include OpenJDK/HotSpot jdk8u372, 11.0.16, 15, and later; IBM Semeru Runtimes 8.0.382.0, 11.0.20.0, 17.0.8.0, and later; and IBM Java 8.0.8.6 and later. Running an older JVM can lead to the application exceeding its container memory limit and being killed with an OOMKilled error. A quick way to verify what the JVM sees is shown in the sketch after this list.
- uber-go/automaxprocs: This Go package automatically sets GOMAXPROCS (the Go runtime's CPU parallelism setting) based on container CPU limits, typically read from the cgroup filesystem. If you use this package in your Go applications, ensure you are using version v1.5.1 or later, which adds support for reading CPU limits from cgroup v2.
As a temporary workaround while you update applications or assess the impact, you can revert the cgroup version on your nodes back to cgroup v1. This can typically be achieved by deploying a DaemonSet that modifies the node’s boot parameters or kernel configuration to prefer cgroup v1. Azure provides an example DaemonSet for this purpose. However, this is explicitly a temporary solution. Cgroup v1 is considered deprecated upstream in Linux and Kubernetes and will eventually be removed. Relying on this workaround prevents you from leveraging improvements in cgroup v2 and may not be supported indefinitely. It should only be used to mitigate immediate issues (like excessive evictions) while you implement the necessary updates and adjustments for cgroup v2 compatibility.
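For context only, the kind of change such a DaemonSet applies boils down to a kernel command-line switch. The snippet below is a minimal sketch of that mechanism, assuming an Ubuntu-based, systemd-booted node image; the file path and approach are illustrative assumptions and are not a substitute for Azure’s published example DaemonSet:

```bash
# Sketch only: force the node back to the cgroup v1 (hybrid) hierarchy by
# disabling systemd's unified cgroup hierarchy at boot, then reboot the node.
# Run from a privileged context on the node; drop-in path is an assumption,
# not Azure's official procedure.
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT systemd.unified_cgroup_hierarchy=0"' \
  | sudo tee /etc/default/grub.d/99-cgroupv1.cfg
sudo update-grub && sudo reboot
```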
It is important to note that if you observe only an increase in reported memory usage figures without accompanying symptoms like increased evictions, performance degradation, or OOM kills, you might not need to take immediate action. The higher number itself, if it reflects reclaimable cache and not actual memory pressure, is less of a concern than the operational issues it might inadvertently trigger (like alert storms or premature evictions based on old thresholds). Carefully evaluate the actual impact on your workloads and node stability before implementing solutions.
Status and Future Resolution
The issue of higher reported memory usage with cgroup v2, and its implications for Kubernetes metrics and eviction, is a known area of discussion and development within the upstream Kubernetes community. Efforts are underway to improve how Kubernetes handles resource accounting with cgroup v2 and potentially adjust default behaviors or recommendations.
Microsoft is actively engaged with the Kubernetes community to help resolve the underlying concerns related to cgroup v2 memory reporting and its interaction with kubelet eviction policies. Progress on these community efforts can often be tracked through relevant issues in the Kubernetes GitHub repository or the Azure/AKS issue tracker.
Future resolutions from the Kubernetes project or AKS might include adjustments to the default eviction thresholds to better align with cgroup v2’s reporting characteristics, or updates to how resource reservations and allocatable capacity are calculated or interpreted. These changes aim to ensure that kubelet behavior remains consistent and nodes function reliably under cgroup v2.
For users, staying informed about updates to AKS node images and Kubernetes patch versions is crucial, as they may include fixes or adjustments related to this behavior. Consulting the AKS release notes for specific version updates is recommended.
We understand that changes in resource reporting can be confusing and potentially disruptive. By providing this information, we aim to help you understand the cause of the increased reported memory usage in AKS clusters on Kubernetes 1.25+ and provide actionable steps to address associated symptoms.
Do you have further questions or experiences related to this issue on your AKS clusters? Share them below!