Azure Linux VMs with 3.10 Kernel Experiencing Panics Post-Host Upgrade

This article addresses a critical issue affecting Azure Linux Virtual Machines (VMs) running certain 3.10-based kernel versions. It details a scenario where these VMs may experience unexpected system panics and become unresponsive following certain host node upgrade operations within the Azure infrastructure. Understanding this problem is crucial for maintaining the stability and reliability of your Linux workloads in the cloud.

Understanding the Problem

The core of this issue lies within the interaction between older Linux kernel versions and Azure’s underlying Hyper-V virtualization platform during specific maintenance events. While Azure strives for seamless host maintenance, some operations, particularly “Memory preserving updates,” can trigger unforeseen responses in guest operating systems with unpatched kernel vulnerabilities. This leads to system instability and, ultimately, a kernel panic, rendering the VM inoperable.

Symptoms of Kernel Panic

The symptoms of this issue are distinct and immediately impactful. When an affected Azure Linux VM undergoes a host node upgrade involving a memory-preserving update, the virtual machine will cease to respond to network requests or user input. This unresponsiveness is a direct consequence of a kernel panic, a severe error from which the operating system cannot recover without a restart. The evidence of this panic is clearly logged in the Linux serial log, providing valuable diagnostic information for administrators.
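
If boot diagnostics are enabled for the VM, that serial log can be retrieved without any guest access. A minimal sketch using the Azure CLI (YourVMName and YourResourceGroup are placeholders):

    az vm boot-diagnostics get-boot-log --name YourVMName --resource-group YourResourceGroup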

Affected Environments

This problem specifically affects Microsoft Azure Linux virtual machines running RHEL/CentOS-based distributions with a Linux kernel version earlier than 3.10.0-327.10.1. This includes, but is not limited to, the following distributions:

  • Red Hat Enterprise Linux (RHEL) 7.0 and 7.1
  • CentOS 7.0 and 7.1
  • Oracle Linux 7.0 and 7.1 utilizing the Red Hat-compatible kernel

It is imperative for users leveraging these older versions to be aware of the risk posed by Azure host maintenance operations. The stability of a cloud environment relies heavily on the guest OS’s ability to gracefully handle underlying infrastructure changes.
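
As a quick exposure check, you can compare the running kernel against the first fixed release from inside the guest. A minimal sketch, assuming the standard .el7 suffix in the kernel release string:

    #!/bin/bash
    # First release that contains the SCSI locking fix.
    fixed="3.10.0-327.10.1"
    # Strip the distro suffix, e.g. 3.10.0-229.el7.x86_64 -> 3.10.0-229
    current=$(uname -r | sed 's/\.el7.*//')
    # sort -V orders version strings; the first line is the older version.
    oldest=$(printf '%s\n%s\n' "$current" "$fixed" | sort -V | head -n1)
    if [ "$oldest" = "$current" ] && [ "$current" != "$fixed" ]; then
        echo "Kernel $current predates $fixed - this VM is affected."
    else
        echo "Kernel $current is at or beyond $fixed - not affected."
    fi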

The Memory Preserving Update Operation

A “Memory preserving update” in Azure refers to a type of host maintenance that aims to minimize downtime by allowing the host to be updated while the VM remains running. This is typically achieved through live migration or similar technologies that preserve the VM’s memory state. While highly efficient for most scenarios, this specific update type can expose weaknesses in older kernel versions that are not designed to robustly handle the subtle changes or events occurring at the virtualization layer during such operations. The events generated during these updates can exercise race conditions or faulty locking logic within the guest kernel, leading to the panic.

Decoding the Panic Log

When the VM panics, a stack trace is dumped to the serial log. This trace is a critical snapshot of the kernel’s state at the moment of failure, showing the sequence of function calls that led to the panic. Analyzing this log provides deep insights into the root cause.

The provided log snippet shows a clear Call Trace indicating a kernel panic:

[11480839.438577] Call Trace:
[11480839.439615] [<ffffffff816045b6>] dump_stack+0x19/0x1b
[11480839.441556] [<ffffffff8106e29b>] warn_slowpath_common+0x6b/0xb0
[11480839.443818] [<ffffffff8106e33c>] warn_slowpath_fmt+0x5c/0x80
[11480839.445983] [<ffffffff8123e585>] sysfs_add_one+0xa5/0xd0
[11480839.447983] [<ffffffff8123e77c>] create_dir+0x7c/0xe0
[11480839.449876] [<ffffffff8123eb29>] sysfs_create_dir+0xa9/0x130
[11480839.451971] [<ffffffff812d74ab>] kobject_add_internal+0xbb/0x2f0
[11480839.454310] [<ffffffff812d79e5>] kobject_add+0x75/0xd0
[11480839.456236] [<ffffffff813cfa85>] device_add+0x125/0x7a0
[11480839.458167] [<ffffffff813df9fc>] ? __pm_runtime_resume+0x5c/0x80
[11480839.460469] [<ffffffff813fe9cc>] scsi_sysfs_add_sdev+0xac/0x280
[11480839.462628] [<ffffffff813fcfbb>] do_scan_async+0x7b/0x150
[11480839.464632] [<ffffffff8109e849>] async_run_entry_fn+0x39/0x120
[11480839.467170] [<ffffffff8108f0cb>] process_one_work+0x17b/0x470
[11480839.469354] [<ffffffff8108fe9b>] worker_thread+0x11b/0x400
[11480839.472310] [<ffffffff8108fd80>] ? rescuer_thread+0x400/0x400
[11480839.475265] [<ffffffff8109727f>] kthread+0xcf/0xe0
[11480839.477904] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[11480839.481074] [<ffffffff81614358>] ret_from_fork+0x58/0x90
[11480839.483873] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[11480877.942369] ---[ end trace 1f7736c59e96a8a0 ]---
[11480877.942371] ------------[ cut here ]------------

This trace indicates an issue within device management, specifically around the sysfs, kobject, device, and SCSI subsystems.
  • dump_stack, warn_slowpath_common, warn_slowpath_fmt: These are standard kernel functions indicating that a warning or an error condition was met, leading to the stack trace being dumped.
  • sysfs_add_one, create_dir, sysfs_create_dir: These functions are related to the sysfs filesystem, which the kernel uses to export information about devices and kernel objects. The presence of these calls suggests a problem during the creation or management of device entries.
  • kobject_add_internal, kobject_add, device_add: kobjects are fundamental kernel data structures used for device and driver management. device_add is used to register a device with the kernel. A failure here points to an issue during device enumeration or initialization.
  • scsi_sysfs_add_sdev, do_scan_async: These are specifically related to the SCSI (Small Computer System Interface) subsystem, indicating a problem during the addition or scanning of a SCSI device. Given that virtual disks in Hyper-V (and thus Azure) are often presented as SCSI devices (e.g., VMBus SCSI), this is a strong indicator of an issue within the virtual disk handling.
  • async_run_entry_fn, process_one_work, worker_thread, kthread: These indicate that the failure occurred within a kernel work queue or a kernel thread, often responsible for asynchronous operations like device hot-plugging or scanning.
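
When reviewing a captured serial log offline, traces like these can be pulled out quickly with grep; the log file name below is a placeholder:

    # Show each trace header plus the 20 lines that follow it:
    grep -n -A 20 'Call Trace:' serial-console.log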

A second Call Trace often appears following the initial [cut here] marker:

[11480864.118093] Call Trace:
[11480864.118093] [<ffffffff815f2535>] klist_put+0x25/0xa0
[11480864.118093] [<ffffffff815f25be>] klist_del+0xe/0x10
[11480864.118093] [<ffffffff813ce908>] device_del+0x58/0x1f0
[11480864.118093] [<ffffffff813ceabe>] device_unregister+0x1e/0x60
[11480864.118093] [<ffffffff812c36ee>] bsg_unregister_queue+0x5e/0xa0
[11480864.118093] [<ffffffff813fec49>] __scsi_remove_device+0xa9/0xd0
[11480864.118093] [<ffffffff813fcfc7>] do_scan_async+0x87/0x150
[11480864.118093] [<ffffffff8109e849>] async_run_entry_fn+0x39/0x120
[11480864.118093] [<ffffffff8108f0cb>] process_one_work+0x17b/0x470
[11480864.118093] [<ffffffff8108fe9b>] worker_thread+0x11b/0x400
[11480864.118093] [<ffffffff8108fd80>] ? rescuer_thread+0x400/0x400
[11480864.118093] [<ffffffff8109727f>] kthread+0xcf/0xe0
[11480864.118093] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[11480864.118093] [<ffffffff81614358>] ret_from_fork+0x58/0x90
[11480864.118093] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140

This second trace confirms the issue around device removal, specifically:
  • klist_put, klist_del: Operations on kernel lists, suggesting cleanup or modification of linked lists of kernel objects.
  • device_del, device_unregister: Functions responsible for removing a device from the kernel’s device model.
  • bsg_unregister_queue: Refers to the Block SCSI Generic (BSG) layer, used for direct SCSI pass-through. Unregistering a queue implies a device is being removed or its interface is being decommissioned.
  • __scsi_remove_device: The explicit function for removing a SCSI device.

The combination of device addition (first trace) and device removal (second trace) in rapid succession, especially within the SCSI subsystem, strongly suggests a race condition or a faulty locking mechanism. This happens when the host update causes virtual SCSI devices to be momentarily detached and re-attached, and the older kernel’s handling of these events is not robust enough.
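
For illustration, a similar detach/re-attach sequence can be driven manually through sysfs from inside a guest. This is a sketch for a disposable test machine only; the device and host numbers are hypothetical, and it must not be run against a disk that is in use:

    # Ask the kernel to remove the SCSI device backing /dev/sdc (hypothetical device):
    echo 1 | sudo tee /sys/block/sdc/device/delete
    # Rescan the virtual SCSI host so the device is re-enumerated:
    echo '- - -' | sudo tee /sys/class/scsi_host/host0/scan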

Cause of the Kernel Panic

The underlying cause of this critical issue has been identified as a flaw in the locking logic within the Linux kernel’s SCSI subsystem. The flaw surfaces when a SCSI disk is logically “removed” from a running RHEL/CentOS-based VM guest during operations on a Microsoft Hyper-V host. While a user might not explicitly remove a disk, certain host maintenance procedures, such as a memory-preserving update, can induce events that the guest VM interprets as a disk detachment and subsequent re-attachment.

Faulty Locking Logic in SCSI Subsystem

The SCSI subsystem in the Linux kernel is responsible for managing interactions with various storage devices, including those presented virtually by a hypervisor. Kernel modules often use locking mechanisms (like spinlocks or mutexes) to protect shared data structures from concurrent access by multiple threads or CPU cores. If these locking mechanisms are implemented incorrectly, particularly in complex scenarios involving device hot-plugging or transient device states, it can lead to race conditions. A race condition occurs when the timing of operations can alter the outcome, potentially causing data corruption, deadlocks, or, in this case, a kernel panic. The faulty logic means that certain critical sections of code within the SCSI driver or its management layers are not adequately protected, allowing simultaneous access that corrupts internal states.

Impact of Hyper-V Environment

Microsoft Azure leverages Hyper-V as its core virtualization technology. In this environment, virtual machines interact with virtualized hardware, including virtual SCSI controllers and disks. During a memory-preserving host update, the Hyper-V hypervisor orchestrates a complex dance of state transfers and device re-enumeration for the running guest VMs. For older Linux kernels, particularly those with the identified SCSI locking flaw, these precise moments of virtual device detachment and re-attachment can expose the vulnerability. The kernel attempts to handle the “removal” of a SCSI device while concurrently processing other events or potentially encountering an inconsistent state due to the underlying host operation, leading to the panic.
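
From inside the guest, you can see how Azure presents disks through the virtual SCSI stack; on Hyper-V guests the entries typically report a Microsoft virtual disk vendor string:

    cat /proc/scsi/scsi
    # Or, if the lsscsi package is installed:
    lsscsi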

This issue highlights the importance of keeping guest operating system kernels updated, especially in highly dynamic virtualized environments where the interaction between the guest and host can be intricate and reveal latent bugs in older software.

Resolution Strategies

Addressing this problem involves both immediate recovery steps and long-term preventative measures to ensure the stability of your Azure Linux VMs. Prompt action is necessary to restore affected VMs, while strategic updates are crucial to prevent future occurrences.

Immediate Recovery: Manual VM Restart

The most direct and immediate way to resolve an active kernel panic is to manually restart the affected virtual machine. When a Linux kernel panics, the operating system effectively halts, and the VM becomes unresponsive. A manual restart forces the VM to reinitialize its kernel and re-establish a stable state. This can be done through the Azure portal, Azure CLI, or Azure PowerShell by initiating a “Restart” operation for the VM. While effective for recovery, this is a reactive solution and does not prevent the issue from recurring if the VM is still running an unpatched kernel and experiences another host upgrade.

Steps to Manually Restart an Azure VM:

  1. Azure Portal: Navigate to your Virtual Machine in the Azure portal. In the left-hand menu, under Operations, click on Restart. Confirm the action when prompted.
  2. Azure CLI:
    az vm restart --name YourVMName --resource-group YourResourceGroup
    

    Replace YourVMName and YourResourceGroup with your VM’s actual name and resource group.
  3. Azure PowerShell:
    Restart-AzVM -ResourceGroupName "YourResourceGroup" -Name "YourVMName"
    

    Again, replace placeholders with your specific details.
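
After issuing the restart, you can confirm that the VM has returned to a running state. A minimal sketch using the Azure CLI, with the same placeholder names as above:

    az vm get-instance-view --name YourVMName --resource-group YourResourceGroup \
        --query "instanceView.statuses[?starts_with(code, 'PowerState')].displayStatus" \
        --output tsv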

Long-Term Prevention: Kernel Update

To permanently avoid this kernel panic, it is essential to update your Linux kernel to version 3.10.0-327.10.1 or a later stable release. This specific kernel version, and subsequent releases, contains the necessary fixes to address the faulty locking logic in the SCSI subsystem, ensuring that your VM can gracefully handle host node upgrades without crashing.

The recommended kernel versions are available in the following distributions:

  • Red Hat Enterprise Linux 7.2
  • CentOS 7.2
  • Oracle Linux 7.2 with the Red Hat-compatible kernel

Organizations should prioritize migrating to these or newer kernel versions as part of their regular patch management routine. Running outdated kernels, especially in a dynamic cloud environment, exposes systems to known vulnerabilities and unexpected behaviors.
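
Before committing to an update, you can check which kernel builds your configured repositories actually offer:

    # Every kernel version the enabled repositories provide:
    sudo yum --showduplicates list available kernel
    # Kernels already installed on this VM:
    rpm -q kernel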

General Steps to Update the Kernel on RHEL/CentOS-based Systems:

  1. Backup: Before performing any major system update, especially a kernel update, it is highly recommended to create a snapshot or backup of your VM. This provides a rollback point in case of unforeseen issues.
  2. Connect to VM: Establish an SSH connection to your Azure Linux VM.
  3. Check Current Kernel Version:
    uname -r
    

    This command will show your currently running kernel version.
  4. Update Package Lists and Kernel:
    sudo yum update kernel
    

    This command will download and install the latest available kernel package from your configured repositories. If you wish to update all packages, use sudo yum update.
  5. Reboot the VM: After the kernel update, you must reboot the VM for the new kernel to take effect.
    sudo reboot
    
  6. Verify New Kernel Version: After the reboot, reconnect to your VM via SSH and run uname -r again to confirm that the new kernel version is active.
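
If you need to install a specific kernel rather than the newest one your repositories offer, yum accepts an explicit version. A sketch; the full version string shown is an assumption and should be confirmed against the repository listing described earlier:

    # Hypothetical full version string - verify with 'yum --showduplicates list kernel':
    sudo yum install kernel-3.10.0-327.10.1.el7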

It is advisable to test kernel updates in a staging or development environment before deploying them to production systems. This ensures compatibility with your applications and other system configurations. Regular patch management and staying informed about critical updates are paramount for maintaining robust and secure cloud infrastructure.

More Information and Best Practices

Maintaining a stable and performant Linux environment in Azure extends beyond addressing specific kernel panics. It involves understanding Azure’s ecosystem, adhering to best practices, and leveraging available support resources.

Endorsed Linux Distributions on Azure

Microsoft Azure provides comprehensive support for a wide range of Linux distributions. These “endorsed distributions” are thoroughly tested and optimized for performance and compatibility with the Azure platform, including its virtualization stack. Running an endorsed distribution ensures that you benefit from:

  • Optimized Performance: Kernel configurations, drivers, and platform tooling (such as the Azure Linux Agent) are tailored for Azure’s environment.
  • Reliable Updates: Access to vendor-provided updates that are validated for Azure.
  • Comprehensive Support: Direct support from Microsoft and, in many cases, from the Linux distribution vendor.

For a complete list of currently endorsed Linux distributions, it is recommended to consult the official Azure documentation. Staying within the endorsed ecosystem simplifies management and reduces the likelihood of encountering platform-specific issues.
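
As a starting point, the Azure CLI can enumerate marketplace images from the major Linux publishers (note that --all queries the full catalog and can take a while):

    # RHEL images published by Red Hat:
    az vm image list --publisher RedHat --all --output table
    # CentOS images, published under OpenLogic:
    az vm image list --publisher OpenLogic --all --output table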

Support for Linux and Open Source Technology in Azure

Microsoft is deeply committed to supporting Linux and open-source technologies on Azure. This commitment is reflected in the continuous development of new features, extensive documentation, and a robust support infrastructure. Azure actively collaborates with major Linux vendors to ensure that their distributions run optimally and are well-supported on the platform. This means that when you encounter issues, there’s a clear path to resolution, whether through Microsoft support channels or by leveraging the vibrant open-source community.

Proactive VM Health Management

To minimize disruptions from incidents like kernel panics, consider implementing proactive VM health management strategies:

  • Automated Patch Management: Implement tools or scripts to automate the process of applying kernel and package updates, ensuring your systems are always running the latest stable versions.
  • Monitoring and Alerting: Configure Azure Monitor or other monitoring solutions to track VM health metrics and log data. Set up alerts for critical events, such as VM restarts, high CPU usage, or specific log entries indicating kernel issues.
  • Serial Console Access: Familiarize yourself with Azure’s serial console. It is an invaluable tool for diagnosing unbootable VMs or those experiencing kernel panics, as it provides direct access to the boot messages and kernel logs, even if the VM is otherwise unresponsive. It requires boot diagnostics to be enabled (see the sketch after this list).
  • VM Snapshots and Backups: Regularly back up your VMs and take snapshots before major changes (like kernel upgrades or application deployments). This allows for quick recovery if an update introduces unforeseen problems.
  • Test Environments: Always test significant system changes, especially kernel updates, in non-production environments that mirror your production setup. This helps identify potential incompatibilities or regressions before they impact live services.
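
Serial console access depends on boot diagnostics being enabled for the VM. A minimal sketch with the Azure CLI; older CLI versions may additionally require a --storage account argument:

    az vm boot-diagnostics enable --name YourVMName --resource-group YourResourceGroup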

Illustration of Host-Guest Interaction during Maintenance

The flow below is written in Mermaid syntax:

    graph TD
        A[Azure Host Node] -- Performs --> B{Memory Preserving Update}
        B -- Affects --> C[Running Linux VM]
        C -- "Old kernel (< 3.10.0-327.10.1)" --> D{SCSI Subsystem Error}
        D -- Faulty Locking Logic --> E{Kernel Panic}
        E -- Result --> F[VM Unresponsive]
        F -- Requires --> G[Manual Restart]
        G -- Recommended --> H["Update Kernel to >= 3.10.0-327.10.1"]
        H -- Prevents --> D

This diagram visually represents the chain of events leading from an Azure host update to a VM kernel panic due to an older Linux kernel and highlights the resolution path.

Seek Assistance and Provide Feedback

Should you encounter persistent issues or require assistance with your Azure Linux VMs, several avenues are available for support:

  • Create a Support Request: For direct assistance from Microsoft Azure engineers, creating a formal support request through the Azure portal is the recommended path. This ensures that your specific problem is tracked and addressed by experts.
  • Azure Community Support: The Azure community forums and Q&A platforms are excellent resources for general questions, troubleshooting common issues, and learning from the experiences of other users and Microsoft experts.
  • Azure Feedback Community: Your feedback is invaluable for improving Azure services. If you have suggestions, feature requests, or encounter issues not yet addressed, consider submitting your feedback to the Azure feedback community.

Disclaimer Regarding Third-Party Information:

This article references third-party products (specific Linux distributions and their kernels) that are developed and maintained by entities independent of Microsoft. Microsoft provides this information for convenience and makes no warranty, implied or otherwise, regarding the performance, reliability, or security of these third-party products. Users are encouraged to consult the respective vendor documentation for the most accurate and up-to-date information and support for their specific Linux distributions.


We hope this comprehensive overview helps you understand and mitigate the kernel panic issue on your Azure Linux VMs. Have you encountered similar challenges with host maintenance in cloud environments? What strategies have you found most effective in maintaining the stability of your Linux workloads? Share your experiences and insights in the comments below!
