Troubleshooting Azure Files Performance: A Comprehensive Guide to Optimal Speed

Understanding the factors that influence Azure Files performance is crucial for diagnosing and resolving issues. This guide walks through common performance problems, their likely causes, and effective workarounds to help you optimize your file share operations. Familiarity with Azure Files performance concepts is recommended to get the most out of this troubleshooting information.

Applies to

This guide covers performance issues relevant to different types of Azure file shares.

File share type                               SMB   NFS
Standard file shares (GPv2), LRS/ZRS          Yes   No
Standard file shares (GPv2), GRS/GZRS         Yes   No
Premium file shares (FileStorage), LRS/ZRS    Yes   Yes

General Performance Troubleshooting

Before diving into specific issues, address some common root causes of performance problems. These initial checks can often quickly identify and resolve the source of slowdowns.

You’re running an old operating system

Older operating systems may not have the latest performance enhancements or bug fixes for interacting with Azure Files. Client VMs running Windows 8.1, Windows Server 2012 R2, or outdated Linux distributions/kernels might experience suboptimal performance. Upgrading your client operating system to a newer version is highly recommended. If upgrading is not immediately possible, check for and apply available fixes specifically designed for accessing network shares from older systems.

Your workload is being throttled

Throttling occurs when the activity on your file share exceeds its defined limits for IOPS (I/O operations per second), ingress (data coming into Azure Files), or egress (data leaving Azure Files). When these limits are reached, the Azure Files service queues requests, leading to increased latency and reduced performance for the client. Understanding the scale targets for both standard and premium file shares is essential to determine if your workload fits within the provisioned limits. Often, migrating from standard to premium file shares can alleviate throttling issues due to the significantly higher performance targets available with premium tiers.

Throttling can impact performance at both the storage account level (for standard shares) and the share level (for premium shares). Recognizing the signs of throttling, such as high latency or low throughput, is the first step in addressing this issue. Monitoring metrics is a key way to detect throttling occurrences and understand their impact on your application.

High Latency, Low Throughput, or Low IOPS

These symptoms often indicate limitations or bottlenecks in the interaction between your client and the Azure file share. Diagnosing the specific cause requires examining client configuration, workload patterns, and Azure service metrics.

Cause 1: Share or storage account is being throttled

Monitoring Azure metrics is the most reliable way to confirm if throttling is the cause of performance issues. You can set up alerts in Azure Monitor to receive notifications when your share or storage account approaches or exceeds its performance limits.

Important: For standard storage accounts, throttling limits are applied at the storage account level, affecting all shares within it. For premium file shares, throttling is applied at the individual file share level, providing more granular control and predictable performance.

To check for throttling using Azure metrics:
1. Navigate to your storage account in the Azure portal.
2. Select Metrics under the Monitoring section on the left pane.
3. Choose File as the metric namespace and Transactions as the metric.
4. Add a filter for Response type and look for response types indicating throttling.

Common throttling response types for standard file shares include ClientAccountRequestThrottlingError and ClientAccountBandwidthThrottlingError. For premium file shares, you might see SuccessWithShareEgressThrottling, SuccessWithShareIngressThrottling, SuccessWithShareIopsThrottling, ClientShareEgressThrottlingError, ClientShareIngressThrottlingError, or ClientShareIopsThrottlingError. If Kerberos authentication is used, these may be prefixed with Kerberos.
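
If you prefer the command line, a query along these lines surfaces transactions split by response type so you can spot the throttling values listed above. This is only a sketch: it assumes the Azure CLI is installed, and the subscription, resource group, and storage account names are placeholders you must replace.

# Split file-share Transactions by response type for the last hour (sketch; replace the placeholders).
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>/fileServices/default" \
  --metric Transactions \
  --interval PT1H \
  --aggregation Total \
  --filter "ResponseType eq '*'"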

Solution:
If you are using a premium file share, the most direct solution for throttling is to increase the provisioned size of the file share. Provisioned premium shares offer a guaranteed level of performance (IOPS and throughput) that scales directly with the provisioned capacity. Increasing the size provides higher limits, potentially resolving throttling. Understanding the provisioned model for premium shares helps determine the appropriate size for your workload.
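
A minimal sketch of increasing the provisioned size from the Azure CLI, assuming placeholder resource names; the --quota value is the new provisioned size in GiB.

# Increase a premium file share's provisioned size to 10 TiB (sketch; replace the placeholders).
az storage share-rm update \
  --resource-group <resource-group> \
  --storage-account <storage-account> \
  --name <share-name> \
  --quota 10240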

Cause 2: Metadata or namespace heavy workload

Workloads that involve frequent operations on file and directory metadata, rather than data content, can experience higher latency. Operations like creating, opening, closing, querying information, or listing directories (such as createfile, openfile, closefile, queryinfo, or querydirectory) fall into this category. The latency for these operations can be worse than for simple read/write operations, especially with large directories or deep folder structures.

To identify a metadata-heavy workload, examine your transaction metrics by filtering on API name instead of Response type. Analyzing the counts of different API calls helps determine if metadata operations dominate your workload.

Workarounds:
* Optimize your application to reduce the number of metadata operations if possible. Sometimes, application logic can be adjusted to minimize redundant calls.
* For premium SMB Azure file shares, enable and utilize metadata caching features. Caching frequently accessed metadata on the client or server side can significantly reduce the need for repeated calls to the Azure Files service for the same information.
* Split the data and workload across multiple file shares within the same storage account. Distributing the metadata operations across several shares can help distribute the load and reduce contention on a single share.
* Consider using a Virtual Hard Disk (VHD) stored on the Azure file share. Mount the VHD from the client VM and perform file operations directly against the file system within the mounted VHD. This approach allows metadata operations to be handled locally by the client’s operating system, offering performance similar to local storage for metadata-intensive tasks. This method is suitable for single-writer or multiple-reader, no-writer scenarios. Data inside the VHD is only accessible by mounting the VHD, not via REST API or the portal.

To implement the VHD workaround on Windows:
1. Mount the Azure file share using the storage account key to a network drive (e.g., Z:).
2. Open Disk Management and create a VHD, specifying the mapped network drive location, size, and choosing “Fixed size”.
3. Initialize the new disk that appears after VHD creation.
4. Create a New Simple Volume on the unallocated space.
5. A new drive letter (e.g., E:) will represent the mounted VHD. Perform file operations on this new drive letter for better metadata performance. You can potentially disconnect the initial share mapping (Z:) while keeping the VHD mounted (E:).
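
The same steps can be scripted. The sketch below assumes the Hyper-V PowerShell module is available for New-VHD and Mount-VHD (on machines without it, Mount-DiskImage from the Storage module can attach the file instead); drive letters, paths, and sizes are placeholders.

# Create a fixed-size VHDX on the mapped Azure file share (Z:) and attach it (sketch).
New-VHD -Path "Z:\metadata.vhdx" -SizeBytes 100GB -Fixed
Mount-VHD -Path "Z:\metadata.vhdx"

# Initialize the newly attached raw disk and create a single NTFS volume on it.
Get-Disk | Where-Object PartitionStyle -Eq 'RAW' | Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -Confirm:$false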

On Linux, you can also mount VHDs, typically involving tools like qemu-nbd or loop devices depending on the VHD format and distribution. Consult your specific Linux distribution’s documentation for the correct procedure.
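
For illustration only, on a distribution with the qemu-utils package installed, a VHD stored on an already-mounted share can be attached roughly like this; the device name, share mount point, and VHD path are placeholders, and your distribution may require a different approach.

# Attach a VHD that lives on the mounted Azure file share via the network block device driver (sketch).
sudo modprobe nbd max_part=8
sudo qemu-nbd --connect=/dev/nbd0 /mnt/azurefiles/metadata.vhd

# Mount the first partition of the VHD and perform file operations against it locally.
sudo mkdir -p /mnt/vhd
sudo mount /dev/nbd0p1 /mnt/vhd

# When finished, unmount and detach.
sudo umount /mnt/vhd
sudo qemu-nbd --disconnect /dev/nbd0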

Cause 3: Single-threaded application

The maximum achievable throughput from a client VM can be limited by the number of threads the application uses to interact with the file share. If your application is single-threaded, it can only utilize a single connection and potentially a single CPU core for file operations, even if the provisioned share size supports higher performance. This limits the effective IOPS and throughput well below the share’s capabilities.

Solution:
Increase the parallelism of your application. Modify the application to use multiple threads for file operations, allowing concurrent requests to the file share. If the application itself cannot be modified, use file transfer tools designed for parallelism, such as AzCopy or RoboCopy on Windows, or the parallel command on Linux. These tools can leverage multiple connections and threads to transfer data much faster than single-threaded copy operations.
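
For example, RoboCopy's /MT switch runs a multithreaded copy. A sketch with placeholder paths, where the mapped drive Z: stands in for the Azure file share:

# Copy a directory tree to the mapped Azure file share with 32 threads, retrying failures briefly (sketch).
robocopy "C:\data" "Z:\data" /E /MT:32 /R:2 /W:5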

Cause 4: Number of SMB channels exceeds four

SMB Multichannel is a feature that allows SMB3 clients to establish multiple network connections to an SMB share, significantly improving performance, especially for large files and high throughput workloads. However, on some systems, configuring the number of channels per network interface card (NIC) can inadvertently lead to too many channels being created, which can sometimes degrade performance rather than improve it. If the total number of SMB channels exceeds four, you might observe unexpected performance issues.

You can check the current SMB client configuration on Windows using the PowerShell cmdlet Get-SmbClientConfiguration. Look at the ConnectionCountPerRssNetworkInterface setting.

Solution:
Limit the number of SMB channels per NIC using the Set-SmbClientConfiguration PowerShell cmdlet. For example, if your VM has two network interfaces, setting Set-SmbClientConfiguration -ConnectionCountPerRssNetworkInterface 2 would cap the total channels at four (2 NICs * 2 channels/NIC). After changing this setting, it’s crucial to unmount and remount the file share to ensure the new configuration takes effect. Wait at least 60 seconds after unmounting before remounting.
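
A sketch of the check-and-set sequence in elevated PowerShell; the value 2 assumes a VM with two NICs, as in the example above.

# Inspect the current per-NIC connection count.
Get-SmbClientConfiguration | Select-Object ConnectionCountPerRssNetworkInterface

# Cap it at 2 connections per NIC (2 NICs x 2 = 4 channels total), then unmount and remount the share.
Set-SmbClientConfiguration -ConnectionCountPerRssNetworkInterface 2 -Confirm:$false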

Cause 5: Read-ahead size is too small (NFS only)

For NFS mounts on Linux, the read_ahead_kb kernel parameter controls how much data is proactively read from the share when a sequential read is detected. A small default value, such as the 128 KiB default in Linux kernel 5.4 and later, can limit read throughput, particularly when reading large files sequentially. Increasing this value allows the client to fetch larger chunks of data at once, improving performance.

Solution:
Increase the read_ahead_kb kernel parameter for your NFS mount. A recommended value is 15 mebibytes (MiB), which is 15360 KiB. You can make this setting persistent by adding a rule to the udev device manager. Create a file named /etc/udev/rules.d/99-nfs.rules and add the rule that targets NFS block devices (SUBSYSTEM=="bdi") and sets the read_ahead_kb attribute. After saving the file, apply the new rule by running sudo udevadm control --reload. This ensures the setting is applied automatically for NFS mounts.
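
The following is one commonly used form of that rule; treat it as a sketch and verify that the awk path and /proc/fs/nfsfs/volumes exist on your distribution before relying on it.

# Write a udev rule that raises read_ahead_kb to 15360 KiB (15 MiB) for NFS backing devices (sketch).
sudo tee /etc/udev/rules.d/99-nfs.rules > /dev/null <<'EOF'
SUBSYSTEM=="bdi" \
, ACTION=="add" \
, PROGRAM="/usr/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes" \
, ATTR{read_ahead_kb}="15360"
EOF

# Reload udev rules so the setting applies to subsequent NFS mounts.
sudo udevadm control --reload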

Very High Latency for Requests

Significant latency, noticeably impacting the responsiveness of file operations, often points to fundamental network path or client-side processing delays rather than just throttling or workload patterns.

Cause 1: Geographical distance or network issues

A primary cause of very high latency is the physical distance between the client VM and the Azure region where the file share is hosted. Data must travel across networks, and longer distances inherently introduce latency. Other network issues, such as congestion, routing problems, or insufficient bandwidth on the client’s network path to Azure, can also contribute. Client-side processing delays, where the client VM is slow to handle responses, can also manifest as high end-to-end latency.

Solution:
* Deploy your client VMs in the same Azure region as your Azure file share whenever possible. This minimizes network hops and latency between the client and the storage service.
* Utilize Azure Monitor metrics to diagnose network and client latency. Compare the SuccessE2ELatency (end-to-end latency measured by the client) with SuccessServerLatency (latency measured by the Azure Files service). A significant difference between these two metrics indicates that the majority of the delay is occurring outside the Azure Files service, likely in the network or on the client side. Further network diagnostics from the client VM can help pinpoint specific network path issues.
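
To pull both latency metrics from the command line, something along these lines works; it is a sketch with placeholder IDs, and the gap between the two averages approximates network plus client-side time.

# Average end-to-end vs. server latency for file transactions over the last hour (sketch; replace the placeholders).
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>/fileServices/default" \
  --metric SuccessE2ELatency SuccessServerLatency \
  --interval PT1H \
  --aggregation Average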

Client Unable to Achieve Maximum Throughput Supported by the Network

Even with sufficient provisioned performance on the Azure file share and a capable network connection, the client may not utilize the full available throughput. This can happen due to client-side limitations or specific protocol implementations.

Cause 1: Lack of SMB Multichannel support on standard file shares

Standard Azure file shares currently support only a single SMB channel per connection from a client VM. This means the maximum throughput achievable for a single file stream or operation is limited by the performance of a single network connection and potentially a single CPU core on the client VM servicing that connection. This single-channel limitation can become a bottleneck for high-throughput scenarios.

Workaround:
* For workloads requiring high throughput from a single client, consider using premium file shares, which support SMB Multichannel. Enabling SMB Multichannel on premium shares allows multiple connections and can significantly boost throughput.
* Obtain a client VM with a CPU that has higher clock speeds or better single-core performance. A more powerful single core can process data faster within the constraints of a single connection.
* Distribute the workload across multiple client VMs. By accessing the share from several VMs concurrently, you can aggregate the throughput from each single connection, achieving higher overall performance.
* If your application allows, use REST APIs for data transfer. REST APIs can often achieve higher throughput than SMB on standard shares by using different connection mechanisms.
* For NFS Azure file shares, utilize the nconnect mount option. nconnect allows establishing multiple TCP connections between the client and the NFS server (Azure Files), similar to SMB Multichannel, which can improve performance for parallel I/O. Consult documentation on improving NFS performance with nconnect for details.
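
As a sketch, an NFS Azure file share mounted with four TCP connections looks like this; the storage account, share name, and mount point are placeholders, and NFS Azure file shares use NFS 4.1.

# Mount an NFS Azure file share with nconnect=4 (sketch; replace the placeholders).
sudo mkdir -p /mnt/myshare
sudo mount -t nfs <storage-account>.file.core.windows.net:/<storage-account>/<share-name> /mnt/myshare \
  -o vers=4,minorversion=1,sec=sys,nconnect=4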

Slow Performance on an Azure file share Mounted on a Linux VM

Linux clients can sometimes exhibit specific performance challenges when mounting Azure file shares (SMB/CIFS).

Cause 1: Disabled caching

Client-side caching can significantly improve performance for workloads that repeatedly access the same files or directory structures. If caching is disabled on your Linux mount, performance may suffer due to every access requiring a fresh round trip to the Azure file share. The cache= mount option controls this behavior. cache=none explicitly disables caching.

Additionally, the serverino mount option, while sometimes necessary for compatibility, can cause the ls command to perform a stat operation for every entry in a directory listing. For large directories, this results in a large number of metadata operations and can dramatically slow down directory traversals.

Solution for cause 1:
Check your mount options, typically in /etc/fstab or by running sudo mount | grep cifs. If cache=none is present, remount the share using the default options or explicitly include cache=strict. The cache=strict setting generally provides a good balance between performance and consistency. Ensure serverino is not used if you experience slow directory listings, unless it is strictly required for application compatibility. The default mounting procedure described in Azure documentation usually provides recommended options for optimal performance.
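
For reference, a minimal SMB mount sketch with explicit caching and attribute-cache options; the storage account, share name, and credentials file are placeholders, and the options your environment needs may differ.

# Mount an SMB Azure file share with strict caching and a 30-second attribute cache (sketch).
sudo mkdir -p /mnt/myshare
sudo mount -t cifs //<storage-account>.file.core.windows.net/<share-name> /mnt/myshare \
  -o credentials=/etc/smbcredentials/<storage-account>.cred,vers=3.1.1,dir_mode=0777,file_mode=0777,cache=strict,actimeo=30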

Cause 2: Throttling

Similar to Windows clients, Linux clients can also experience performance degradation if the file share or storage account is being throttled. Exceeding the IOPS, ingress, or egress limits will queue requests and increase latency.

Solution for cause 2:
Monitor Azure Storage metrics in Azure Monitor to identify throttling events, as described earlier. If throttling is occurring, verify that your application’s workload stays within the Azure Files scale targets for your chosen share tier (standard or premium). If you are using standard shares and consistently hit limits, upgrading to premium shares is the most effective solution to obtain higher scale targets.

Cause 3: Azure file share reaches capacity

As a file share approaches its maximum capacity, various operations, including file writes and metadata updates necessary for managing space, can become slower. The file system needs to work harder to find available space and update file allocation tables, which can increase latency.

Workaround:
Identify and manage large files and directories. Mount the root of the share and use command-line tools like du (disk usage) combined with sort and head to find the largest consumers of space.

cd /path/to/mount/point
du -ah --max-depth=1 | sort -rh | head -n 20

This command lists the top 20 largest items (files or directories) at the current level. Analyze this output to determine what is consuming space and either archive, delete, or move large data as needed. If you cannot mount the root, use Azure Storage Explorer or other third-party tools that provide graphical interfaces to browse share content and sizes.

Throughput on Linux clients is lower than that of Windows clients

Even when both clients use the same share and network conditions, Linux clients might show lower maximum throughput compared to Windows clients. This is a known characteristic of the SMB client implementation in the Linux kernel compared to the highly optimized implementation in Windows.

Cause 1: Linux SMB client implementation

The current state of the SMB client in the Linux kernel may not be as optimized for high throughput as the Windows implementation, particularly regarding how it handles concurrent I/O or leverages underlying network capabilities.

Workaround:
* Distribute the workload across multiple Linux VMs. Similar to the single-threaded application workaround, running clients on multiple VMs aggregates the total throughput.
* Use multiple mount points on the same VM with the nosharesock mount option, and spread the load across these mount points. Because each mount then gets its own TCP connection, this can help work around some client-side limitations (see the mount sketch after this list).
* Consider mounting with the nostrictsync option. This option prevents the client from forcing a server-side SMB flush (FLUSH command) on every fsync call. While it won’t affect data consistency for typical file write scenarios on Azure Files, it might cause minor delays in directory listing updates (ls -l) showing the most recent metadata. Use stat for the most up-to-date file information.
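
Building on the mount shown earlier, the additional options look like this; again a sketch with placeholder names, and worth benchmarking against your own workload before adopting.

# nosharesock gives this mount its own TCP connection; nostrictsync skips the per-fsync SMB flush (sketch).
sudo mount -t cifs //<storage-account>.file.core.windows.net/<share-name> /mnt/myshare2 \
  -o credentials=/etc/smbcredentials/<storage-account>.cred,vers=3.1.1,nosharesock,nostrictsync,actimeo=30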

High Latencies for Metadata-Heavy Workloads Involving Extensive Open/Close Operations

Workloads that repeatedly open and close file handles on the same directory within a short timeframe can experience high latency. This pattern is often seen in applications that poll directories frequently or perform many small, distinct operations requiring opening and closing files.

Cause 1: Lack of support for directory leases

Azure Files currently does not fully support SMB directory leases. Directory leases allow clients to cache directory contents and metadata locally for a period, reducing the need for repeated server calls for operations like opening or querying directories. Without directory lease support, every directory operation requires a trip to the server, increasing latency for rapid, repeated operations on the same directory.

Workaround:
* Modify the application to reduce the frequency of opening and closing handles on the same directory if possible. Restructure the workflow to keep handles open for longer durations or process files in batches.
* On Linux VMs, increase the directory entry cache timeout using the actimeo=<sec> mount option. The default timeout is often 1 second. Setting a larger value (e.g., 30 seconds) keeps directory entries cached longer, reducing server calls for subsequent operations on those entries.
* For CentOS Linux or RHEL VMs, upgrading to versions 8.2 or later, or for other distributions, upgrading the kernel to 5.0 or later, may include improvements in SMB client caching behavior that partially mitigate this issue.

Slow Enumeration of Files and Folders

Listing the contents of large directories can be slow, especially if the client machine has limited resources.

This problem can occur if the client machine lacks sufficient cache memory or processing power to handle the large volume of directory entries being enumerated. The client needs to process and potentially cache metadata for many files and folders during the listing process.

Slow File Copying to and from Azure File Shares

Directly copying files using standard operating system copy commands (like Windows Explorer copy/paste or Linux cp) might result in slow performance, particularly for large files or large numbers of small files.

Performance for file copies can be significantly impacted by factors like network latency, single-threaded copy processes, and the overhead of processing many small files individually. For optimal performance during file transfers to or from Azure Files, especially for large-scale data movement, using an I/O size of 1 MiB is generally recommended, though this is often handled by optimized tools.

Solution:
Use multi-threaded copy tools designed for cloud storage. Tools like AzCopy, RoboCopy (Windows), or rclone (cross-platform) are highly recommended. These tools can leverage multiple threads to perform copy operations in parallel, drastically increasing throughput compared to single-threaded methods. They also often include features like resuming failed transfers and handling permissions.
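
For example, AzCopy can upload an entire directory tree in parallel. The sketch below assumes a SAS token with write permission on the share; the local path, account, share, and token are all placeholders.

# Recursively upload a local directory to an Azure file share with AzCopy (sketch; replace the placeholders).
azcopy copy "<local-path>" "https://<storage-account>.file.core.windows.net/<share-name>?<SAS-token>" --recursive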

Excessive DirectoryOpen/DirectoryClose calls

If performance issues coincide with a large number of DirectoryOpen and DirectoryClose API calls reported in metrics, particularly when the client application doesn’t seem to be explicitly performing these actions frequently, third-party software might be interfering.

Cause 1: Antivirus software

Antivirus or anti-malware software installed on the client VM can sometimes excessively scan or monitor file system activity. This can involve opening and closing directory handles repeatedly as the software traverses the file share structure, generating a high volume of metadata operations and impacting performance.

Workaround:
Ensure your antivirus software is up-to-date. Specific updates may contain fixes for this behavior when scanning network shares. For example, Microsoft provided a platform update for Windows Defender that addressed excessive directory calls. Configure your antivirus software to exclude scanning of network drives or specific paths on the Azure file share if possible, after carefully evaluating the security implications.

SMB Multichannel Isn’t Being Triggered

You might have configured your system and share for SMB Multichannel (available on premium shares), but metrics or performance benchmarks indicate it’s not active.

Cause 1: Configuration changes require a refresh

Changes to SMB client or server configuration settings, including those related to Multichannel, require the existing SMB connection to be reset for the new settings to take effect. Simply changing the settings on the client or storage account is not enough.

Solution:
After modifying any SMB Multichannel configuration settings on either the Windows SMB client or the Azure storage account, you must explicitly disconnect (unmount) the Azure file share. Wait approximately 60 seconds to ensure the connection is fully terminated. Then, remount the share. This forces the client to establish a new connection using the updated configuration, which should trigger SMB Multichannel if all requirements are met. Note that on Windows client operating systems, generating some I/O load (e.g., copying a file) with a sufficient queue depth (like QD=8) might be needed to trigger Multichannel activation, whereas server OS versions typically trigger it with lower queue depth (QD=1).
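
After remounting and generating some load, you can confirm on a Windows client whether multiple channels were actually established using the standard SMB PowerShell cmdlets; output varies with your NICs and configuration.

# More than one connection per server/interface pair indicates SMB Multichannel is active.
Get-SmbMultichannelConnection

# Per-session detail for the mounted share, including the negotiated SMB dialect.
Get-SmbConnection | Select-Object ServerName, ShareName, NumOpens, Dialect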

Slow performance when unzipping files

Decompressing archive files directly on an Azure file share can be significantly slower than performing the same operation on a local disk.

This is often due to the nature of decompression operations. Unzipping involves numerous small read operations, file creations, metadata updates, and potentially seeking within the archive file. These operations translate into a high volume of metadata-intensive requests to the file share. As discussed earlier, metadata operations on network shares generally have higher latency than on local disks.

Solution:
For the best performance when dealing with compressed archives, copy the archive file from the Azure file share to your local disk or a temporary disk attached to your client VM. Perform the decompression operation locally. Once unzipped, use a multi-threaded copy tool like RoboCopy or AzCopy to copy the extracted files back to the Azure file share. This leverages the speed of local decompression and the efficiency of parallel copy tools for the final transfer.
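
A sketch of that workflow in PowerShell, with placeholder drive letters and paths and the same RoboCopy switches used earlier:

# 1. Copy the archive from the Azure file share (Z:) to a local or temporary disk (D:).
Copy-Item "Z:\archives\data.zip" "D:\temp\data.zip"

# 2. Extract locally, where the many small metadata operations are cheap.
Expand-Archive -Path "D:\temp\data.zip" -DestinationPath "D:\temp\data"

# 3. Push the extracted tree back to the share with a multithreaded copy.
robocopy "D:\temp\data" "Z:\data" /E /MT:32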

High latency on web sites hosted on file shares

Hosting web site content directly on an Azure file share and serving it via a web server (like IIS) on a VM can sometimes lead to high latency.

Cause 1: Excessive file change notifications

Web servers often monitor content directories for changes using mechanisms like file change notifications (ReadDirectoryChangesW on Windows). If the web site or application has a deep directory structure, or if the content changes frequently, the volume of change notifications sent from the Azure file share to the web server VM can become very high. Each notification consumes system resources on both the service and the client side. A high volume of notifications can contribute to throttling on the file share and increase overall client-side latency for web requests.

To verify if excessive notifications are an issue, check Azure Metrics for your storage account. Look at the “Transactions” metric filtered by “ResponseType”. High counts of SuccessWithThrottling or ClientThrottlingError could indicate that the load, potentially including notifications, is causing throttling.

Solution:
* Disable file change notification: If your web application doesn’t rely on dynamic content updates that require immediate change notifications, disable this feature in your web server configuration. For IIS, this can often be controlled via settings like FCNMode or by adjusting registry keys like ConfigPollMilliSeconds. Setting ConfigPollMilliSeconds to 0 in the registry (HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\W3SVC\Parameters) and restarting the W3WP process can effectively disable polling.
* Increase polling interval: If notifications are necessary but don’t need to be instantaneous, increase the polling interval (ConfigPollMilliSeconds) to a higher value (e.g., 10 or 30 minutes). This reduces the frequency of checks and the volume of notifications.
* Limit notification scope: For IIS, if your virtual directory maps to a physical directory with many subdirectories, limit the scope of configuration monitoring. Setting allowSubDirConfig to false on the virtual directory configuration in Web.config prevents IIS from watching child directories for configuration changes, which can significantly reduce the number of change notifications issued from the file share.
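
Sketches of the registry and IIS changes described above; the site name and values are placeholders, and you should evaluate the impact on your application before applying them.

# Disable IIS configuration-change polling by setting ConfigPollMilliSeconds to 0, then restart the W3WP process.
Set-ItemProperty -Path "HKLM:\System\CurrentControlSet\Services\W3SVC\Parameters" -Name ConfigPollMilliSeconds -Type DWord -Value 0

# Stop IIS from watching subdirectories of the virtual directory for configuration changes.
& "$env:windir\system32\inetsrv\appcmd.exe" set vdir "Default Web Site/" -allowSubDirConfig:false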

Conclusion

Troubleshooting performance issues in Azure Files involves systematically checking configurations, monitoring metrics, understanding workload patterns, and applying appropriate workarounds based on the identified root cause. By addressing common bottlenecks like throttling, metadata operations, client limitations, and application behavior, you can significantly improve the speed and responsiveness of your Azure file shares.

What are your experiences troubleshooting Azure Files performance? Do you have any other tips or workarounds that have worked for you? Share your insights in the comments below!
