You are viewing documentation for Kubernetes version: v1.24

Kubernetes v1.24 documentation is no longer actively maintained. The version you are currently viewing is a static snapshot. For up-to-date information, see the latest version.

New in Kubernetes v1.22: alpha support for using swap memory

Monday, August 09, 2021

Author: Elana Hashman (Red Hat)

The 1.22 release introduced alpha support for configuring swap memory usage for Kubernetes workloads on a per-node basis.

In prior releases, Kubernetes did not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes' earlier design, swap support was considered out of scope, and a kubelet would by default fail to start if swap was detected on a node.

However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap, including improved node stability, better support for applications with high memory overhead but smaller working sets, the use of memory-constrained devices, and memory flexibility.

Hence, over the past two releases, SIG Node has been working to gather appropriate use cases and feedback, and propose a design for adding swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap. The alpha graduation of swap memory support for nodes is our first milestone towards this goal!

How does it work?

There are a number of possible ways that one could envision swap use on a node. To keep the scope manageable for this initial implementation, when swap is already provisioned and available on a node, we have proposed the kubelet should be able to be configured such that:

It can start with swap on.
It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.
You can configure the kubelet to specify swap utilization for the entire node.

Swap configuration on a node is exposed to a cluster admin via the memorySwap in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.

This is possible through the addition of a memory_swap_limit_in_bytes field to the container runtime interface (CRI). The kubelet's config will control how much swap memory the kubelet instructs the container runtime to allocate to each container via the CRI. The container runtime will then write the swap settings to the container level cgroup.

How do I use it?

On a node where swap memory is already provisioned, Kubernetes use of swap on a node can be enabled by enabling the NodeSwap feature gate on the kubelet, and disabling the failSwapOn configuration setting or the --fail-swap-on command line flag.

You can also optionally configure memorySwap.swapBehavior in order to specify how a node will use swap memory. For example,

memorySwap:
  swapBehavior: LimitedSwap

The available configuration options for swapBehavior are:

LimitedSwap (default): Kubernetes workloads are limited in how much swap they can use. Workloads on the node not managed by Kubernetes can still swap.
UnlimitedSwap: Kubernetes workloads can use as much swap memory as they request, up to the system limit.

If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the LimitedSwap setting.

The behaviour of the LimitedSwap setting depends if the node is running with v1 or v2 of control groups (also known as "cgroups"):

cgroups v1: Kubernetes workloads can use any combination of memory and swap, up to the pod's memory limit, if set.
cgroups v2: Kubernetes workloads cannot use swap memory.

Caveats

Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.

The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe.

Hence, we do not recommend the use of swap for certain performance-constrained workloads or environments. Cluster administrators and developers should benchmark their nodes and applications before using swap in production scenarios, and we need your help with that!

Looking ahead

The Kubernetes 1.22 release introduces alpha support for swap memory on nodes, and we will continue to work towards beta graduation in the 1.23 release. This will include:

Adding support for controlling swap consumption at the Pod level via cgroups.
- This will include the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
Determining a set of metrics for node QoS in order to evaluate the performance and stability of nodes with and without swap enabled.
Collecting feedback from test user cases.
- We will consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.

How can I learn more?

You can review the current documentation on the Kubernetes website.

For more information, and to assist with testing and provide feedback, please see KEP-2400 and its design proposal.

How do I get involved?

Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. Feel free to reach out to me, Elana Hashman (@ehashman on Slack and GitHub) if you'd like to help.

←Previous