
Control Group version 2 by hand

We have had Control Group v2 since 2016, but I had trouble finding good documentation on how to use it. Most tutorials and blog posts only cover v1 or are specific to systemd[1]. The kernel documentation is a great reference and the basis for this post, but it is not always easy to follow. I will give you a few short examples on how to use it. I will not explain everything, but hopefully enough to give you an idea and help you understand the reference better.

Your interface to cgroups is a special file-system. Most distributions have cgroup v1 mounted at /sys/fs/cgroup and cgroup v2 at /sys/fs/cgroup/unified. Some distributions removed v1 support by default and have v2 mounted at /sys/fs/cgroup. You can find out where cgroup v2 is mounted with mount | grep cgroup2. If it is not mounted, you can do it yourself with mount -t cgroup2 none /sys/fs/cgroup/unified. You can theoretically mount it anywhere you like, but tools expect it in the path mentioned above. Going forward I will assume you are in a terminal in the cgroup v2 directory.
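
Putting that together (assuming the conventional paths mentioned above):

mount | grep cgroup2                          # where, if anywhere, is cgroup v2 mounted?
mount -t cgroup2 none /sys/fs/cgroup/unified  # mount it yourself if it is missing
cd /sys/fs/cgroup/unified                     # the examples below are run from here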

Linux distributions should have all cgroup options compiled in. If you built the kernel yourself, or files are missing from /sys/fs/cgroup, you can check whether anything important is missing with zgrep CGROUP /proc/config.gz | grep -Ev 'DEBUG|=y'.

Note
All examples on this page are tested with kernel 5.10.

Enabling controllers

There are currently 8 controllers[2]: cpu, memory, io, pids, cpuset, rdma, hugetlb and perf_event. You can find out which are available with cat cgroup.controllers. perf_event is enabled automatically; all others have to be enabled explicitly, for example with echo "+cpu +memory" > cgroup.subtree_control. You can disable a controller by using a - instead of a +.
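
For example (what cgroup.controllers lists depends on your kernel and on what the parent passed down):

cat cgroup.controllers                        # e.g. "cpuset cpu io memory pids ..."
echo "+cpu +memory" > cgroup.subtree_control  # enable cpu and memory for child groups
echo "-memory" > cgroup.subtree_control       # disable memory again
cat cgroup.subtree_control                    # lists what is currently enabled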

Resources are distributed top-down and a cgroup can further distribute a resource only if the resource has been distributed to it from the parent. This means that all non-root “cgroup.subtree_control” files can only contain controllers which are enabled in the parent’s “cgroup.subtree_control” file. A controller can be enabled only if the parent has the controller enabled and a controller can’t be disabled if one or more children have it enabled. […] Non-root cgroups can distribute domain resources to their children only when they don’t have any processes of their own. In other words, only domain cgroups which don’t contain any processes can have domain controllers enabled in their “cgroup.subtree_control” files.[3]

Kernel documentation on Control Group v2

We will keep it simple by only setting controllers globally in our root cgroup.
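
Still, to see the top-down rule from the quote in action, here is a quick sketch with a throwaway child group (assuming a fresh hierarchy where cpu is not yet enabled anywhere):

mkdir parent
echo "+cpu" > parent/cgroup.subtree_control  # fails: the root has not passed cpu down yet
echo "+cpu" > cgroup.subtree_control         # enable cpu for the root's children first
echo "+cpu" > parent/cgroup.subtree_control  # now parent may pass cpu on to its children
rmdir parent                                 # clean up; the next section enables +cpu again anyway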

Controlling CPU usage

This control group will use the cpu controller[4]. Every process in this group will be deprioritized, and all processes together can only use the power of 2 CPU cores.

echo "+cpu" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "50" > cpu.weight
echo "200000 100000" > cpu.max

Try adding your current shell to the group with echo "$$" > cgroup.procs. All processes you start from this shell are now in the same cgroup. But what does the example do, exactly?

  • cpu.weight is the relative amount of CPU cycles the cgroup gets under load. The CPU cycles are distributed by adding up the weights of all active children and giving each the fraction matching the ratio of its weight against the sum.[5] It has a range from 1 to 10,000. If one cgroup has a weight of 3,000 and the only other active sibling has a weight of 7,000, the former will get 30% and the latter 70% of the CPU cycles (see the sketch below). The default is 100.

  • cpu.max sets the “maximum bandwidth limit”. We told the kernel that the processes may use at most 200,000 µs of CPU time every 100,000 µs, meaning they can use the power of up to 2 cores.

Try running for process in $(seq 1 4); do (cat /dev/urandom > /dev/null &); done. You will see that the CPU usage of each process hovers at around 50% instead of 100%: four processes are sharing the power of two cores. Because they were started from your shell, they inherited its cgroup and automatically appeared in cgroup.procs.
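
And here is the sketch for cpu.weight: create two sibling groups (the names loud and quiet are made up) and move a busy loop into each. Note that weights only matter under contention; with more idle cores than busy loops, each loop will simply get a full core.

cd ..                            # back to the cgroup v2 root
mkdir loud quiet
echo "7000" > loud/cpu.weight
echo "3000" > quiet/cpu.weight
cat /dev/urandom > /dev/null &
echo "$!" > loud/cgroup.procs    # move the first busy loop into loud
cat /dev/urandom > /dev/null &
echo "$!" > quiet/cgroup.procs   # move the second one into quiet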

Tip
You can add a cgroup column to htop by pressing F2 and then navigating to “Columns”. Select “CGROUP” in “Available Columns” and press Enter.
Tip
cpu.weight.nice is an alternate interface to cpu.weight that uses the same values as nice and has a range from -20 to 19.
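
For example, the equivalent of renicing the whole group to -10:

echo "-10" > cpu.weight.nice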

Controlling CPU core usage

This control group will use the cpuset controller[6] to restrict the processes to the CPU cores 0 and 3.

echo "+cpuset" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "0,3" > cpuset.cpus
echo "$$" > cgroup.procs

cpuset.cpus takes comma-separated numbers or ranges. For example: “0-4,6,8-10”.
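
You can verify the restriction from inside the group (taskset is part of util-linux):

cat cpuset.cpus.effective  # the CPUs the group can actually use, e.g. "0,3"
taskset -cp $$             # the shell's affinity list should match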

Tip
You can add a CPU column to htop by pressing F2 and then navigating to “Columns”. Select “PROCESSOR” in “Available Columns” and press Enter.

Controlling memory usage

This control group will use the memory controller[7]. All processes together can use at most 1 GiB of memory.

echo "+memory" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "512M" > memory.high
echo "1G" > memory.max
echo "$$" > cgroup.procs

  • If the memory usage of a cgroup goes over memory.high, the kernel throttles allocations by forcing them into direct reclaim to work off the excess.

  • memory.max is the hard limit. If the cgroup reaches that limit and the memory usage can not be reduced, the OOM killer is invoked in the cgroup.
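
To see the hard limit in action, make the group allocate more than 1 GiB (tail buffers everything it reads before it can print the last lines, so it is an easy way to eat memory):

head -c 2G /dev/zero | tail  # slows down past memory.high, killed at memory.max
cat memory.events            # the oom_kill counter should have increased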

Controlling Input/Output usage

This control group will increase the IO priority and limit the write speed to 2 MiB a second using the io controller[8]. IO limits are set per device. You need to specify the major and minor device numbers of the device (not partition) you want to limit (in my case it is “8:0” for /dev/sda). Run lsblk or cat /proc/partitions to find them.

echo "+io" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "default 500" > io.weight
echo "8:0 wbps=$((2 * 1024 * 1024)) rbps=max" > io.max
echo "$$" > cgroup.procs

  • io.weight specifies the relative amount of IO time the cgroup can use in relation to its siblings and has a range from 1 to 10,000.[5] The priority can be overridden for individual devices with the major:minor syntax, like “8:0 90”. The default value is 100.

  • io.max limits bytes per second (rbps/wbps) and/or IO operations per second (riops/wiops).

Try running dd if=/dev/zero bs=1M count=100 of=test.img oflag=direct[9]. You will see that the speed is around 2 MiB a second.
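
You can also watch the limit reflected in the group's counters while dd runs:

cat io.stat  # per-device counters, e.g. "8:0 rbytes=... wbytes=... rios=... wios=..."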

Tip
Kernel 5.14 introduced blkio.prio.class[10] that controls the IO priority. It seems to work similarly to ionice.
Important
Weight based distribution (io.weight) is available only if cfq-iosched is in use and absolute bandwidth or IOPS limit distribution (io.max) is not available for blk-mq devices.[8] The CFQ scheduler was removed in kernel 5.0.[11]

Controlling process numbers

This control group will limit the number of processes to 10 using the process number controller[12].

echo "+pids" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo 10 > pids.max
echo "$$" > cgroup.procs

Try running for process in $(seq 1 10); do ((sleep 2 && echo ${process}) &); done. You will get error messages from your shell saying that it cannot fork another process (your shell itself counts against the limit too).
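
The controller keeps counters you can inspect afterwards:

cat pids.current  # processes currently in the cgroup
cat pids.events   # "max N": how many forks failed because of the limit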