Control Group - Resource Distribution

April 24, 2022

⊕ Previous post: hierarchical organization
Next post: no internal process constraint

A system has hundreds of resources that processes may use (and exhaust!). Controlling them is not trivial and requires precise intervention on the part of the kernel.

We’ll use the simplest resource to understand: the number of process IDs, or pids.

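While CPU and memory are the resources most commonly exhausted, the process ID space is not infinite either: a pid is an integer that by default ranges from 1 to \(2^{15}\) (the kernel's default for /proc/sys/kernel/pid_max is 32768), although 64-bit systems can raise that bound up to \(2^{22}\).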
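The actual upper bound on a given machine can be read from the kernel. A minimal sketch, assuming a Linux system with /proc mounted; the helper name pid_limit and its optional path argument are made up for illustration:

```shell
# pid_limit [FILE]: print the upper bound of the pid space.
# FILE defaults to the kernel's pid_max knob; the optional argument only
# exists so the helper can be tried against a copy of that file.
pid_limit() {
    cat "${1:-/proc/sys/kernel/pid_max}"
}
```

Calling pid_limit with no arguments prints the value for the running kernel.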

A malicious or badly bugged program can trivially consume all the available pids by spawning thousands of processes and threads, long before any other resource is exhausted. This is the so-called fork bomb.

Once you run out of pids, no new process can be started, leaving the system unusable.

In this post we will explore the rules of resource distribution in a cgroup hierarchy and, in particular, how to keep fork bombs from exploding.

Limit the resource (pid count) on a cgroup

⊕ cgroup hierarchy after adding the shell process to test/.
Boxes represent cgroups and the arrows between them represent nesting (here, test/ is inside the root / cgroup). Circles are processes and the small purple dots are the resource controller’s settings (the pid limit in this case).

Let’s create a test/ cgroup:

$ cd /sys/fs/cgroup
$ mkdir test

Let’s add ourselves to test/: we add the shell process by writing its process ID into the cgroup.procs file.

$ echo $$ > test/cgroup.procs

Now we can query how many pids are being used in the cgroup. The expected count is 2: one for the shell that we added and another for the cat program that is reading pids.current:

$ cat test/pids.current
2

⊕ cgroup hierarchy during the execution of echo "Hello" | cat. Both echo and cat are children of the shell, so they inherit the parent’s cgroup by default, but due to the limit of pids ≤ 2 the spawn of cat fails (the fork syscall fails) and the process never comes to life.

The +pids controller allows us to set a maximum: once it is reached, calls to fork or clone will fail because they cannot reserve another pid.

$ echo 2 > test/pids.max

$ echo "Hello"
Hello

$ echo "Hello" | cat        # byexample: +timeout=30
<...>fork: Resource temporarily unavailable
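Usage and limit can be read side by side to see how close a cgroup is to refusing forks. A small sketch, not part of the original session; pids_usage is a hypothetical helper name and it only reads the two files shown above:

```shell
# pids_usage CGROUP_DIR: print "<current>/<max>" for one cgroup.
# The pids.* files are absent when the pids controller is not enabled
# for the cgroup, so fall back to "n/a" in that case.
pids_usage() {
    dir=$1
    if [ -r "$dir/pids.current" ] && [ -r "$dir/pids.max" ]; then
        printf '%s/%s\n' "$(cat "$dir/pids.current")" "$(cat "$dir/pids.max")"
    else
        echo "n/a"
    fi
}
```

It would be called as pids_usage test from /sys/fs/cgroup. Keep in mind that the two cat processes it forks are themselves charged to the caller's cgroup while they run.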

Resource distribution over a subtree

⊕ Notice how test/cg1/ and test/cg2/ are (sub) cgroups of test/.

We can divide test/ into sub cgroups; however, there is no pids.max there by default:

$ mkdir test/cg1
$ mkdir test/cg2

$ ls -1 test/cg1/pids*
<...>No such file or directory

$ ls -1 test/cg2/pids*
<...>No such file or directory

A new (sub) cgroup does not inherit the controllers of its parent. The parent must explicitly select which controllers its children will administer.

Nevertheless, the sub cgroups are still subject to the limits that the controller enforces on the parent.

⊕ Despite +pids not being enabled on test/cg1/, the limit imposed by test/ still applies.
Notice also that writing the shell’s pid to test/cg1/cgroup.procs literally moved it from test/ to test/cg1/: each process belongs to one and only one cgroup.

To prove this, let’s move our shell to test/cg1/ and try to spawn more than 2 processes there.

$ echo $$ > test/cg1/cgroup.procs
$ cat test/cgroup.procs

$ # the limit of 2 processes on test/ applies to test/'s children too
$ echo "Hello" | cat        # byexample: +timeout=30
<...>fork: Resource temporarily unavailable

In fact the whole subtree works as a single unit: the sum of the resources consumed by all the processes in the subtree is subject to the limits of the controller.

Let’s raise the limit to 3 processes, create a long-running process in test/cg2/ and try to spawn more processes in test/cg1/ as before.

While neither test/cg1/ nor test/cg2/ exceeds the limit of 3 processes on its own, their sum does.

$ # limit on our parent (test/)
$ echo 3 > test/pids.max

$ # we can spawn 2 processes without problem
$ # because we are under the limit (2+1 <= 3)
$ echo "Hello" | cat
Hello

$ # let's add a process to our sibling (test/cg2)
$ sleep 1000 &
[<job-id-proc1>] <pid-proc1>
$ echo <pid-proc1> > test/cg2/cgroup.procs      # byexample: +paste

$ # limit exceeded: our shell, the sleep process and these
$ # 2 spawned processes exceed the limit of 3 for the whole
$ # subtree of test/
$ echo "Hello" | cat        # byexample: +timeout=30
<...>fork: Resource temporarily unavailable
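To see how the subtree total is made up, one can count the pids listed in every cgroup.procs file below a cgroup. This is only a rough sketch: subtree_procs is a made-up helper name, cgroup.procs lists processes while pids.current also counts threads, and the helper itself forks a few processes, so it is best run from a shell that lives outside the subtree being inspected.

```shell
# subtree_procs CGROUP_DIR: count the processes listed in the
# cgroup.procs files anywhere under CGROUP_DIR (the cgroup itself
# included). Prints 0 when nothing is found.
subtree_procs() {
    find "$1" -name cgroup.procs -type f -exec cat {} + | awk 'END { print NR }'
}
```

Run as subtree_procs /sys/fs/cgroup/test, the result can be compared against test/pids.current and test/pids.max.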

Further resource distribution: top-down constraint

Having sub cgroups makes more sense if we enable controllers there, which gives us finer control over how the resources are distributed.

⊕ Writing +pids to test/cgroup.subtree_control enables the resource controller on the immediate children.

If we want to define (sub) limits in the immediate children of test/, we need to enable the +pids controller on the subtree.

$ # kill the sleep process, not of much interest from now on
$ kill -15 <pid-proc1>                          # byexample: +paste +pass

$ # enable the controller for the test/ children
$ echo '+pids' > test/cgroup.subtree_control

The cgroup.subtree_control file lists which controllers the immediate children will have access to and control over.

$ ls -1 test/cg1/pids*
test/cg1/pids.current
test/cg1/pids.events
test/cg1/pids.max

$ ls -1 test/cg2/pids*
test/cg2/pids.current
test/cg2/pids.events
test/cg2/pids.max
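Instead of listing the children's files, the parent can be asked directly. When read, cgroup.subtree_control holds a space-separated list of controller names (without the leading + used when writing). A minimal sketch; delegates_controller is a hypothetical helper name:

```shell
# delegates_controller CGROUP_DIR NAME: succeed if controller NAME is
# listed in CGROUP_DIR/cgroup.subtree_control, that is, enabled for the
# immediate children of that cgroup. A missing file counts as "no".
delegates_controller() {
    for c in $(cat "$1/cgroup.subtree_control" 2>/dev/null); do
        [ "$c" = "$2" ] && return 0
    done
    return 1
}
```

For example, delegates_controller test pids && echo yes would print yes after the write to test/cgroup.subtree_control shown above.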

By default the sub cgroups have no limit in their pids.max file, but they are still subject to the limit set in test/.

$ cat test/cg1/pids.max      # apparently no limit on the child (test/cg1)
max

$ echo "Hello" | cat | cat   # byexample: +timeout=30
<...>fork: Resource temporarily unavailable
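The kernel also keeps a counter of refused forks in the pids.events file, under the key max. The sketch below simply extracts that number; pids_limit_hits is a made-up name, and which cgroup's counter gets bumped when the limit that was hit belongs to an ancestor has differed between kernel versions, so take the value as a hint rather than a precise attribution.

```shell
# pids_limit_hits CGROUP_DIR: print the "max" counter from
# CGROUP_DIR/pids.events, or 0 if the key is not present.
pids_limit_hits() {
    awk '$1 == "max" { n = $2 } END { print n + 0 }' "$1/pids.events"
}
```

Comparing pids_limit_hits test and pids_limit_hits test/cg1 before and after a failed pipeline gives an idea of where the refusal was recorded.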

A controller in a child cgroup can never enlarge or relax the limits on a resource beyond those given by the parent; it can only restrict them further.

This is the top-down constraint.

So let’s try that!

$ # limit on the parent (test/)
$ echo 6 > test/pids.max

$ # we can spawn these two processes just fine now
$ echo "Hello" | cat
Hello

$ # limit further on the child (cg1/)
$ echo 2 > test/cg1/pids.max

$ # now we can't: we are hitting the limit not of test/ but of test/cg1/
$ echo "Hello" | cat        # byexample: +timeout=30
<...>fork: Resource temporarily unavailable
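Put differently, the limit that actually applies to a cgroup is the smallest pids.max found between it and the root. The following is a sketch of that rule under stated assumptions, not something the kernel exposes as a file: effective_pids_max is a hypothetical helper, it walks the directory tree upwards, and it treats both max and a missing pids.max as "no limit at this level".

```shell
# effective_pids_max ROOT REL_PATH: print the tightest pids.max found on
# the way from ROOT/REL_PATH up to ROOT, or "max" if no level sets one.
effective_pids_max() {
    root=${1%/}
    dir=$root/${2#/}
    dir=${dir%/}
    best=max
    while :; do
        if [ -r "$dir/pids.max" ]; then
            val=$(cat "$dir/pids.max")
            if [ "$val" != max ]; then
                # keep the smallest numeric limit seen so far
                if [ "$best" = max ] || [ "$val" -lt "$best" ]; then
                    best=$val
                fi
            fi
        fi
        [ "$dir" = "$root" ] && break
        case $dir in
            */*) dir=${dir%/*} ;;   # step up to the parent cgroup
            *) break ;;
        esac
    done
    echo "$best"
}
```

With the values set above, effective_pids_max /sys/fs/cgroup test/cg1 would report the child's own limit, while the same call for test/cg2 would report the limit inherited from test/.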

Explosion contained

⊕ Fork bomb reference.

Not really sure why, but a fork bomb on my machine tends to eat all the CPU, so it is not just eating all the pids.

Thankfully, cgroups have a resource controller for the CPU too.

$ echo 3 > test/pids.max                # byexample: +fail-fast
$ echo "1000 100000" > test/cpu.max     # byexample: +fail-fast

$ echo max > test/cg1/pids.max
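For reference, cpu.max holds two numbers, a quota and a period, both in microseconds: the cgroup may run for at most the quota within each period. The helper below is only an illustrative sketch (cpu_max_percent is a made-up name) that turns the pair into a percentage of one CPU.

```shell
# cpu_max_percent CGROUP_DIR: translate CGROUP_DIR/cpu.max
# ("<quota> <period>", both in microseconds) into a percentage of one
# CPU. A quota of "max" means that no limit is set.
cpu_max_percent() {
    awk '$1 == "max" { print "unlimited"; exit }
         { printf "%.1f%%\n", 100 * $1 / $2 }' "$1/cpu.max"
}
```

So the "1000 100000" written above lets the whole test/ subtree use about 1% of a CPU, which is what keeps the machine responsive while the bomb runs.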

$ bomb() {
>  bomb | bomb
> }

$ bomb       # byexample: +timeout=90
<...>Resource temporarily unavailable<...>

After a few experiments I would say that the +pids controller is by no means the only measure to take to contain a fork explosion. Without +cpu I melted down my machine a few times.

Run the bomb at your own risk!

Next stuff

In the previous post we talked about how to organize the cgroup hierarchy, and in this post about how to control and distribute resources over it.

But we talked little about where the processes should be put in the hierarchy. It may seem that they could be put anywhere and for v1 that is true.

In v2, however, things get a bit hairy and weird with the no internal process constraint.

Related tags: linux, kernel, cgroup, fork bomb

The Book of Gehn