linux-kernel - Re: [PATCH RFC fs/namespace] Make kern_unmount() use synchronize_rcu

Message-ID: <20220426140919.GD4285@paulmck-ThinkPad-P17-Gen-1>
Date:   Tue, 26 Apr 2022 07:09:19 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Christoph Bartoschek <bartoschek@...gle.com>
Cc:     Chris Mason <clm@...com>, Giuseppe Scrivano <gscrivan@...hat.com>,
        linux-kernel@...r.kernel.org,
        "riel@...riel.com" <riel@...riel.com>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>
Subject: Re: [PATCH RFC fs/namespace] Make kern_unmount() use
 synchronize_rcu_expedited()

On Tue, Apr 26, 2022 at 08:59:17AM +0200, Christoph Bartoschek wrote:
> The regression that has been introduced with commit
> e1eb26fa62d04ec0955432be1aa8722a97cb52e7 has hit us when building with Bazel
> using the linux-sandbox
> (https://github.com/bazelbuild/bazel/blob/master/src/main/tools/linux-sandbox.cc).
> The sandbox tries to isolate build steps from each other and to ensure that
> builds are hermetic and therefore sets up new namespaces for each step. For
> large software packages, and even with the time spent building, we run out of
> namespaces on larger machines that allow for enough parallelism. I have reduced
> the sandbox to a simple test case:
> 
> #define _GNU_SOURCE
> #include <errno.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> 
> int pid1main(void *) {
>    return 0;
> }
> 
> int main(void) {
>   int clone_flags = CLONE_NEWUSER | CLONE_NEWIPC | SIGCHLD;
>   void * stack = malloc(1024*1024);
>   const pid_t child_pid = clone(pid1main, stack + 1024*1024, clone_flags, NULL);
> 
>   if (child_pid < 0) {
>     perror("clone");
>   }
>   int ret = waitpid(child_pid, NULL, 0);
>   if (ret < 0) {
>     perror("waitpid");
>     return ret;
>   }
>   return 0;
> }
> 
> Run it with
> $ gcc clone-test.cc
> $ seq 1 10000000 | parallel --halt now,fail=1 -j32 $PWD/a.out
> clone: No space left on device
> waitpid: No child processes
> parallel: This job failed:
> /usr/local/google/home/bartoschek/linux-sandbox-test/a.out 53070
> 
> I ran the test on kernel v5.18-rc4.
> Depending on your configured limits you will soon get an ENOSPC even though
> never more than 32 additional namespaces should be in use by parallel.
> During execution the whole system can become quite unresponsive.
> This does not happen without e1eb26fa62d04ec0955432be1aa8722a97cb52e7.
> 
> I see that the issue was already reported in 2020:
> http://merlin.infradead.org/pipermail/linux-nvme/2020-September/019565.html
> 
> Would it be possible to revert e1eb26fa62d04ec0955432be1aa8722a97cb52e7? It
> seems to make the kernel less deterministic and makes it hard to reason about
> active namespaces.

There were several attempts to fix this:

1. https://lore.kernel.org/lkml/20220214190549.GA2815154@paulmck-ThinkPad-P17-Gen-1/
	Replace a synchronize_rcu() with synchronize_rcu_expedited()

2. https://lore.kernel.org/lkml/20220217153620.4607bc28@imladris.surriel.com/
	Use queue_rcu_work() and streamline things.

3. https://lore.kernel.org/lkml/20220218183114.2867528-1-riel@surriel.com/
	Refined queue_rcu_work() approach.

#1 should work, but the resulting IPIs are not going to make the real-time
guys happy.  #2 and #3 have been subject to reasonably heavy testing
and did fix a very similar issue to the one that you are reporting,
but last I knew there were doubts about the concurrency consequences.

Could you please give at least #3 a shot and see if it helps you?

							Thanx, Paul