Message-ID: <20220426140919.GD4285@paulmck-ThinkPad-P17-Gen-1>
Date: Tue, 26 Apr 2022 07:09:19 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Christoph Bartoschek <bartoschek@...gle.com>
Cc: Chris Mason <clm@...com>, Giuseppe Scrivano <gscrivan@...hat.com>,
linux-kernel@...r.kernel.org,
"riel@...riel.com" <riel@...riel.com>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>
Subject: Re: [PATCH RFC fs/namespace] Make kern_unmount() use
synchronize_rcu_expedited()
On Tue, Apr 26, 2022 at 08:59:17AM +0200, Christoph Bartoschek wrote:
> The regression that has been introduced with commit
> e1eb26fa62d04ec0955432be1aa8722a97cb52e7 has hit us when building with Bazel
> using the linux-sandbox
> (https://github.com/bazelbuild/bazel/blob/master/src/main/tools/linux-sandbox.cc).
> The sandbox tries to isolate build steps from each other and to ensure that
> builds are hermetic, and therefore sets up new namespaces for each step. For
> large software packages, and even allowing for the time spent building, we run
> out of namespaces on larger machines that allow for enough parallelism. I have
> reduced the sandbox to a simple test case:
>
> #define _GNU_SOURCE
> #include <errno.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/wait.h>
>
> int pid1main(void *) {
>   return 0;
> }
>
> int main(void) {
>   int clone_flags = CLONE_NEWUSER | CLONE_NEWIPC | SIGCHLD;
>   void * stack = malloc(1024*1024);
>   const pid_t child_pid = clone(pid1main, stack + 1024*1024, clone_flags, NULL);
>
>   if (child_pid < 0) {
>     perror("clone");
>   }
>   int ret = waitpid(child_pid, NULL, 0);
>   if (ret < 0) {
>     perror("waitpid");
>     return ret;
>   }
>   return 0;
> }
>
> Run it with
> $ gcc clone-test.cc
> $ seq 1 10000000 | parallel --halt now,fail=1 -j32 $PWD/a.out
> clone: No space left on device
> waitpid: No child processes
> parallel: This job failed:
> /usr/local/google/home/bartoschek/linux-sandbox-test/a.out 53070
>
> I ran the test on kernel v5.18-rc4.
> Depending on your configured limits you will soon get an ENOSPC even though
> never more than 32 additional namespaces should be in use by parallel.
> During execution the whole system can become quite unresponsive.
> This does not happen without e1eb26fa62d04ec0955432be1aa8722a97cb52e7.
>
> I see that the issue was already reported in 2020:
> http://merlin.infradead.org/pipermail/linux-nvme/2020-September/019565.html
>
> Would it be possible to revert e1eb26fa62d04ec0955432be1aa8722a97cb52e7? It
> seems to make the kernel less deterministic and makes it hard to reason about
> active namespaces.

There were several attempts to fix this:

1. https://lore.kernel.org/lkml/20220214190549.GA2815154@paulmck-ThinkPad-P17-Gen-1/
   Replace a synchronize_rcu() with synchronize_rcu_expedited()

2. https://lore.kernel.org/lkml/20220217153620.4607bc28@imladris.surriel.com/
   Use queue_rcu_work() and streamline things.

3. https://lore.kernel.org/lkml/20220218183114.2867528-1-riel@surriel.com/
   Refined queue_rcu_work() approach.

#1 should work, but the resulting IPIs are not going to make the real-time
guys happy. #2 and #3 have been subject to reasonably heavy testing
and did fix a very similar issue to the one that you are reporting,
but last I knew there were doubts about the concurrency consequences.

Could you please give at least #3 a shot and see if it helps you?
Thanx, Paul