Message-ID: <20221123143758.GA1387380@lothringen>
Date:   Wed, 23 Nov 2022 15:37:58 +0100
From:   Frederic Weisbecker <frederic@...nel.org>
To:     Pengfei Xu <pengfei.xu@...el.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Christian Brauner <brauner@...nel.org>,
        "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     linux-kernel@...r.kernel.org, heng.su@...el.com,
        rcu@...r.kernel.org
Subject: PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller &
 bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)

On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> Hi Frederic Weisbecker and kernel developers,
> 
> Greeting!
> There is task hung in "synchronize_rcu" in v6.1-rc5 kernel.
> 
> Bisected the issue on Raptor and server(No atom small core, big core only),
> both platforms bisected results show that:
> first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> "sched: Provide Kconfig support for default dynamic preempt mode"
> 
> [  300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> [  300.097455]       Not tainted 6.1.0-rc5-094226ad94f4 #1
> [  300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  300.097922] task:rcu_tasks_kthre state:D stack:0     pid:11    ppid:2      flags:0x00004000
> [  300.098230] Call Trace:
> [  300.098325]  <TASK>
> [  300.098410]  __schedule+0x2de/0x8f0
> [  300.098562]  schedule+0x5b/0xe0
> [  300.098693]  schedule_timeout+0x3f1/0x4b0
> [  300.098849]  ? __sanitizer_cov_trace_pc+0x25/0x60
> [  300.099032]  ? queue_delayed_work_on+0x82/0xc0
> [  300.099206]  wait_for_completion+0x81/0x140
> [  300.099373]  __synchronize_srcu.part.23+0x83/0xb0
> [  300.099558]  ? __bpf_trace_rcu_stall_warning+0x20/0x20
> [  300.099757]  synchronize_srcu+0xd6/0x100
> [  300.099913]  rcu_tasks_postscan+0x19/0x20
> [  300.100070]  rcu_tasks_wait_gp+0x108/0x290
> [  300.100230]  ? _raw_spin_unlock+0x1d/0x40
> [  300.100389]  rcu_tasks_one_gp+0x27f/0x370
> [  300.100546]  ? rcu_tasks_postscan+0x20/0x20
> [  300.100709]  rcu_tasks_kthread+0x37/0x50
> [  300.100863]  kthread+0x14d/0x190
> [  300.100998]  ? kthread_complete_and_exit+0x40/0x40
> [  300.101199]  ret_from_fork+0x1f/0x30
> [  300.101347]  </TASK>

Thanks for reporting this. Fortunately I managed to reproduce and debug.
It took me a few days to understand the complicated circular dependency
involved.

So here is a summary:

1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
   that every subsequent child of TASK A will belong to. But TASK A doesn't
   itself belong to that new PID namespace.

2) TASK A forks() and creates TASK B (a new thread group, so TASK B is a
   thread group leader). TASK A stays attached to its original PID namespace
   (let's call it PID_NS1) and TASK B is the first task belonging to the new
   PID namespace created by unshare() (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
   child reaper.

4) TASK A forks() again and creates TASK C, which gets attached to PID_NS2.
   Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
   TASK B (belonging to PID_NS2) as its pid_namespace child_reaper.

5) TASK B exits and, since it is the child reaper for PID_NS2, it has to
   kill all other tasks attached to PID_NS2 and wait for all of them to die
   before reaping itself (zap_pid_ns_processes()). Note that it makes a
   misleading assumption here, trusting that all tasks in PID_NS2 either
   get reaped by a parent belonging to the same namespace or by TASK B itself.
   And it is confident that, since it deactivated its SIGCHLD handler, all
   the remaining tasks ultimately autoreap. So it waits for that to happen.
   However TASK C escapes that rule because it will get reaped by its parent
   TASK A, which belongs to PID_NS1.

6) TASK A calls synchronize_rcu_tasks(), which leads to
   synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
   But TASK B is inside a tasks_rcu_exit_srcu SRCU read-side critical section
   (exit_notify() runs between exit_tasks_rcu_start() and
   exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and, since TASK A is its parent, it waits for TASK A to reap
   it. But TASK A can't, because TASK A waits for TASK B, which waits for TASK C.

So there is a circular dependency:

_ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
section
_ TASK B waits for TASK C to get reaped
_ TASK C waits for TASK A to reap it.

I have no idea how to solve this situation without violating the pid_namespace
rules and the unshare() semantics (although I wish unshare(CLONE_NEWPID) had a
less error-prone behaviour than allowing more than one task to be created
belonging to the same namespace).

So having an SRCU read-side critical section within exit_notify() is probably
not a good idea. Is there a way to work around that for RCU Tasks?

Thanks.
