linux-kernel - Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

Message-ID: <20180626203225.GT2494@hirez.programming.kicks-ass.net>
Date:   Tue, 26 Jun 2018 22:32:25 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:     linux-kernel@...r.kernel.org, mingo@...nel.org,
        jiangshanlai@...il.com, dipankar@...ibm.com,
        akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
        josh@...htriplett.org, tglx@...utronix.de, rostedt@...dmis.org,
        dhowells@...hat.com, edumazet@...gle.com, fweisbec@...il.com,
        oleg@...hat.com, joel@...lfernandes.org
Subject: Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to
 race with CPU offline

On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> Author: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> Date:   Fri May 25 19:23:09 2018 -0700
> 
>     rcu: Fix grace-period hangs due to race with CPU offline
>     
>     Without special fail-safe quiescent-state-propagation checks, grace-period
>     hangs can result from the following scenario:
>     
>     1.      CPU 1 goes offline.
>     
>     2.      Because CPU 1 is the only CPU in the system blocking the current
>             grace period, the grace period ends as soon as
>             rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
>             returns.

My current code doesn't have that call... So this is a new problem,
introduced earlier in this series.

>     3.      At this point, the leaf rcu_node structure's ->lock is no longer
>             held: rcu_report_qs_rnp() has released it, as it must in order
>             to awaken the RCU grace-period kthread.
>     
>     4.      At this point, that same leaf rcu_node structure's ->qsmaskinitnext
>             field still records CPU 1 as being online.  This is absolutely
>             necessary because the scheduler uses RCU (in this case on the
>             wake-up path while awakening RCU's grace-period kthread), and
>             ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
>             Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
>             bit from ->qsmaskinitnext would result in a lockdep-RCU splat
>             due to RCU being used from an offline CPU.

Argh.. so it's your own wakeup!

This all still smells really bad. But let me try and figure out where
you introduced the problem.
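
For readers following the quoted scenario, steps 1 through 4 can be walked
through in a toy userspace model. This is only a sketch under stated
assumptions, not the kernel implementation: struct toy_rnp, toy_report_qs()
and toy_cpu_offline() are hypothetical names, the lock is modelled as a plain
flag, and the ->qsmask field (CPUs still blocking the current grace period)
is an assumption added for illustration. Only ->lock, ->qsmaskinitnext and
the name rcu_report_qs_rnp() come from the commit log.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a leaf rcu_node structure (hypothetical, single-threaded). */
struct toy_rnp {
	bool locked;                  /* models ->lock as a simple flag */
	unsigned long qsmask;         /* assumed: CPUs still blocking the grace period */
	unsigned long qsmaskinitnext; /* RCU's idea as to which CPUs are online */
	bool gp_ended;                /* set once no CPU blocks the grace period */
};

/*
 * Hypothetical model of rcu_report_qs_rnp(): called with the lock held,
 * returns with it released, because the lock must be dropped in order to
 * awaken the grace-period kthread (step 3).
 */
static void toy_report_qs(struct toy_rnp *rnp, unsigned long mask)
{
	assert(rnp->locked);
	rnp->qsmask &= ~mask;
	if (!rnp->qsmask)
		rnp->gp_ended = true;   /* step 2: last blocker is gone */
	rnp->locked = false;            /* step 3: lock released for the wakeup */
	/* Step 4: the wake-up path uses RCU, so the CPU must still look online. */
	assert(rnp->qsmaskinitnext & mask);
}

/*
 * Hypothetical model of the offline path (step 1). Returns true if the
 * state described in steps 2-4 was observed: grace period ended, lock not
 * held, and the outgoing CPU still recorded as online.
 */
static bool toy_cpu_offline(struct toy_rnp *rnp, int cpu)
{
	unsigned long mask = 1UL << cpu;
	bool window = false;

	rnp->locked = true;
	if (rnp->qsmask & mask) {
		toy_report_qs(rnp, mask);       /* returns with lock dropped */
		window = rnp->gp_ended && !rnp->locked &&
			 (rnp->qsmaskinitnext & mask);
		rnp->locked = true;             /* reacquire to update the mask */
	}
	rnp->qsmaskinitnext &= ~mask;           /* only now mark the CPU offline */
	rnp->locked = false;
	return window;
}
```

Calling toy_cpu_offline() on a toy_rnp in which CPU 1 is the only bit set in
qsmask reproduces the ordering in the commit log: the online bit is cleared
only after the report has returned with the lock dropped. The model stops at
step 4, since the rest of the scenario is not quoted in this message.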