Message-ID: <xhsmhy1zr99zt.mognet@vschneid.remote.csb>
Date: Tue, 26 Apr 2022 15:48:06 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: paulmck@...nel.org
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com
Subject: Re: "Dying CPU not properly vacated" splat
On 25/04/22 17:03, Paul E. McKenney wrote:
> On Mon, Apr 25, 2022 at 10:59:44PM +0100, Valentin Schneider wrote:
>> On 25/04/22 10:33, Paul E. McKenney wrote:
>> >
>> > So what did rcu_torture_reader() do wrong here? ;-)
>> >
>>
>> So on teardown, CPUHP_AP_SCHED_WAIT_EMPTY->sched_cpu_wait_empty() waits for
>> the rq to be empty. Tasks must *not* be enqueued onto that CPU after that
>> step has been run - if there are per-CPU tasks bound to that CPU, they must
>> be unbound in their respective hotplug callback.
>>
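(For reference, and paraphrasing kernel/sched/core.c from memory rather
than quoting it verbatim, the "wait for the rq to be empty" step boils
down to something like the below - the exact condition may differ in
your tree:)

    static void balance_hotplug_wait(void)
    {
            struct rq *rq = this_rq();

            /* Woken from balance_push() as tasks leave the dying CPU. */
            rcuwait_wait_event(&rq->hotplug_wait,
                               rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
                               TASK_UNINTERRUPTIBLE);
    }

    int sched_cpu_wait_empty(unsigned int cpu)
    {
            /* Block until only the idle task is left on the outgoing rq. */
            balance_hotplug_wait();
            return 0;
    }
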
>> For instance for workqueue.c, we have workqueue_offline_cpu() as a hotplug
>> callback which invokes unbind_workers(cpu), the interesting bit being:
>>
>>     for_each_pool_worker(worker, pool) {
>>             kthread_set_per_cpu(worker->task, -1);
>>             WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
>>     }
>>
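(Side note for anyone following along: IIRC workqueue's callback hangs
off CPUHP_AP_WORKQUEUE_ONLINE, which sits above CPUHP_AP_SCHED_WAIT_EMPTY
in enum cpuhp_state, and teardown walks the states top-down, so the
unbind has already happened by the time sched_cpu_wait_empty() runs.
From memory the registration looks roughly like this - not a verbatim
quote, so double-check the source:)

    cpuhp_setup_state_nocalls(CPUHP_AP_WORKQUEUE_ONLINE, "workqueue:online",
                              workqueue_online_cpu, workqueue_offline_cpu);
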
>> The rcu_torture_reader() kthreads aren't bound to any particular CPU, are
>> they? I can't find any code that would indicate they are - and in that case
>> it means we have a problem with is_cpu_allowed() or something related to it.
>
> I did not intend that the rcu_torture_reader() kthreads be bound, and
> I am not seeing anything that binds them.
>
> Thoughts? (Other than that validating any alleged fix will be quite
> "interesting".)
>
> Thanx, Paul
IIUC the bogus scenario is that is_cpu_allowed() lets one of those kthreads
be enqueued on the outgoing CPU *after* CPUHP_AP_SCHED_WAIT_EMPTY.teardown()
has been run, and hilarity ensues.
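
For reference, here's roughly what is_cpu_allowed() checks - paraphrased
from kernel/sched/core.c from memory, so treat it as a sketch rather than
the exact code:

    static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
    {
            /* Not in the task's affinity mask at all? Then no. */
            if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                    return false;

            /* migrate_disable() regions must be allowed to finish. */
            if (is_migration_disabled(p))
                    return cpu_online(cpu);

            /* Userspace tasks need the CPU to be active. */
            if (!(p->flags & PF_KTHREAD))
                    return cpu_active(cpu) && task_cpu_possible(cpu, p);

            /* Per-CPU kthreads are allowed as long as the CPU is online. */
            if (kthread_is_per_cpu(p))
                    return cpu_online(cpu);

            /* Regular kthreads must not land on a dying CPU... */
            if (cpu_dying(cpu))
                    return false;

            /* ...but may otherwise use any online CPU. */
            return cpu_online(cpu);
    }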
The cpu_dying() condition should prevent a regular kthread from getting
enqueued there; most of the details have been evicted from my brain, but I
recall we got the ordering conditions right...
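
If I remember right, cpu_dying() is just a test against cpu_dying_mask,
which gets flipped from cpuhp_set_state() as soon as the hotplug operation
starts - i.e. well before CPUHP_AP_SCHED_WAIT_EMPTY's teardown. Roughly
(again from memory, not verbatim):

    /* include/linux/cpumask.h, approximately: */
    static inline bool cpu_dying(unsigned int cpu)
    {
            return cpumask_test_cpu(cpu, cpu_dying_mask);
    }

    /* kernel/cpu.c flips it when an operation starts or rolls back: */
    if (cpu_dying(cpu) != !bringup)
            set_cpu_dying(cpu, !bringup);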
The only other "obvious" thing here is migrate_disable(), which lets the
enqueue happen, but then balance_push()->select_fallback_rq() should punt
the task away at the next context switch.
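
For completeness, the relevant part of balance_push() goes something like
this (again paraphrased from memory and heavily trimmed):

    static void balance_push(struct rq *rq)
    {
            struct task_struct *push_task = rq->curr;

            /* Only act on the outgoing CPU itself, while it is dying. */
            if (!cpu_dying(rq->cpu) || rq != this_rq())
                    return;

            /*
             * Per-CPU kthreads and migration-disabled tasks are left in
             * place; the latter should get pushed once they re-enable
             * migration and schedule again.
             */
            if (kthread_is_per_cpu(push_task) ||
                is_migration_disabled(push_task))
                    return;

            /*
             * Everything else is handed to the stopper, which ends up in
             * __balance_push_cpu_stop() -> select_fallback_rq().
             */
            get_task_struct(push_task);
            raw_spin_rq_unlock(rq);
            stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
                                &rq->push_work);
            raw_spin_rq_lock(rq);
    }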
I need to rediscover those paths; I don't see any obvious clue right now.