Message-ID: <xhsmhttkrbvfb.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 27 Mar 2024 15:28:56 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>, Thomas Gleixner
 <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>, Ingo Molnar
 <mingo@...nel.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>, Alex
 Shi <alexs@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Vincent
 Guittot <vincent.guittot@...aro.org>, Barry Song
 <song.bao.hua@...ilicon.com>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re:
 [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in
 nohz_full)

On 27/03/24 13:42, Frederic Weisbecker wrote:
> On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
>> > Then with that patch I ran TREE07, just some short iterations:
>> >
>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>> >
>> > And the warning triggers very quickly. At least since v6.3 but maybe since
>> > earlier. Is this expected behaviour or am I right to assume that
>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>> >
>> 
>> I would very much assume an offline CPU shouldn't show up in a
>> sched_domain_span().
>> 
>> Now, on top of the above, there's one more thing worth noting:
>>   cpu_up_down_serialize_trainwrecks()
>> 
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> Ouch...
>
>> 
>> Maybe as a "quick" test to see if this is the right culprit, you could try
>> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
>> run within sched_cpu_deactivate().
>
> I just tried and I fear that doesn't help. It still triggers even without
> cpusets :-s
>

What, you mean I can't always blame cgroups? What has the world come to?

That's interesting, it means the deferred work item isn't the (only)
issue. I'll grab your test patch and try to reproduce on TREE07.
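
For reference, the invariant under discussion (no offline CPU in a
sched_domain span) can be sketched as a toy userspace model. This is only
an illustration under stated assumptions: cpu_bits, cpu_bit(), stale_cpus(),
span_is_sane() and rebuild_span() are hypothetical names, not kernel APIs,
and the real code works on struct cpumask rather than a 64-bit word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model, not kernel code: one bit per CPU, max 64. */
typedef uint64_t cpu_bits;

/* Bit representing a single CPU in the model. */
static cpu_bits cpu_bit(unsigned int cpu)
{
	assert(cpu < 64);
	return (cpu_bits)1 << cpu;
}

/* CPUs that are present in a domain span but absent from the online mask. */
static cpu_bits stale_cpus(cpu_bits span, cpu_bits online)
{
	return span & ~online;
}

/* The invariant: a span may only contain online CPUs. */
static bool span_is_sane(cpu_bits span, cpu_bits online)
{
	return stale_cpus(span, online) == 0;
}

/* Model of a domain rebuild: restrict the span to CPUs that are still online. */
static cpu_bits rebuild_span(cpu_bits span, cpu_bits online)
{
	return span & online;
}
```

In this model, clearing a CPU's bit in the online mask without calling
rebuild_span() leaves span_is_sane() false, which is the kind of window
where a check like the one in the test patch would fire.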

> Thanks.

