Message-ID: <xhsmhv7sxzpby.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Tue, 25 Feb 2025 22:28:33 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Steve Wahl <steve.wahl@....com>
Cc: Steve Wahl <steve.wahl@....com>, Ingo Molnar <mingo@...hat.com>, Peter
 Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
 linux-kernel@...r.kernel.org, K Prateek Nayak <kprateek.nayak@....com>,
 Vishal Chourasia <vishalc@...ux.ibm.com>, samir <samir@...ux.ibm.com>,
 Naman Jain <namjain@...ux.microsoft.com>, Saurabh Singh Sengar
 <ssengar@...ux.microsoft.com>, srivatsa@...il.mit.edu, Michael Kelley
 <mhklinux@...look.com>, Russ Anderson <rja@....com>, Dimitri Sivanich
 <sivanich@....com>
Subject: Re: [PATCH v3] sched/topology: improve topology_span_sane speed

On 25/02/25 14:02, Steve Wahl wrote:
> On Thu, Feb 20, 2025 at 01:59:20PM -0600, Steve Wahl wrote:
>> On Mon, Feb 17, 2025 at 11:11:36AM +0100, Valentin Schneider wrote:
>> > On 14/02/25 09:42, Steve Wahl wrote:
>> > > On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
>> > >> On 10/02/25 09:42, Steve Wahl wrote:
>> > >> > Use a different approach in topology_span_sane(), one that checks for
>> > >> > the same constraint of no partial overlaps between any two CPU sets at
>> > >> > non-NUMA topology levels, but does so in a way that is O(N) rather
>> > >> > than O(N^2).
>> > >> >
>> > >> > Instead of comparing with all other masks to detect collisions, keep
>> > >> > one mask that includes all CPUs seen so far and detect collisions with
>> > >> > a single cpumask_intersects test.
>> > >> >
>> > >> > If the current mask has no collisions with previously seen masks, it
>> > >> > should be a new mask, which can be uniquely identified by the lowest
>> > >> > bit set in this mask.  Keep a pointer to this mask for future
>> > >> > reference (in an array indexed by the lowest bit set), and add the
>> > >> > CPUs in this mask to the list of those seen.
>> > >> >
>> > >> > If the current mask does collide with previously seen masks, it should
>> > >> > be exactly equal to a mask seen before, which can be looked up in the
>> > >> > same array indexed by the lowest bit set in the mask, so a single
>> > >> > comparison suffices.
>> > >> >
>> > >> > Move the topology_span_sane() check out of the existing topology level
>> > >> > loop and give it its own loop, so that the array allocation can be done
>> > >> > only once and shared across levels.
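
For my own notes, here's a rough stand-alone sketch of that single-pass
check as I read it, with plain unsigned ints standing in for cpumasks and
made-up names (not the actual patch, just the idea):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/*
 * "seen" accumulates every CPU covered so far; "first_seen[]" records,
 * per lowest set bit, a pointer to the first mask that used that bit.
 */
static bool span_sane(const unsigned int *mask, int nr_cpus)
{
	const unsigned int *first_seen[NR_CPUS] = { NULL };
	unsigned int seen = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned int m = mask[cpu];
		int id;

		if (!m)			/* empty mask: nothing to check */
			continue;

		id = __builtin_ctz(m);	/* lowest set bit identifies the mask */

		if (m & seen) {
			/* Collision: must equal the mask recorded for this id. */
			if (!first_seen[id] || *first_seen[id] != m)
				return false;
			continue;
		}

		/* New mask: remember it and fold its CPUs into "seen". */
		first_seen[id] = &mask[cpu];
		seen |= m;
	}
	return true;
}

int main(void)
{
	unsigned int ok[NR_CPUS]  = { 0x3, 0x3, 0xc, 0xc };	/* {0,1} {0,1} {2,3} {2,3} */
	unsigned int bad[NR_CPUS] = { 0x3, 0x6, 0x6, 0x8 };	/* {0,1} overlaps {1,2} */

	printf("ok:  %d\n", span_sane(ok, NR_CPUS));	/* expect 1 */
	printf("bad: %d\n", span_sane(bad, NR_CPUS));	/* expect 0 */
	return 0;
}

If I've got the description right, this reports the first set as sane and
the second (a partial overlap) as insane.
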
>> > >> >
>> > >> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
>> > >> > the average time to take one processor offline is reduced from 2.18
>> > >> > seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
>> > >> > 34m49.765s without this change, 16m10.038s with this change in place.)
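
(For what it's worth, those averages look like the elapsed totals spread
over the 959 off-lined CPUs: 2089.765 s / 959 is about 2.18 s and
970.038 s / 959 is about 1.01 s per CPU.)
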
>> > >> >
>> > >> > Signed-off-by: Steve Wahl <steve.wahl@....com>
>> > >> > ---
>> > >> >
>> > >> > Version 3: While the intent of this patch is no functional change, I
>> > >> > discovered that version 2 had conditions where it would give different
>> > >> > results from the original code.  Version 3 returns to the V1 approach,
>> > >> > additionally correcting the handling of masks with no bits set and
>> > >> > fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
>> > >> > stand-alone test program that used all possible sets of four 4-bit
>> > >> > masks, this algorithm matched the original code in all cases, whereas
>> > >> > the others did not.
>> > >> >
>> > >>
>> > >> So looking at my notes from v2 I was under the impression the array-less
>> > >> approach worked; do you have an example topology where the array-less
>> > >> approach fails? I usually poke at topology stuff via QEMU, so if you have a
>> > >> virtual topology description I'd be happy to give that a span.
>> > >
>> > > Valentin, thank you for your time looking at this patch.
>> > >
>> > > Note that I'm trying to make this patch function exactly as the code
>> > > did before, only faster, regardless of the inputs.  No functional
>> > > change.
>> > >
>> > > Your statement below about assuming a mask should at least contain the
>> > > cpu itself is intertwined with finding differences.  This code is
>> > > checking the validity of the masks.  If we can't already trust that
>> > > the masks are disjoint, why can we trust they at least have the cpu
>> > > itself set?
>> > >
>> >
>> > True... Though I think this would already be caught by the sched_domain
>> > debugging infra we have, see sched_domain_debug_one().
>>
>> Note that a previous patch of mine was reverted because it allowed
>> another problem to surface on a small number of machines (and was
>> later reinstated after the other problem was fixed).
>>
>> Reference: https://lore.kernel.org/all/20240717213121.3064030-1-steve.wahl@hpe.com
>>
>> So, I am quite sensitive to introducing unintended behavior changes.
>>
>> Anyway, sched_domain_debug_one() is only called when
>> sched_debug_verbose is set.  But, at least as it sits currently,
>> topology_span_sane() is run at all times, and its return code is acted
>> on to change system behavior.
>>
>> If there's a system out there where the cpu masks are buggy but
>> currently accepted, I don't want this patch to cause that system to
>> degrade by declaring it insane.
>>
>> I don't fully understand all the code that sets up masks, as there's a
>> lot of it.  But as an example, I think I see in
>> arch/s390/kernel/topology.c that update_cpu_masks() uses
>> cpu_group_map() to update masks, and that function zeros the mask and
>> then returns if the cpu is not set in cpu_setup_mask.  So potentially
>> there can be some zeroed masks.
>>
>> [Why am I looking at s390 code? Simply because a 'sort | uniq' on the
>> possible tl->mask() functions took me to cpu_book_mask() first.]
>>
>> > So roughly speaking I'd say SCHED_DEBUG+sched_verbose gives you basic
>> > sanity checks on individual sched_domain's (and so indirectly topology
>> > levels), while topology_span_sane() looks at the intersections between the
>> > spans to check it all makes sense as a whole.
>> >
>> > Arguably there is some intersection (!) between these two debugging
>> > features, but I think it still makes sense to keep them separate.
>> >
>> > > Where the assumption that a cpu's mask contains itself holds true, it
>> > > appears v2 and v3 agree.
>> > >
>> > > An example of where the two disagree would be a set of four masks,
>> > > 0010 0001 0001 0001.  These match the stated criterion being checked
>> > > (no overlap between masks that don't exactly match), yet the v2
>> > > algorithm would mark it insane, whereas the original code doesn't.
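
To spell that example out, here's a quick stand-alone check of the original
pairwise criterion (masks that are not equal must not intersect) on those
four masks, again with plain unsigned ints and made-up names rather than
the kernel's cpumask code:

#include <stdbool.h>
#include <stdio.h>

/* Original criterion: any two masks that differ must not intersect. */
static bool pairwise_sane(const unsigned int *mask, int n)
{
	for (int i = 0; i < n; i++)
		for (int j = i + 1; j < n; j++)
			if (mask[i] != mask[j] && (mask[i] & mask[j]))
				return false;
	return true;
}

int main(void)
{
	/* Reading the example left to right as the masks for cpu0..cpu3. */
	unsigned int mask[4] = { 0x2, 0x1, 0x1, 0x1 };	/* 0010 0001 0001 0001 */

	printf("%s\n", pairwise_sane(mask, 4) ? "sane" : "insane");
	return 0;
}

The 0010 mask intersects nothing and the three 0001 masks are identical, so
this prints "sane", even though (under that reading) none of the masks
contain their own CPU.
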
>> > >
>> > > If it's valid to mark some additional conditions as insane, I'd rather
>> > > see that in a different patch, because I'm not comfortable enough with
>> > > the ramifications of possibly marking as insane a system that current
>> > > code reports as sane.
>> > >
>> >
>> > Per the above I think it's fairly safe to call that setup insane;
>> > sched_domain_debug_one() would call it so.
>>
>> But this isn't just debug code, at least as it sits now, and system
>> operation changes based on what it returns.  It's not just a printk.
>>
>> > IMO your v2 had sufficient checks and was quite neat without the
>> > additional array. Unless I'm missing something I don't see why we couldn't
>> > stick with that approach.
>>
>> I won't deny I liked the appearance of v2.  As a separate follow-on
>> patch I certainly wouldn't object, especially if it came from someone
>> working on improving the scheduling code rather than someone who's
>> just here because this code slows down large machines significantly.
>>
>> But I would first like to see the speed-up in a low-risk form without
>> possible functional changes.
>
> Valentin,
>
> How would you feel about a two-part series, where part 1 is my v1
> patch and part 2 is the v2 improvements?  Then if there were a
> problem with the v2 improvements, that part could be reverted and we'd
> still have the speed improvement.
>
> Thanks for considering,

Sounds good to me! If it helps, I'll be happy to test that under QEMU with
various fake topologies.

