Message-ID: <xhsmhmsekj2xz.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Mon, 17 Feb 2025 11:11:36 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Steve Wahl <steve.wahl@....com>
Cc: Steve Wahl <steve.wahl@....com>, Ingo Molnar <mingo@...hat.com>, Peter
Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
linux-kernel@...r.kernel.org, K Prateek Nayak <kprateek.nayak@....com>,
Vishal Chourasia <vishalc@...ux.ibm.com>, samir <samir@...ux.ibm.com>,
Naman Jain <namjain@...ux.microsoft.com>, Saurabh Singh Sengar
<ssengar@...ux.microsoft.com>, srivatsa@...il.mit.edu, Michael Kelley
<mhklinux@...look.com>, Russ Anderson <rja@....com>, Dimitri Sivanich
<sivanich@....com>
Subject: Re: [PATCH v3] sched/topology: improve topology_span_sane speed

On 14/02/25 09:42, Steve Wahl wrote:
> On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
>> On 10/02/25 09:42, Steve Wahl wrote:
>> > Use a different approach to topology_span_sane(), one that checks the
>> > same constraint (no partial overlaps between any two CPU sets for
>> > non-NUMA topology levels) but does so in a way that is O(N) rather
>> > than O(N^2).
>> >
>> > Instead of comparing with all other masks to detect collisions, keep
>> > one mask that includes all CPUs seen so far and detect collisions with
>> > a single cpumask_intersects test.
>> >
>> > If the current mask has no collisions with previously seen masks, it
>> > should be a new mask, which can be uniquely identified by the lowest
>> > bit set in this mask. Keep a pointer to this mask for future
>> > reference (in an array indexed by the lowest bit set), and add the
>> > CPUs in this mask to the list of those seen.
>> >
>> > If the current mask does collide with previously seen masks, it should
>> > be exactly equal to a mask seen before, which can be looked up in the
>> > same array indexed by the lowest bit set in the mask: a single
>> > comparison.
>> >
>> > Move the topology_span_sane() check out of the existing topology level
>> > loop and give it its own loop, so that the array allocation can be
>> > done only once and shared across levels.
>> >
>> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
>> > the average time to take one processor offline is reduced from 2.18
>> > seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
>> > 34m49.765s without this change, 16m10.038s with this change in place.)
>> >
>> > Signed-off-by: Steve Wahl <steve.wahl@....com>
>> > ---
>> >
>> > Version 3: While the intent of this patch is no functional change, I
>> > discovered that version 2 had conditions where it would give different
>> > results from the original code. Version 3 returns to the V1 approach,
>> > additionally correcting the handling of masks with no bits set and
>> > fixing the num_possible_cpus() problem Peter Zijlstra noted. In a
>> > stand-alone test program that used all possible sets of four 4-bit
>> > masks, this algorithm matched the original code in all cases, whereas
>> > the others did not.
>> >
>>
>> So looking at my notes from v2, I was under the impression the array-less
>> approach worked; do you have an example topology where the array-less
>> approach fails? I usually poke at topology stuff via QEMU, so if you have
>> a virtual topology description I'd be happy to give that a span.
>
> Valentin, thank you for your time looking at this patch.
>
> Note that I'm trying to make this patch function exactly as the code
> did before, only faster, regardless of the inputs. No functional
> change.
>
> Your statement below, that we can assume a mask should at least contain
> the CPU itself, is intertwined with finding differences. This code is
> checking the validity of the masks: if we can't already trust that
> the masks are disjoint, why can we trust they at least have the CPU
> itself set?
>
True... Though I think this would already be caught by the sched_domain
debugging infra we have; see sched_domain_debug_one().
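
For reference, the check I have in mind is roughly the following
(paraphrased from memory of kernel/sched/topology.c; the exact message
may differ between versions):

	if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
		printk(KERN_ERR "ERROR: domain->span does not contain CPU%d\n",
		       cpu);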

So roughly speaking I'd say SCHED_DEBUG+sched_verbose gives you basic
sanity checks on individual sched_domains (and thus, indirectly, on
topology levels), while topology_span_sane() looks at the intersections
between the spans to check it all makes sense as a whole.

Arguably there is some intersection (!) between these two debugging
features, but I think it still makes sense to keep them separate.
> Where the assumption that a cpu's mask contains itself holds true, it
> appears v2 and v3 agree.
>
> An example of where the two disagree would be a set of four masks:
> 0010 0001 0001 0001. These match the stated criterion being checked
> (no overlap between masks that don't exactly match), yet the v2
> algorithm would mark it insane, whereas the original code doesn't.
>
> If it's valid to mark some additional conditions as insane, I'd rather
> see that in a different patch, because I'm not comfortable enough with
> the ramifications of possibly marking as insane a system that current
> code reports as sane.
>
Per the above I think it's fairly safe to call that setup insane;
sched_domain_debug_one() would call it so.

IMO your v2 had sufficient checks and was quite neat without the
additional array. Unless I'm missing something, I don't see why we
couldn't stick with that approach.
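
To make the comparison concrete, here is a stand-alone userspace sketch
of how I read the two checks, with uint64_t standing in for cpumasks.
The names and the empty-mask handling are my own reconstruction from the
changelog, not code lifted from the patch; on Steve's four-mask example
both variants report "sane":

/* span_check.c - compare the original O(N^2) sanity check with the
 * single-pass approach described in the changelog.
 * Build: cc -Wall -o span_check span_check.c
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 64

/* Original behaviour: any two masks must be either equal or disjoint. */
static bool span_sane_pairwise(const uint64_t *masks, int n)
{
	for (int i = 0; i < n; i++)
		for (int j = i + 1; j < n; j++)
			if ((masks[i] & masks[j]) && masks[i] != masks[j])
				return false;
	return true;
}

/*
 * v3-style single pass: 'covered' is the union of all CPUs seen so far,
 * and 'seen[id]' remembers the first mask whose lowest set bit was 'id'.
 * A mask that doesn't intersect 'covered' is new; one that does must be
 * an exact repeat of the mask registered under its lowest set bit.
 */
static bool span_sane_linear(const uint64_t *masks, int n)
{
	const uint64_t *seen[NR_CPUS] = { NULL };
	uint64_t covered = 0;

	for (int i = 0; i < n; i++) {
		uint64_t m = masks[i];

		if (!m)
			continue; /* empty mask: overlaps nothing */

		if (!(m & covered)) {
			seen[__builtin_ctzll(m)] = &masks[i];
			covered |= m;
		} else {
			const uint64_t *prev = seen[__builtin_ctzll(m)];

			if (!prev || m != *prev)
				return false; /* partial overlap: insane */
		}
	}
	return true;
}

int main(void)
{
	/* Steve's example: 0010 0001 0001 0001 */
	uint64_t ex[] = { 0x2, 0x1, 0x1, 0x1 };

	printf("pairwise: %s, linear: %s\n",
	       span_sane_pairwise(ex, 4) ? "sane" : "insane",
	       span_sane_linear(ex, 4) ? "sane" : "insane");
	return 0;
}

Both print "sane" here; brute-forcing all sets of four small masks, as
Steve describes, is a nice way to convince oneself the two really do
agree.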