Message-ID: <xhsmh8quc5ca4.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Fri, 25 Oct 2024 19:21:23 +0200
From: Valentin Schneider <vschneid@...hat.com>
To: Steve Wahl <steve.wahl@....com>
Cc: Steve Wahl <steve.wahl@....com>, Ingo Molnar <mingo@...hat.com>, Peter
Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
linux-kernel@...r.kernel.org, Russ Anderson <rja@....com>, Dimitri
Sivanich <sivanich@....com>
Subject: Re: [PATCH] sched/topology: improve topology_span_sane speed
On 25/10/24 10:06, Steve Wahl wrote:
> On Tue, Oct 15, 2024 at 04:37:35PM +0200, Valentin Schneider wrote:
>> On 10/10/24 10:51, Steve Wahl wrote:
>> > Use a different approach to topology_span_sane(), one that checks the
>> > same constraint of no partial overlaps for any two CPU sets at
>> > non-NUMA topology levels, but does so in a way that is O(N) rather
>> > than O(N^2).
>> >
>> > Instead of comparing with all other masks to detect collisions, keep
>> > one mask that includes all CPUs seen so far and detect collisions with
>> > a single cpumask_intersects test.
>> >
>> > If the current mask has no collisions with previously seen masks, it
>> > should be a new mask, which can be uniquely identified by the lowest
>> > bit set in this mask. Keep a pointer to this mask for future
>> > reference (in an array indexed by the lowest bit set), and add the
>> > CPUs in this mask to the list of those seen.
>> >
>> > If the current mask does collide with previously seen masks, it should
>> > be exactly equal to a mask seen before, looked up in the same array
>> > indexed by the lowest bit set in the mask, a single comparison.
>> >
>> > Move the topology_span_sane() check out of the existing topology level
>> > loop and give it its own loop, so that the array allocation can be done
>> > only once and shared across levels.
>> >
>> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
>> > the average time to take one processor offline is reduced from 2.18
>> > seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
>> > 34m49.765s without this change, 16m10.038s with this change in place.)
>> >
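(Aside for readers following the thread: here is a rough userspace sketch
of the single-pass check as I read the description above. The names, the
fixed-size array and the plain 64-bit bitmasks standing in for cpumasks
are all illustrative, not the actual patch.)

/*
 * Userspace sketch of the single-pass check: topology masks are modelled
 * as 64-bit bitmasks, and the "id" of a mask is its lowest set bit, as in
 * the description above. Each mask is assumed to contain its own CPU.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_CPUS 64

/* tl_mask[cpu] plays the role of tl->mask(cpu). */
static bool spans_are_sane(const uint64_t *tl_mask, uint64_t cpu_map)
{
	uint64_t covered = 0;		/* union of all masks seen so far */
	uint64_t seen[NR_CPUS];		/* mask recorded per lowest-bit id */
	int cpu, id;

	memset(seen, 0, sizeof(seen));

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		uint64_t mask;

		if (!(cpu_map & (1ULL << cpu)))
			continue;

		mask = tl_mask[cpu];
		id = __builtin_ctzll(mask);	/* lowest CPU in this mask */

		if (!(mask & covered)) {
			/* No overlap with anything seen: record it under its id. */
			seen[id] = mask;
			covered |= mask;
		} else if (seen[id] != mask) {
			/* Overlaps a previous mask without being equal to it. */
			return false;
		}
	}
	return true;
}

int main(void)
{
	/* Four CPUs in two disjoint pairs: sane. */
	uint64_t good[NR_CPUS] = { 0x3, 0x3, 0xc, 0xc };
	/* CPU2's mask overlaps the first pair without equalling it: not sane. */
	uint64_t bad[NR_CPUS]  = { 0x3, 0x3, 0x6, 0xc };

	printf("good: %d, bad: %d\n",
	       spans_are_sane(good, 0xf), spans_are_sane(bad, 0xf));
	return 0;
}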
>>
>> This isn't the first complaint about topology_span_sane() vs big
>> systems. It might be worth disabling the check once it has scanned all
>> CPUs - not necessarily at init, since some folks have their systems
>> boot with only a subset of the available CPUs and online them later on.
>>
>> I'd have to think more about how this behaves vs the dynamic NUMA topology
>> code we got as of
>>
>> 0fb3978b0aac ("sched/numa: Fix NUMA topology for systems with CPU-less nodes")
>>
>> (i.e. is scanning all possible CPUs enough to guarantee no overlaps when
>> having only a subset of online CPUs? I think so...)
>>
>> but maybe something like so?
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 9748a4c8d6685..bf95c3d4f6072 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2361,12 +2361,25 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>> static bool topology_span_sane(struct sched_domain_topology_level *tl,
>> const struct cpumask *cpu_map, int cpu)
>> {
>> + static bool validated;
>> int i = cpu + 1;
>>
>> + if (validated)
>> + return true;
>> +
>> /* NUMA levels are allowed to overlap */
>> if (tl->flags & SDTL_OVERLAP)
>> return true;
>>
>> + /*
>> + * We're visiting all CPUs available in the system, no need to re-check
>> + * things after that. Even if we end up finding overlaps here, we'll
>> + * have issued a warning and can skip the per-CPU scan in later
>> + * calls to this function.
>> + */
>> + if (cpumask_equal(cpu_map, cpu_possible_mask))
>> + validated = true;
>> +
>> /*
>> * Non-NUMA levels cannot partially overlap - they must be either
>> * completely equal or completely disjoint. Otherwise we can end up
>
> I tried adding this; surprisingly, I saw no effect on the time taken,
> perhaps even a small slowdown, when combined with my patch. So at
> this point I don't intend to add it to v2 of the patch.
>
Thanks for testing. I assume your cpu_possible_mask reports more CPUs than
you have physically plugged in... I guess it would make sense to
short-circuit the function when cpu_map is a subset of what we've
previously checked, and then re-kick the testing once new CPU(s) are
plugged in. Something like the untested diff below?

Optimisations notwithstanding, IMO we shouldn't be repeating checks if we
can avoid them.
---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d6685..87ba730c34800 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2354,6 +2354,8 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
return sd;
}
+static cpumask_var_t topology_sane_cpus;
+
/*
* Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
* any two given CPUs at this (non-NUMA) topology level.
@@ -2367,6 +2369,11 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
if (tl->flags & SDTL_OVERLAP)
return true;
+ if (cpumask_subset(cpu_map, topology_sane_cpus))
+ return true;
+
+ cpumask_or(topology_sane_cpus, cpu_map, topology_sane_cpus);
+
/*
* Non-NUMA levels cannot partially overlap - they must be either
* completely equal or completely disjoint. Otherwise we can end up
@@ -2607,6 +2614,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
+ zalloc_cpumask_var(&topology_sane_cpus, GFP_KERNEL);
arch_update_cpu_topology();
asym_cpu_capacity_scan();
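
For completeness, a tiny userspace model (plain 64-bit bitmasks instead of
cpumask_var_t, made-up names, not the actual kernel code) of how the
subset short-circuit above is meant to behave across rebuilds and hotplug:

/*
 * Skip the sanity scan when the requested cpu_map is covered by what was
 * already checked, and widen the remembered set otherwise.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t checked_cpus;	/* analogue of topology_sane_cpus */

static bool need_span_check(uint64_t cpu_map)
{
	/* Analogue of cpumask_subset(): nothing new to look at. */
	if (!(cpu_map & ~checked_cpus))
		return false;

	/* Analogue of cpumask_or(): remember what we are about to scan. */
	checked_cpus |= cpu_map;
	return true;
}

int main(void)
{
	printf("%d\n", need_span_check(0x0f)); /* first build: 1, scan */
	printf("%d\n", need_span_check(0x0f)); /* same CPUs again: 0, skip */
	printf("%d\n", need_span_check(0x03)); /* smaller partition: 0, skip */
	printf("%d\n", need_span_check(0xff)); /* CPUs hotplugged in: 1, scan */
	return 0;
}

i.e. rebuilds over the same or a smaller set of CPUs skip the scan, while
a rebuild that brings in newly plugged-in CPUs falls through to the full
check again.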