Message-ID: <Zyz8WpuIppVhcLHU@swahl-home.5wahls.com>
Date: Thu, 7 Nov 2024 11:43:54 -0600
From: Steve Wahl <steve.wahl@....com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Steve Wahl <steve.wahl@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org, Vishal Chourasia <vishalc@...ux.ibm.com>,
samir <samir@...ux.ibm.com>, Russ Anderson <rja@....com>,
Dimitri Sivanich <sivanich@....com>
Subject: Re: [PATCH v2] sched/topology: improve topology_span_sane speed
On Wed, Nov 06, 2024 at 10:19:13AM +0530, K Prateek Nayak wrote:
> Hello Steve,
Hi, Prateek; thanks for looking at this patch.
> On 11/1/2024 1:34 AM, Steve Wahl wrote:
> > Use a different approach to topology_span_sane(), that checks for the
> > same constraint of no partial overlaps for any two CPU sets for
> > non-NUMA topology levels, but does so in a way that is O(N) rather
> > than O(N^2).
> >
> > Instead of comparing with all other masks to detect collisions, keep
> > one mask that includes all CPUs seen so far and detect collisions with
> > a single cpumask_intersects test.
> >
> > If the current mask has no collisions with previously seen masks, it
> > should be a new mask, which can be uniquely identified ("id") by the
> > lowest bit set in this mask. Mark that we've seen a mask with this
> > id, and add the CPUs in this mask to the list of those seen.
> >
> > If the current mask does collide with previously seen masks, it should
> > be exactly equal to a mask seen before, identified once again by the
> > lowest bit the current mask has set. It's an error if we haven't seen
> > a mask with that id, or if the current mask doesn't match the one we
> > get by looking up that id.
> >
> > Move the topology_span_sane() check out of the existing topology level
> > loop and let it do its own looping to match the needs of this algorithm.
> >
> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> > the average time to take one processor offline is reduced from 2.18
> > seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
> > 34m49.765s without this change, 16m10.038s with this change in place.)
> >
> > Signed-off-by: Steve Wahl <steve.wahl@....com>
> > ---
> >
> > Version 2: Adopted suggestion by K Prateek Nayak that removes an array and
> > simplifies the code, and eliminates the erroneous use of
> > num_possible_cpus() that Peter Zijlstra noted.
> >
> > Version 1 discussion:
> > https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/
> >
> > kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
> > 1 file changed, 48 insertions(+), 25 deletions(-)
> >
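For illustration, a minimal userspace sketch of the check described in the
changelog above, assuming at most 64 CPUs so a plain uint64_t can stand in
for a cpumask; span_sane() and tl_mask[] are made-up stand-ins for
topology_span_sane() and tl->mask(), not the kernel code itself:

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the O(N) sanity check: one pass over the CPUs, a "covered"
 * union of every mask accepted so far, and an "id_seen" set recording the
 * lowest CPU ("id") of each accepted mask.
 */
static bool span_sane(const uint64_t *tl_mask, int nr_cpus)
{
	uint64_t covered = 0;	/* all CPUs contained in masks seen so far */
	uint64_t id_seen = 0;	/* ids (lowest set bits) of masks seen so far */

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		uint64_t mask = tl_mask[cpu];
		int id;

		if (!mask)
			return false;	/* a CPU must at least cover itself */
		id = __builtin_ctzll(mask);	/* lowest set bit is the unique id */

		if (!(mask & covered)) {
			/* New span: odd case where cpu != id yet the masks differ. */
			if (cpu != id && mask != tl_mask[id])
				return false;
			/* The id must not already belong to another span. */
			if (id_seen & (1ULL << id))
				return false;
			id_seen |= 1ULL << id;
			covered |= mask;
		} else if (!(id_seen & (1ULL << id)) || mask != tl_mask[id]) {
			/* Any overlap must exactly repeat the span already seen for id. */
			return false;
		}
	}
	return true;
}

With masks like {0,1},{0,1},{2,3},{2,3} this returns true; a partial
overlap such as {0,1},{1,2},... trips the else branch on the second CPU
and returns false.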
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9748a4c8d668..6a2a3e91d59e 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2356,35 +2356,58 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
> > /*
> > * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
> > - * any two given CPUs at this (non-NUMA) topology level.
> > + * any two given CPUs on non-NUMA topology levels.
> > */
> > -static bool topology_span_sane(struct sched_domain_topology_level *tl,
> > - const struct cpumask *cpu_map, int cpu)
> > +static bool topology_span_sane(const struct cpumask *cpu_map)
> > {
> > - int i = cpu + 1;
> > + struct sched_domain_topology_level *tl;
> > + struct cpumask *covered, *id_seen;
> > + int cpu;
> > - /* NUMA levels are allowed to overlap */
> > - if (tl->flags & SDTL_OVERLAP)
> > - return true;
> > + lockdep_assert_held(&sched_domains_mutex);
> > + covered = sched_domains_tmpmask;
> > + id_seen = sched_domains_tmpmask2;
> > +
> > + for_each_sd_topology(tl) {
> > +
> > + /* NUMA levels are allowed to overlap */
> > + if (tl->flags & SDTL_OVERLAP)
> > + continue;
> > +
> > + cpumask_clear(covered);
> > + cpumask_clear(id_seen);
> > - /*
> > - * Non-NUMA levels cannot partially overlap - they must be either
> > - * completely equal or completely disjoint. Otherwise we can end up
> > - * breaking the sched_group lists - i.e. a later get_group() pass
> > - * breaks the linking done for an earlier span.
> > - */
> > - for_each_cpu_from(i, cpu_map) {
> > /*
> > - * We should 'and' all those masks with 'cpu_map' to exactly
> > - * match the topology we're about to build, but that can only
> > - * remove CPUs, which only lessens our ability to detect
> > - * overlaps
> > + * Non-NUMA levels cannot partially overlap - they must be either
> > + * completely equal or completely disjoint. Otherwise we can end up
> > + * breaking the sched_group lists - i.e. a later get_group() pass
> > + * breaks the linking done for an earlier span.
> > */
> > - if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> > - cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> > - return false;
> > + for_each_cpu(cpu, cpu_map) {
> > + const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> > + int id;
> > +
> > + /* lowest bit set in this mask is used as a unique id */
> > + id = cpumask_first(tl_cpu_mask);
> > +
> > + /* if this mask doesn't collide with what we've already seen */
> > + if (!cpumask_intersects(tl_cpu_mask, covered)) {
> > + /* Really odd case when cpu != id, likely not sane */
> > + if ((cpu != id) && !cpumask_equal(tl_cpu_mask, tl->mask(id)))
>
> I was wondering, since we are doing a "for_each_cpu(cpu, cpu_map)",
> wouldn't we always see the "id" cpu first since "id" is nothing but the
> first CPU of the topology level mask? Maybe I'm not thinking creatively
> enough but I cannot see a scenario where the sanity check will trip
> here and not in the "else if" clause down below :)
The scenario I was thinking of that could trip this is one where
tl->mask(cpu) does not actually include the CPU itself. That should not
happen, but we are looking to detect a malformed set of masks, so I
added the check.
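To make that malformed case concrete (purely hypothetical numbers, reusing
the span_sane() sketch from earlier in this mail in the same file):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Made-up per-CPU level masks, bit i == CPU i. */
	uint64_t bad_tl_mask[4] = {
		0x1,	/* CPU 0: {0}                                    */
		0xc,	/* CPU 1: {2,3} -- does not contain CPU 1 itself */
		0x4,	/* CPU 2: {2}                                    */
		0x8,	/* CPU 3: {3}                                    */
	};

	/*
	 * span_sane() rejects this at CPU 1: the mask does not intersect
	 * "covered" yet, its id (lowest set bit) is CPU 2, and it differs
	 * from CPU 2's own mask, so the (cpu != id) branch returns false.
	 */
	printf("%s\n", span_sane(bad_tl_mask, 4) ? "sane" : "not sane");
	return 0;
}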
> > + return false;
> > + if (cpumask_test_and_set_cpu(id, id_seen))
> > + return false;
> > + cpumask_or(covered, tl_cpu_mask, covered);
> > + } else if ((!cpumask_test_cpu(id, id_seen)) ||
> > + !cpumask_equal(tl->mask(id), tl_cpu_mask)) {
> > + /*
> > + * a collision with covered should have exactly matched
> > + * a previously seen mask with the same id
> > + */
> > + return false;
> > + }
> > + }
> > }
> > -
> > return true;
> > }
> > @@ -2417,9 +2440,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > sd = NULL;
> > for_each_sd_topology(tl) {
> > - if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> > - goto error;
> > -
> > sd = build_sched_domain(tl, cpu_map, attr, sd, i);
> > has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
> > @@ -2433,6 +2453,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > }
> > }
> > + if (WARN_ON(!topology_span_sane(cpu_map)))
> > + goto error;
> > +
> > /* Build the groups for the domains */
> > for_each_cpu(i, cpu_map) {
> > for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>
> I've done some basic testing on my system on all NPS modes and also
> with "numa=fake" cmdline and I do not see any splats in my case. Please
> feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
>
> I'll try to hack the kernel to trip the topology check, with and without
> the patch, and also try to get back with some hotplug perf numbers but
> for the time being I do not see any false negatives with the approach on
> my dual socket 3rd Generation EPYC system (2 x 64C/128T)
>
> --
> Thanks and Regards,
> Prateek
Thanks so much!
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise