Message-ID: <Zz4FuGyFDaQUWXjz@swahl-home.5wahls.com>
Date: Wed, 20 Nov 2024 09:52:24 -0600
From: Steve Wahl <steve.wahl@....com>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Steve Wahl <steve.wahl@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org,
K Prateek Nayak <kprateek.nayak@....com>,
Vishal Chourasia <vishalc@...ux.ibm.com>, samir <samir@...ux.ibm.com>,
Russ Anderson <rja@....com>, Dimitri Sivanich <sivanich@....com>
Subject: Re: [PATCH v2] sched/topology: improve topology_span_sane speed

On Tue, Nov 19, 2024 at 11:54:33PM +0100, Valentin Schneider wrote:
> On 19/11/24 15:03, Steve Wahl wrote:
> > On Wed, Nov 13, 2024 at 09:42:34AM -0600, Steve Wahl wrote:
> >> On Tue, Nov 12, 2024 at 05:15:47PM +0100, Valentin Schneider wrote:
> >> > On 31/10/24 15:04, Steve Wahl wrote:
> >> > > Use a different approach to topology_span_sane(), that checks for the
> >> > > same constraint of no partial overlaps for any two CPU sets for
> >> > > non-NUMA topology levels, but does so in a way that is O(N) rather
> >> > > than O(N^2).
> >> > >
> >> > > Instead of comparing with all other masks to detect collisions, keep
> >> > > one mask that includes all CPUs seen so far and detect collisions with
> >> > > a single cpumask_intersects test.
> >> > >
> >> > > If the current mask has no collisions with previously seen masks, it
> >> > > should be a new mask, which can be uniquely identified ("id") by the
> >> > > lowest bit set in this mask. Mark that we've seen a mask with this
> >> > > id, and add the CPUs in this mask to the list of those seen.
> >> > >
> >> > > If the current mask does collide with previously seen masks, it should
> >> > > be exactly equal to a mask seen before, identified once again by the
> >> > > lowest bit the current mask has set. It's an error if we haven't seen
> >> > > a mask with that id, or if the current mask doesn't match the one we
> >> > > get by looking up that id.
> >> > >
> >> > > Move the topology_span_sane() check out of the existing topology level
> >> > > loop, let it do its own looping to match the needs of this algorithm.
> >> > >
> >> > > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> >> > > the average time to take one processor offline is reduced from 2.18
> >> > > seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
> >> > > 34m49.765s without this change, 16m10.038s with this change in place.)
> >> > >
> >> > > Signed-off-by: Steve Wahl <steve.wahl@....com>
> >> > > ---
> >> > >
> >> > > Version 2: Adopted suggestion by K Prateek Nayak that removes an array and
> >> > > simplifies the code, and eliminates the erroneous use of
> >> > > num_possible_cpus() that Peter Zijlstra noted.
> >> > >
> >> > > Version 1 discussion:
> >> > > https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/
> >> > >
> >> > > kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
> >> > > 1 file changed, 48 insertions(+), 25 deletions(-)
> >> > >
> >> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> > > index 9748a4c8d668..6a2a3e91d59e 100644
> >> > > --- a/kernel/sched/topology.c
> >> > > +++ b/kernel/sched/topology.c
> >> > > @@ -2356,35 +2356,58 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
> >> > >
> >> > > /*
> >> > > * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
> >> > > - * any two given CPUs at this (non-NUMA) topology level.
> >> > > + * any two given CPUs on non-NUMA topology levels.
> >> > > */
> >> > > -static bool topology_span_sane(struct sched_domain_topology_level *tl,
> >> > > - const struct cpumask *cpu_map, int cpu)
> >> > > +static bool topology_span_sane(const struct cpumask *cpu_map)
> >> > > {
> >> > > - int i = cpu + 1;
> >> > > + struct sched_domain_topology_level *tl;
> >> > > + struct cpumask *covered, *id_seen;
> >> > > + int cpu;
> >> > >
> >> > > - /* NUMA levels are allowed to overlap */
> >> > > - if (tl->flags & SDTL_OVERLAP)
> >> > > - return true;
> >> > > + lockdep_assert_held(&sched_domains_mutex);
> >> > > + covered = sched_domains_tmpmask;
> >> > > + id_seen = sched_domains_tmpmask2;
> >> > > +
> >> > > + for_each_sd_topology(tl) {
> >> > > +
> >> > > + /* NUMA levels are allowed to overlap */
> >> > > + if (tl->flags & SDTL_OVERLAP)
> >> > > + continue;
> >> > > +
> >> > > + cpumask_clear(covered);
> >> > > + cpumask_clear(id_seen);
> >> > >
> >> > > - /*
> >> > > - * Non-NUMA levels cannot partially overlap - they must be either
> >> > > - * completely equal or completely disjoint. Otherwise we can end up
> >> > > - * breaking the sched_group lists - i.e. a later get_group() pass
> >> > > - * breaks the linking done for an earlier span.
> >> > > - */
> >> > > - for_each_cpu_from(i, cpu_map) {
> >> > > /*
> >> > > - * We should 'and' all those masks with 'cpu_map' to exactly
> >> > > - * match the topology we're about to build, but that can only
> >> > > - * remove CPUs, which only lessens our ability to detect
> >> > > - * overlaps
> >> > > + * Non-NUMA levels cannot partially overlap - they must be either
> >> > > + * completely equal or completely disjoint. Otherwise we can end up
> >> > > + * breaking the sched_group lists - i.e. a later get_group() pass
> >> > > + * breaks the linking done for an earlier span.
> >> > > */
> >> > > - if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> >> > > - cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> >> > > - return false;
> >> > > + for_each_cpu(cpu, cpu_map) {
> >> > > + const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> >> > > + int id;
> >> > > +
> >> > > + /* lowest bit set in this mask is used as a unique id */
> >> > > + id = cpumask_first(tl_cpu_mask);
> >> > > +
> >> > > + /* if this mask doesn't collide with what we've already seen */
> >> > > + if (!cpumask_intersects(tl_cpu_mask, covered)) {
> >> > > + /* Really odd case when cpu != id, likely not sane */
> >> > > + if ((cpu != id) && !cpumask_equal(tl_cpu_mask, tl->mask(id)))
> >> > > + return false;
> >> > > + if (cpumask_test_and_set_cpu(id, id_seen))
> >> > > + return false;
> >> > > + cpumask_or(covered, tl_cpu_mask, covered);
> >> > > + } else if ((!cpumask_test_cpu(id, id_seen)) ||
> >> > > + !cpumask_equal(tl->mask(id), tl_cpu_mask)) {
> >> > > + /*
> >> > > + * a collision with covered should have exactly matched
> >> > > + * a previously seen mask with the same id
> >> > > + */
> >> > > + return false;
> >> > > + }
> >> > > + }
> >> >
> >> > Ok so you're speeding it up, but you still get a O(nr_cpu_ids) walk every
> >> > hotplug when the check itself only needs to be done at most once per
> >> > possible online CPU combination (~ 2^(nr_cpu_ids)). If all CPUs are kicked
> >> > to life at boot, then the check only needs to be done once. If you only
> >> > boot with a subset of present CPUs to speed things up, the check still
> >> > becomes irrelevant once you've kicked the rest to life.
> >> >
> >> > I would reiterate my suggestion to get to a state where the check can be
> >> > entirely short-circuited [1].
> >> >
> >> > [1]: http://lore.kernel.org/r/xhsmh8quc5ca4.mognet@vschneid-thinkpadt14sgen2i.remote.csb
> >>
> >> Bringing forward a bit of that conversation:
> >>
> >> > > I tried adding this, surprisingly I saw no effect on the time taken,
> >> > > perhaps even a small slowdown, when combined with my patch. So at
> >> > > this point I don't intend to add it to v2 of the patch.
> >> > >
> >> >
> >> > Thanks for testing, I assume your cpu_possible_mask reports more CPUs than
> >> > you have physically plugged in...
> >>
> >> That assumption is wrong. I started with all CPUs enabled. Disabled
> >> and re-enabled cpus from there. The timings I got were as I stated,
> >> no effect, perhaps a small slowdown.
> >>
> >> > I guess it would make sense to short-circuit the function when
> >> > cpu_map is a subset of what we've previously checked, and then
> >> > re-kick the testing once new CPU(s) are plugged in. Something like
> >> > the untested below?
> >> >
> >> > Optimisations notwithstanding, IMO we shouldn't be repeating checks if we
> >> > can avoid it.
> >>
> >> I will attempt to measure it once more. I was surprised at my
> >> measured results, but that's why we take them, right?
> >>
> >> If I can't measure a difference, though, I am not sure it's
> >> appropriate to include the change with this patch, the point of which
> >> *is* optimization.
> >
> > I completed timing tests; the test system has 1920 logical CPUs, 2
> > threads per core, 60 cores per NUMA node, and the script offlined and
> > then onlined both threads of half the cores in each node. Four runs of
> > each configuration were timed. Times are in seconds.
> >
> > Unpatched kernel:
> >              Offline     Online
> >   min        3991.57     3967.22
> >   max        4025.59     4028.66
> >   avg        4013.79     3996.75
> >   stddev       15.28       29.90
> >
> > With my patch:
> >              Offline     Online
> >   min        1987.97     2130.70
> >   max        2032.25     2150.93
> >   avg        2017.43     2140.18
> >   stddev       20.12       10.25
> >
> > With my patch + Valentin's extra short-circuit patch:
> >              Offline     Online
> >   min        2019.58     2137.43
> >   max        2056.89     2223.86
> >   avg        2033.04     2171.91
> >   stddev       16.83       37.73
> >
> > I'm at a loss as to why adding the short circuit slowed things down,
> > but it's in the noise. If you want a new version of the patch
> > incorporating both changes after viewing these timings, I'm willing to
> > make that change. Let me know how you feel.
> >
>
> Huh, weird. Obviously your patch improves the situation and the
> short-circuit as proposed doesn't, so I'd say go forth with your patch and
> I'll poke around to figure out what's going on and submit a hopefully
> working short-circuit.

I think the short-circuit is actually working. In the previous
iteration of this thread, after my timings showed it was a tad slower,
I added some printk's to confirm the short circuit was actually being
taken, and it was. And this time around, I came up with a version of
your patch that works without mine; it's not as pretty without my
change that combines all the topology levels into one call to
topology_span_sane(), but the bypass gives a similar time saving.

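For reference, the shape of that bypass is roughly the untested sketch
below (names made up; this is not the exact code from [1] nor exactly
what I ran, just an illustration of the idea): remember the union of
CPU maps that have already passed the check, and skip the per-level
walk when cpu_map adds nothing new.

static struct cpumask span_sane_checked;	/* CPUs already validated */

static bool topology_span_sane(const struct cpumask *cpu_map)
{
	struct sched_domain_topology_level *tl;

	lockdep_assert_held(&sched_domains_mutex);

	/* A superset of cpu_map already passed; nothing new to verify. */
	if (cpumask_subset(cpu_map, &span_sane_checked))
		return true;

	for_each_sd_topology(tl) {
		/* ... per-level check as in the patch above, returning */
		/* false on any partial overlap ...                      */
	}

	/* Remember everything validated so far for future calls. */
	cpumask_or(&span_sane_checked, &span_sane_checked, cpu_map);
	return true;
}

New CPUs won't be in span_sane_checked yet, so plugging more in
naturally re-kicks the full check, as you described.
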
I am thinking that the memory touched by the full topology_span_sane
run may be pre-loading the cache for some of the subsequent
operations. That's the best I can come up with, at least for the
moment.

--> Steve
--
Steve Wahl, Hewlett Packard Enterprise