linux-kernel - Re: [PATCH v2] sched/topology: improve topology_span_sane speed

Message-ID: <20241213063137.GA16800@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
Date: Thu, 12 Dec 2024 22:31:37 -0800
From: Saurabh Singh Sengar <ssengar@...ux.microsoft.com>
To: Steve Wahl <steve.wahl@....com>, peterz@...radead.org,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
	kprateek.nayak@....com, vishalc@...ux.ibm.com, samir@...ux.ibm.com
Cc: rja@....com, sivanich@....com, mhklinux@...look.com
Subject: Re: [PATCH v2] sched/topology: improve topology_span_sane speed

On Thu, Oct 31, 2024 at 03:04:31PM -0500, Steve Wahl wrote:
> Use a different approach to topology_span_sane(), that checks for the
> same constraint of no partial overlaps for any two CPU sets for
> non-NUMA topology levels, but does so in a way that is O(N) rather
> than O(N^2).
> 
> Instead of comparing with all other masks to detect collisions, keep
> one mask that includes all CPUs seen so far and detect collisions with
> a single cpumask_intersects test.
> 
> If the current mask has no collisions with previously seen masks, it
> should be a new mask, which can be uniquely identified ("id") by the
> lowest bit set in this mask.  Mark that we've seen a mask with this
> id, and add the CPUs in this mask to the list of those seen.
> 
> If the current mask does collide with previously seen masks, it should
> be exactly equal to a mask seen before, identified once again by the
> lowest bit the current mask has set.  It's an error if we haven't seen
> a mask with that id, or if the current mask doesn't match the one we
> get by looking up that id.
> 
> Move the topology_span_sane() check out of the existing topology level
> loop, let it do its own looping to match the needs of this algorithm.
> 
> On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> the average time to take one processor offline is reduced from 2.18
> seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> 34m49.765s without this change, 16m10.038s with this change in place.)
> 
> Signed-off-by: Steve Wahl <steve.wahl@....com>
> Tested-by: Michael Kelley <mhklinux@...look.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> Version 2: Adopted suggestion by K Prateek Nayak that removes an array and
> simplifies the code, and eliminates the erroneous use of
> num_possible_cpus() that Peter Zijlstra noted.
> 
> Version 1 discussion:
>     https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/
> 
>  kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9748a4c8d668..6a2a3e91d59e 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2356,35 +2356,58 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  
>  /*
>   * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
> - * any two given CPUs at this (non-NUMA) topology level.
> + * any two given CPUs on non-NUMA topology levels.
>   */
> -static bool topology_span_sane(struct sched_domain_topology_level *tl,

<snip>
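
For readers following along, the single-pass check described in the
quoted commit message can be sketched in userspace C. This is only an
illustration under simplifying assumptions, not the code from the patch:
it models each CPU's span as a 64-bit bitmask instead of a struct cpumask
(so it handles at most 64 CPUs), and the names span_sane_sketch() and
lowest_bit() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the lowest set bit; the caller guarantees mask != 0. */
static unsigned int lowest_bit(uint64_t mask)
{
	unsigned int bit = 0;

	while (!(mask & 1)) {
		mask >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Hypothetical sketch: mask[cpu] is the span of 'cpu' at one topology
 * level, one bit per CPU.  Returns true if any two spans are either
 * identical or disjoint, i.e. there are no partial overlaps.
 */
static bool span_sane_sketch(const uint64_t *mask, size_t ncpus)
{
	uint64_t covered = 0;	/* all CPUs present in spans seen so far */
	uint64_t id_seen = 0;	/* lowest-bit ids of spans seen so far */

	assert(ncpus <= 64);	/* limit of this bitmask model */

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		uint64_t m = mask[cpu];
		unsigned int id;

		if (!m)
			continue;	/* empty spans are skipped in this sketch */

		id = lowest_bit(m);
		if (id >= ncpus)
			return false;	/* span names a CPU outside the map */

		if (!(m & covered)) {
			/* No collision: a new span, remember its id and CPUs. */
			id_seen |= (uint64_t)1 << id;
			covered |= m;
		} else if (!(id_seen & ((uint64_t)1 << id)) || mask[id] != m) {
			/* Collision that is not an exact repeat of a seen span. */
			return false;
		}
	}
	return true;
}
```

With four CPUs and spans {0x3, 0x3, 0xC, 0xC} the sketch reports a sane
layout, while {0x3, 0x6, 0x6} is rejected because CPU 1's span overlaps
CPU 0's span without being equal to it. Each CPU is visited once, which
is where the O(N) behaviour mentioned in the commit message comes from.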

Apologies for the delayed response; finding such machines for testing
is challenging.

I finally managed to test this patch on a large VM with 1792 vCPUs and
28 NUMA nodes. The CPUs in the test reported a BogoMIPS of 3800. For this
test, I measured the time taken by build_sched_domains() during boot.

Here are the results:

Without this patch, build_sched_domains() took approximately 2.33
seconds on the above system. With this patch applied, the time dropped
to 1.14 seconds, a saving of around 1.2 seconds (roughly 51%).
I understand that systems with less powerful CPUs may see even greater
improvements, but I do not currently have access to such hardware, so
these numbers should be read in relative terms.

I tested this patch purely for performance evaluation. If suitable,
please feel free to add:
Tested-by: Saurabh Sengar <ssengar@...ux.microsoft.com>

Are there any remaining concerns with this patch that we can address?

- Saurabh