Message-ID: <ZjUNLAhS2F/Qxt/t@yury-ThinkPad>
Date: Fri, 3 May 2024 09:13:32 -0700
From: Yury Norov <yury.norov@...il.com>
To: Kuan-Wei Chiu <visitorckw@...il.com>
Cc: linux-kernel@...r.kernel.org,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Andrew Morton <akpm@...ux-foundation.org>,
Chin-Chun Chen <n26122115@...ncku.edu.tw>,
Ching-Chun Huang <jserv@...s.ncku.edu.tw>
Subject: Re: [PATCH 3/4] bitops: squeeze even more out of fns()
On Fri, May 03, 2024 at 10:19:10AM +0800, Kuan-Wei Chiu wrote:
> +Cc Chin-Chun Chen & Ching-Chun (Jim) Huang
>
> On Thu, May 02, 2024 at 04:32:03PM -0700, Yury Norov wrote:
> > The function clears N-1 first set bits to find the N'th one with:
> >
> > 	while (word && n--)
> > 		word &= word - 1;
> >
> > In the worst case, it would take 63 iterations.
> >
> > Instead of linear walk through the set bits, we can do a binary search
> > by using hweight(). This would work even better on platforms supporting
> > hardware-assisted hweight() - pretty much every modern arch.
> >
> Chin-Chun once proposed a method similar to binary search combined with
> hamming weight and discussed it privately with me and Jim. However,
> Chin-Chun found that binary search would actually impair performance
> when n is small. Since we are unsure about the typical range of n in
> our actual workload, we have not yet proposed any relevant patches. If
> considering only the overall benchmark results, this patch looks good
> to me.
fns() is used only as a helper to find_nth_bit().

In the kernel, find_nth_bit() is used in
- bitmap_bitremap(),
- bitmap_remap(), and
- cpumask_local_spread() via sched_numa_find_nth_cpu()
with the bit to search calculated as n = n % cpumask_weight(). This
virtually implies a uniformly distributed random n and word, just like
in test_fns().
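
For reference, the two approaches under discussion can be sketched in
userspace C roughly as below. This is an illustration only, not the patch
itself: the names fns_linear(), fns_bsearch() and NO_BIT are made up for the
example, a 64-bit word is assumed, and the GCC/Clang builtins
__builtin_popcountll() and __builtin_ctzll() stand in for hweight64() and
__ffs().

```c
#include <stdint.h>

/* Assumed sentinel: returned when the word has fewer than n + 1 set bits. */
#define NO_BIT 64

/* Linear walk: clear the n lowest set bits, then report the next one. */
static unsigned int fns_linear(uint64_t word, unsigned int n)
{
	while (word && n--)
		word &= word - 1;	/* drop the lowest set bit */

	return word ? (unsigned int)__builtin_ctzll(word) : NO_BIT;
}

/*
 * Binary search: at each step count the set bits in the lower half of the
 * current window and descend into the half that holds the n-th set bit.
 */
static unsigned int fns_bsearch(uint64_t word, unsigned int n)
{
	unsigned int pos = 0;

	if (n >= (unsigned int)__builtin_popcountll(word))
		return NO_BIT;

	for (unsigned int width = 32; width; width >>= 1) {
		uint64_t low = word & ((1ULL << width) - 1);
		unsigned int w = (unsigned int)__builtin_popcountll(low);

		if (n >= w) {
			/* Target is in the upper half: skip the lower bits. */
			n -= w;
			word >>= width;
			pos += width;
		} else {
			/* Target is in the lower half. */
			word = low;
		}
	}

	return pos;
}
```

The binary search always does six popcount steps on a 64-bit word, while the
linear walk does n iterations, which is why a small n can favor the linear
version and a uniformly distributed n favors the search.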

In rebalance_wq_table() in drivers/crypto/intel/iaa/iaa_crypto_main.c
it's used like:

	for (cpu = 0; cpu < nr_cpus_per_node; cpu++) {
		int node_cpu = cpumask_nth(cpu, node_cpus);
		...
	}

This is an API abuse, and should be rewritten with for_each_cpu().
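
To illustrate why, here is a userspace model of the two access patterns on a
single 64-bit mask. Everything in it is hypothetical: mask_nth() merely stands
in for cpumask_nth(), visit_by_iter() mimics what a for_each_cpu()-style walk
does, and the GCC/Clang builtins are assumed to be available.

```c
#include <stdint.h>

/* Hypothetical stand-in for cpumask_nth(): n-th set bit, or 64 if none. */
static unsigned int mask_nth(uint64_t mask, unsigned int n)
{
	while (mask && n--)
		mask &= mask - 1;

	return mask ? (unsigned int)__builtin_ctzll(mask) : 64;
}

/* Criticized pattern: every index restarts the search from bit 0. */
static unsigned int visit_by_nth(uint64_t mask, unsigned int *out)
{
	unsigned int count = (unsigned int)__builtin_popcountll(mask);

	for (unsigned int i = 0; i < count; i++)
		out[i] = mask_nth(mask, i);

	return count;
}

/* Iterator-style walk: each set bit is visited exactly once, in order. */
static unsigned int visit_by_iter(uint64_t mask, unsigned int *out)
{
	unsigned int count = 0;

	for (; mask; mask &= mask - 1)
		out[count++] = (unsigned int)__builtin_ctzll(mask);

	return count;
}
```

Both functions produce the same sequence of bit indices, but the first one
redoes the search for every element, so its cost grows quadratically with the
number of set bits, while the second makes a single pass.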

In cpumask_any_housekeeping() at arch/x86/kernel/cpu/resctrl/internal.h
it's used like:

	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
	if (hk_cpu == exclude_cpu)
		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);

	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;

And this is another example of API abuse. We need to introduce a new
helper, cpumask_andnot_any_but(), and use it like:

	hk_cpu = cpumask_andnot_any_but(exclude_cpu, mask, tick_nohz_full_mask);
	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;
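
The semantics I have in mind can be modeled on a single 64-bit word as below.
This is only a sketch: andnot_any_but() and NBITS are hypothetical names for
the example, the argument order just follows the call above, and a real
helper would of course work on full cpumasks and return a value >= nr_cpu_ids
when nothing is found.

```c
#include <stdint.h>

/* Assumed "not found" value for the single-word model. */
#define NBITS 64

/*
 * Return some bit that is set in mask1 and clear in mask2, never returning
 * 'exclude'. Returns NBITS if no such bit exists.
 */
static unsigned int andnot_any_but(unsigned int exclude,
				   uint64_t mask1, uint64_t mask2)
{
	uint64_t candidates = mask1 & ~mask2;

	/* Drop the excluded bit, if it is a valid bit index at all. */
	if (exclude < NBITS)
		candidates &= ~(1ULL << exclude);

	return candidates ? (unsigned int)__builtin_ctzll(candidates) : NBITS;
}
```

This gives the same result as the two cpumask_nth_andnot() calls, with one
pass over the mask and no dependence on which position the excluded CPU
happens to occupy.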
So, where the use of find_nth_bit() is legitimate, the parameters are
distributed like in the test, and I would expect the real-life
performance impact to be similar to the test.
Optimizing the helper for non-legitimate cases isn't worth the
effort.
Thanks,
Yury