Message-ID: <ZjUNLAhS2F/Qxt/t@yury-ThinkPad>
Date: Fri, 3 May 2024 09:13:32 -0700
From: Yury Norov <yury.norov@...il.com>
To: Kuan-Wei Chiu <visitorckw@...il.com>
Cc: linux-kernel@...r.kernel.org,
	Rasmus Villemoes <linux@...musvillemoes.dk>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Chin-Chun Chen <n26122115@...ncku.edu.tw>,
	Ching-Chun Huang <jserv@...s.ncku.edu.tw>
Subject: Re: [PATCH 3/4] bitops: squeeze even more out of fns()

On Fri, May 03, 2024 at 10:19:10AM +0800, Kuan-Wei Chiu wrote:
> +Cc Chin-Chun Chen & Ching-Chun (Jim) Huang
> 
> On Thu, May 02, 2024 at 04:32:03PM -0700, Yury Norov wrote:
> > The function clears N-1 first set bits to find the N'th one with:
> > 
> > 	while (word && n--)
> > 		word &= word - 1;
> > 
> > In the worst case, it would take 63 iterations.
> > 
> > Instead of linear walk through the set bits, we can do a binary search
> > by using hweight(). This would work even better on platforms supporting
> > hardware-assisted hweight() - pretty much every modern arch.
> > 
> Chin-Chun once proposed a method similar to binary search combined with
> hamming weight and discussed it privately with me and Jim. However,
> Chin-Chun found that binary search would actually impair performance
> when n is small. Since we are unsure about the typical range of n in
> our actual workload, we have not yet proposed any relevant patches. If
> considering only the overall benchmark results, this patch looks good
> to me.

fns() is used only as a helper to find_nth_bit(). 
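
For reference, the two approaches under discussion can be sketched in
userspace C. This is only an illustration, not the code from the patch:
the names fns_linear() and fns_bsearch() and the fixed 64-bit word are
made up for the example, and __builtin_popcountll()/__builtin_ctzll()
(GCC/Clang builtins) stand in for hweight() and __ffs().

```c
#include <stdint.h>

#define WORD_BITS 64u

/* Linear walk: clear the n lowest set bits, then report the lowest one left. */
static unsigned int fns_linear(uint64_t word, unsigned int n)
{
	while (word && n--)
		word &= word - 1;

	return word ? (unsigned int)__builtin_ctzll(word) : WORD_BITS;
}

/*
 * Binary search: halve the window on each step and use the popcount of
 * its low half to decide which half holds the n-th (0-based) set bit.
 */
static unsigned int fns_bsearch(uint64_t word, unsigned int n)
{
	unsigned int pos = 0;
	unsigned int width;

	/* Fewer than n + 1 bits set: no such bit. */
	if (n >= (unsigned int)__builtin_popcountll(word))
		return WORD_BITS;

	for (width = WORD_BITS / 2; width; width /= 2) {
		uint64_t low = (word >> pos) & ((1ULL << width) - 1);
		unsigned int w = (unsigned int)__builtin_popcountll(low);

		if (n >= w) {
			/* Target is in the upper half: skip the low half's bits. */
			n -= w;
			pos += width;
		}
	}

	return pos;
}
```

The linear version costs one iteration per skipped bit, so it is cheap
for small n; the binary search always takes six steps on a 64-bit word
regardless of n.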

In the kernel, find_nth_bit() is used in
 - bitmap_bitremap(),
 - bitmap_remap(), and
 - cpumask_local_spread() via sched_numa_find_nth_cpu()

with the bit to search for calculated as n = n % cpumask_weight(). This
virtually implies uniformly distributed random n and word, just like
in test_fns().

In rebalance_wq_table() in drivers/crypto/intel/iaa/iaa_crypto_main.c
it's used like:
        
        for (cpu = 0; cpu < nr_cpus_per_node; cpu++) {
                int node_cpu = cpumask_nth(cpu, node_cpus);
                ...
        }

This is API abuse, and it should be rewritten with for_each_cpu().
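
To illustrate why the loop above is wasteful, here is a userspace model
on a single 64-bit mask. It is a sketch only: nth_bit(), visit_by_nth()
and visit_by_iter() are hypothetical stand-ins for cpumask_nth(), the
driver's loop, and a for_each_cpu()-style walk, and the GCC/Clang
builtins are assumed to be available.

```c
#include <stdint.h>

/* Stand-in for cpumask_nth() on a one-word mask; returns 64 if there is no such bit. */
static unsigned int nth_bit(uint64_t mask, unsigned int n)
{
	while (mask && n--)
		mask &= mask - 1;

	return mask ? (unsigned int)__builtin_ctzll(mask) : 64;
}

/* Pattern from the driver: every call restarts the search from bit 0. */
static unsigned int visit_by_nth(uint64_t mask, unsigned int *out)
{
	unsigned int count = (unsigned int)__builtin_popcountll(mask);
	unsigned int i;

	for (i = 0; i < count; i++)
		out[i] = nth_bit(mask, i);

	return count;
}

/* Iterator pattern: each set bit is visited exactly once. */
static unsigned int visit_by_iter(uint64_t mask, unsigned int *out)
{
	unsigned int count = 0;

	for (; mask; mask &= mask - 1)
		out[count++] = (unsigned int)__builtin_ctzll(mask);

	return count;
}
```

Both functions produce the same sequence of bit numbers, but the first
repeats the search for every index while the second makes a single pass.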

In cpumask_any_housekeeping() at arch/x86/kernel/cpu/resctrl/internal.h
it's used like:

        hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
        if (hk_cpu == exclude_cpu)
                hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);

        if (hk_cpu < nr_cpu_ids)
                cpu = hk_cpu;

And this is another example of API abuse. We need to introduce a new
helper, cpumask_andnot_any_but(), and use it like:

        hk_cpu = cpumask_andnot_any_but(exclude_cpu, mask, tick_nohz_full_mask);
        if (hk_cpu < nr_cpu_ids)
                cpu = hk_cpu;

So, where the use of find_nth_bit() is legitimate, the parameters are
distributed like in the test, and I would expect the real-life
performance impact to be similar to the test.

Optimizing the helper for non-legitimate cases isn't worth the
effort.

Thanks,
Yury