[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z69YYeQB5E5Mi3Jf@vaxr-BM6660-BM6360>
Date: Fri, 14 Feb 2025 22:51:13 +0800
From: I Hsin Cheng <richard120310@...il.com>
To: Yury Norov <yury.norov@...il.com>
Cc: anshuman.khandual@....com, arnd@...db.de, linux-kernel@...r.kernel.org,
jserv@...s.ncku.edu.tw, skhan@...uxfoundation.org,
Matthias Kaehlcke <mka@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/2] uapi: Refactor __GENMASK() for speed-up
On Thu, Feb 13, 2025 at 02:54:27PM -0500, Yury Norov wrote:
> + Andrew, Matthias
>
> On Wed, Feb 12, 2025 at 10:01:12PM +0800, I Hsin Cheng wrote:
> > On Tue, Feb 11, 2025 at 01:44:18PM -0500, Yury Norov wrote:
> > > On Tue, Feb 11, 2025 at 01:39:34PM -0500, Yury Norov wrote:
> > > > On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > > > > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > > > > bitmask with "l" trailing zeroes, which is equivalent to
> > > > > "(~_UL(0) << (l))".
> > > >
> > > > I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> > > > not, but it has very limited usage through the kernel, all in the uapi.
> > > >
> > > > > Refactor the calculation so the number of arithmetic instruction will be
> > > > > reduced from 3 to 1.
> > > >
> > > > I'd like to look at it! Please share disassembly.
> > > >
> > > > > Signed-off-by: I Hsin Cheng <richard120310@...il.com>
> > > > > ---
> > > > > Test is done to show the speed-up we can get from reducing the number of
> > > > > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > > > > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
> > > >
> > > > So you CC arm maintainers and provide tests against x86_64. Are your
> > > > improvements consistent for arm, power and other arches?
> > > >
> > > > Can you run bloat-o-meter against your change?
> > >
> > > Ah, sorry, overlooked you bloat-o-meter results in cover letter.
> > > Anyways, can you provide it for each patch individually?
> >
> > Oh ok, let me paste them here first, I'll attach them along with v2 as
> > well.
> >
> > In the section below, vmlinux_old uses old version of GENMASK() and
> > GENMASK_ULL(), vmlinux_p1 use new version of GENMASK() and old version
> > of GENMASK_ULL(), vmlinux_p2 use new version of GENMASK() and new
> > version of GENMASK_ULL().
> >
> > $ ./scripts/bloat-o-meter vmlinux_old vmlinux_p1
> > add/remove: 0/2 grow/shrink: 46/505 up/down: 464/-1717 (-1253)
>
> So, I ran the build myself and I confirm that reverting c32ee3d9abd284b4
> (which this patch actually does) helps to save over 1k of .text. In my
> case it's 1269 bytes on x86_64 + defconfig for GENMASK() only.
>
> I looked at some functions affected by this revert, and I found that
> they call for_each_*_cpu() macros. This eventually boils down to bitmap
> functions like this:
>
> static __always_inline
> unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
> unsigned long offset)
> {
> if (small_const_nbits(size)) {
> unsigned long val;
>
> if (unlikely(offset >= size))
> return size;
>
> val = *addr & GENMASK(size - 1, offset);
> return val ? __ffs(val) : size;
> }
>
> return _find_next_bit(addr, size, offset);
> }
>
> The 'size' here is NR_CPUS, which on my machine is 64.
>
> GENMASK is used with non-constant parameters, but it's OK because
> from compiler's point of view, the GENMASK_INPUT_CHECK() which is
> just:
> BUILD_BUG_ON_ZERO(const_true((l) > (h)))
>
> It is passed as the 'offset > size', and it's is indeed a constant
> expression at that point because it's explicitly tested before.
>
> Is this for_each_cpu() thing the only case - I don't know, but it's
> enough to consider reverting back to the original version.
>
> I Hsing,
>
> At first, thank you for catching this up.
>
> I think performance testing part is trivial here, so let's focus on
> code generation and consistency.
>
> Can you please run your build against few recent GCC and clang versions?
> Can you also build the kernel for ARM64? You don't need to run it on
> real hardware (but maybe on VM). I just need to make sure that the
> result is consistent for all important arches.
>
> Matthias didn't mention how did he build his kernel. Did clang emit
> that warning against W=0, 1 or higher, and which code triggered the
> warning? Maybe clang already fixed it?
>
> Can you please check how would it work if NR_CPUS is set to be say 1024?
> This way the small_const_nbits optimization will be disabled.
>
> If you will find that clang still emits warnings at lower than W=2,
> can you please resend this patches together with a patch that
> suppresses clang warning?
>
> Can you also please attach one or two examples of code generation for
> real functions, not an artificial one as you did before. And maybe a
> link to goldbolt?
>
> For the next iteration can you please make sure that you refer your
> series as a revert of Matthias's patch?
>
> Thank you for discovering this. I realize that I'm asking you to do
> some extra work, but we all need to be 100% sure that it's not a fluke
> before reverting the c32ee3d9abd284b4 because it potentially leads to
> suppressing another clang warning.
>
> Thanks,
> Yury
Hi Yury,
No problem ! I'll be happy to help with these tests, I'll send the next
iteration when I complete the things you mentioned.
> Is this for_each_cpu() thing the only case - I don't know, but it's
> enough to consider reverting back to the original version.
So basically the reason of sizing up is because for_each_cpu() put
non-constant parameter in GENMASK(), which is supposed to be used by
constant only?
> Can you also please attach one or two examples of code generation for
> real functions, not an artificial one as you did before. And maybe a
> link to goldbolt?
I have no problem of other tests but this one, I wrote a simple
artificial use case because most functions I found according to the
report generated by bloat-o-meter, doesn't use GENMASK() directly or
they're super long and GENMASK() only accounts for very small part of
them, it wasn't very trivial to sense the difference of disassembly at
least for me. Should I just pick 1~2 random use cases? or do you have
any suggestions?
Btw, are you talking about Online Compiler Explorer? I don't really know
what goldbolt means here, sorry XD .
Best regards,
I Hsin Cheng
Powered by blists - more mailing lists