[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z69nTow0iaZIeV85@thinkpad>
Date: Fri, 14 Feb 2025 10:55:02 -0500
From: Yury Norov <yury.norov@...il.com>
To: I Hsin Cheng <richard120310@...il.com>
Cc: anshuman.khandual@....com, arnd@...db.de, linux-kernel@...r.kernel.org,
jserv@...s.ncku.edu.tw, skhan@...uxfoundation.org,
Matthias Kaehlcke <mka@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/2] uapi: Refactor __GENMASK() for speed-up
> No problem ! I'll be happy to help with these tests, I'll send the next
> iteration when I complete the things you mentioned.
>
> > Is this for_each_cpu() thing the only case - I don't know, but it's
> > enough to consider reverting back to the original version.
>
> So basically the reason of sizing up is because for_each_cpu() put
> non-constant parameter in GENMASK(), which is supposed to be used by
> constant only?
The (l - h) is supposed to be a const_true. In most cases both l and h
are compile-time constants. In that case GENMASK() is evaluated at
compile time, and it allows to complicate it to workaround clang
warning.
But now we obviously have an example where workarounding sacrifices
code generation. This is not OK.
> > Can you also please attach one or two examples of code generation for
> > real functions, not an artificial one as you did before. And maybe a
> > link to goldbolt?
>
> I have no problem of other tests but this one, I wrote a simple
> artificial use case because most functions I found according to the
> report generated by bloat-o-meter, doesn't use GENMASK() directly or
> they're super long and GENMASK() only accounts for very small part of
> them, it wasn't very trivial to sense the difference of disassembly at
> least for me. Should I just pick 1~2 random use cases? or do you have
> any suggestions?
In bloat-o-meter output, pick some small function. The function size
is listed in 'old' and correspondingly 'new' columns. To begin with,
pick some with small difference. The cpuusage_read is one good example:
cpuusage_read 111 109 -2
It boils down to __cpuusage_read(), which is:
static u64 __cpuusage_read(struct cgroup_subsys_state *css,
enum cpuacct_stat_index index)
{
struct cpuacct *ca = css_ca(css);
u64 totalcpuusage = 0;
int i;
for_each_possible_cpu(i)
totalcpuusage += cpuacct_cpuusage_read(ca, i, index);
return totalcpuusage;
}
Now disassemble it like:
objdump --disassemble=cpuusage_read ../build-linux-x86_64/vmlinux.orig > cpuusage_read.orig
objdump --disassemble=cpuusage_read ../build-linux-x86_64/vmlinux.new > cpuusage_read.new
And if you put the listings side-by-side, you will see:
Orig New
ja ffffffff812efdf0 <cpuusage_read+0x60> ja ffffffff812efd2e <cpuusage_read+0x5e>
mov %r8,%rdx mov %r8,%rdx
shl %cl,%rdx shl %cl,%rdx
neg %rdx and %rax,%rdx
and %rax,%rdx je ffffffff812efd2e <cpuusage_read+0x5e>
je ffffffff812efdf0 <cpuusage_read+0x60> tzcnt %rdx,%rdx
tzcnt %rdx,%rdx
OK, now we see that new GENMASK saves one negation. Great success!
Can you repeat this exercise for few other functions both with
positive and negative impact and make a conclusion about how exactly
code generation is impacted?
> Btw, are you talking about Online Compiler Explorer? I don't really know
> what goldbolt means here, sorry XD .
https://godbolt.org/
The idea is that you can cherry-pick all necessary ifdefery from
kernel headers and build a simple example by using all supported
compilers. The godbolt's list of compilers is really long. You
will not need to install any of them locally to just check how
they work.
I actually cherry-picked GENMASK() recently for the other discussion,
so you can start from that:
https://lore.kernel.org/all/20250206020227.GA322224@yaz-khff2.amd.com/T/#m27bed06e15809d7b632966a1e861767c72a8722a
Thanks,
Yury
Powered by blists - more mailing lists