[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wirUHJ8DWzWjb3ZxatjQaomeTzM4uPjemrkbSjVvYOaGQ@mail.gmail.com>
Date: Mon, 10 Jun 2024 10:58:17 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Rasmus Villemoes <linux@...musvillemoes.dk>
Cc: Peter Anvin <hpa@...or.com>, Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
Thomas Gleixner <tglx@...utronix.de>, Josh Poimboeuf <jpoimboe@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>, "the arch/x86 maintainers" <x86@...nel.org>,
linux-arch <linux-arch@...r.kernel.org>
Subject: Re: [PATCH] x86: add 'runtime constant' infrastructure
On Mon, 10 Jun 2024 at 02:50, Rasmus Villemoes <linux@...musvillemoes.dk> wrote:
>
> Well, I think it lacks a little. I made it so that an arch didn't need
> to implement support for all runtime_const_*() helpers
Honestly, normally the number of helpers would max out at probably
three or four.
The only helpers you need is for the different constant sizes and the
placement in the instructions. For example, on x86-64, you almost
certainly end up with just three fixup functions (byte, 32-bit and
64-bit) and then *maybe* a couple of "operation" helpers.
Honestly, the only op helpers that are likely to exist are
add/mask/shift. But I didn't add macros for creating these ops,
because I don't actually believe very many exist in the first place.
And yes, x86-64 ends up being fairly simple, because it doesn't have
multiple different ways of encoding immediates in different places of
the instruction. But even that isn't going to make for very many
operations.
In fact, one of the whole design points here was KISS - Keep It Simple
Stupid. Very much *not* trying to handle all operations. This is not
something that will ever be like the atomic ops, where we have a
plethora of them.
To use a runtime constant, the code has to be really *really*
critical. Honestly, I have only ever seen one such function.
__d_lookup_rcu() really has shown up for over a decace as one of the
hottest kernel functions on real and relevant loads.
For example, in your RAI patches six years ago, you did the dentry
cache, and you did the inode_cachep thing.
And honestly, while I've never seen the inode_cachep in a profile. If
you ever look up the value of inode_cachep, it's because you're doing
to allocate or free an inode, and the memory load is the *least* of
your problems.
> name because yes, much better than rai_), but could implement just
> runtime_const_load() and not the _shift_right thing,
Not a single relevant architecture would find that AT ALL useful.
If you have a constant shift, that constant will be encoded in the
shift instruction. Only completely broken and pointless architectures
would ever have to use a special "load constant into register, then
shift by register".
I'm sure such horrors exist - hardware designers sometimes have some
really odd hang-ups - but none of the relevant architectures do that.
> Otherwise, if we want to add some runtime_const_mask32() thing to
> support hash tables where we use that model, one would have to hook up
> support on all architectures at once.
See above: when we're talking about a couple of operations, I think
trying to abstract it is actually *wrong*. It makes the code less
readable, not more.
> Don't you need a cc clobber here?
"cc" clobbers aren't actually needed on x86. All inline asms are
assumed to clobber cc
It's not wrong to add them, but we generally don't.
> And for both, I think asm_inline is appropriate
Yup. Will do.
> I know it's a bit more typing, but the section names should be
> "runtime_const_shift" etc., not merely "runtime_shift".
I actually had that initially. It wasn't the "more typing". It was
"harder to read" because everything became so long and cumbersome.
The noise in the names basically ended up just overwhelming all the
RealCode(tm).
> > + RUNTIME_CONST(shift, d_hash_shift)
> > + RUNTIME_CONST(ptr, dentry_hashtable)
> > +
>
> Hm, doesn't this belong in the common linker script?
I actually went back and forth on this and had it both ways. I ended
up worrying that different architectures would want to have these
things in particular locations, closer to the text section etc.
IOW, I ended up putting it in the architecture-specific part because
that's where the altinstructions sections were, and it fit that
pattern.
It could possibly go into the
INIT_DATA_SECTION
thing in the generic macros, but it has to go *outside* the section
that is called ".init.data" because the section name has to match.
See why I didn't do that?
> I mean, if arm64 were to implement support for this, they'd also have to add this
> boilerplate to their vmlinux.lds.S?
See
https://lore.kernel.org/all/CAHk-=wiPSnaCczHp3Jy=kFjfqJa7MTQg6jht_FwZCxOnpsi4Vw@mail.gmail.com/
doing arm64 was always the plan - ti was doing perf profiles on arm64
that showed this nasty pattern to me once again (and honestly, showed
it much more: I have 14% of all time in __d_lookup_rcu() because while
this machine has lots of cores, it doesn't have lots of cache, so it's
all very nasty).
Notice how small that patch is. Yes, it adds two lines to the arm64
vmlinux.lds.S file. But everything else is literally just the required
"this is how you modify the instructions".
There is no extra fat there.
> Please keep the #includes together at the top of the file.
Yes, my original plan was to do
#ifdef CONFIG_RUNTIME_CONST
#include <asm/runtime-const.h>
.. do the const version of d_hash() ..
#else
.. do the non-const one ..
#endif
but the asm-generic approach meant that I didn't even need to make it
a CONFIG variable.
> > +static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
>
> Could you spend some words on this signature change? Why should this now
> (on 64BIT) take the full hash_len as argument, only to let the compiler
> (with the (u32) cast) or cpu (in the x86_64 case, via the %k modifier if
> I'm reading things right) only use the lower 32 bits?
I'll separate it out, but it boils down to both x86 and arm64 doing
32-bit shifts that clear the upper 32 bits.
So you do *not* want a separate instruction just to turn a "unsigned
long" into a "u32".
And while arm64 can take a 32-bit input register and output a 64-bit
output, on x86-64 the input and output registers are the same, and you
cannot tell the compiler that the upper 32 bits don't matter on input
to the asm.
End result: you actually want to keep the input to the constant shift
as the native register size.
But I'll separate out the d_hash() interface change from the actual
runtime constant part.
Linus
Powered by blists - more mailing lists