Message-ID: <CAHk-=wjbWTbRKDP=Yb9VWBGjSBEGB3dJ0=--+7-4oA2n1=1FKw@mail.gmail.com>
Date: Sat, 30 Dec 2023 12:41:12 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: David Laight <David.Laight@...lab.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>, "longman@...hat.com" <longman@...hat.com>,
"mingo@...hat.com" <mingo@...hat.com>, "will@...nel.org" <will@...nel.org>,
"boqun.feng@...il.com" <boqun.feng@...il.com>,
"xinhui.pan@...ux.vnet.ibm.com" <xinhui.pan@...ux.vnet.ibm.com>,
"virtualization@...ts.linux-foundation.org" <virtualization@...ts.linux-foundation.org>,
Zeng Heng <zengheng4@...wei.com>
Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@...lab.com> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
No, this is horrible.
The problem isn't the "this_cpu_ptr()", it's the rest of the code.
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
No. Both of these are crap.
> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;
> + }
The proper fix here is to not do that silly
node = this_cpu_ptr(&osq_node);
..
node->next = NULL;
dance at all, but to simply do
this_cpu_write(osq_node.next, NULL);
in the first place. That makes the whole thing just a single store off
the segment descriptor.
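A rough userspace analogy, not kernel code: C11 thread-local storage on x86-64 Linux is typically addressed relative to %fs, much as per-CPU data is addressed relative to %gs. The sketch below contrasts the "take the pointer first" style with a direct store by name. All names (`spin_node`, `osq_node_model`, `init_via_pointer`, `init_direct`, `self_pointer`) are hypothetical, and whether the compiler really emits a single segment-relative store depends on the TLS model and optimisation level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-CPU osq_node: one copy per thread. */
struct spin_node {
    struct spin_node *next;
    struct spin_node *prev;
    int locked;
};

static _Thread_local struct spin_node osq_node_model;

/* Pointer-first style: form the address, then store through it.
 * This mirrors "node = this_cpu_ptr(&osq_node); node->next = NULL;". */
static struct spin_node *init_via_pointer(void)
{
    struct spin_node *node = &osq_node_model;

    assert(node != NULL);
    node->next = NULL;
    node->locked = 0;
    return node;
}

/* Direct style: store to the fields by name, so no pointer has to be
 * formed on this path. This mirrors "this_cpu_write(osq_node.next, NULL);". */
static void init_direct(void)
{
    osq_node_model.next = NULL;
    osq_node_model.locked = 0;
}

/* The raw pointer is only produced when something actually needs it,
 * for example to publish it to a predecessor. */
static struct spin_node *self_pointer(void)
{
    return &osq_node_model;
}
```

Comparing the generated assembly of `init_direct()` and `init_via_pointer()` (for example with `gcc -O2 -S`) is one way to see whether the address computation disappears on a given toolchain.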
Yes, you'll eventually end up doing that
node = this_cpu_ptr(&osq_node);
thing because it then wants to use that raw pointer to do
WRITE_ONCE(prev->next, node);
but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.
Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole
prev = decode_cpu(old);
dance. That *may* then result in avoiding turning them into pointers
at all in some cases.
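To make the "CPU numbers instead of pointers" idea concrete, here is a minimal single-threaded userspace model. It assumes the existing encoding (CPU number plus one, with 0 meaning "nobody"), uses a plain array in place of per-CPU storage, and replaces the atomic exchange on the tail with ordinary loads and stores. The names `cpu_node`, `nodes`, `NO_CPU`, `NR_CPUS_MODEL` and `enqueue` are made up for illustration.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4
#define NO_CPU 0 /* encoded value meaning "no CPU", like an unlocked tail */

/* Hypothetical node that links by encoded CPU number, not by pointer. */
struct cpu_node {
    int next;   /* encoded CPU number of the successor, or NO_CPU */
    int prev;   /* encoded CPU number of the predecessor, or NO_CPU */
    int locked;
};

/* Plain array standing in for the per-CPU osq_node instances. */
static struct cpu_node nodes[NR_CPUS_MODEL];

static int encode_cpu(int cpu_nr)
{
    return cpu_nr + 1;
}

static struct cpu_node *decode_cpu(int encoded)
{
    int cpu_nr = encoded - 1;

    assert(cpu_nr >= 0 && cpu_nr < NR_CPUS_MODEL);
    return &nodes[cpu_nr];
}

/* Queue cpu_nr behind the current tail and return the old (encoded) tail.
 * The kernel would use an atomic exchange here; this model is not atomic. */
static int enqueue(int *tail, int cpu_nr)
{
    int curr = encode_cpu(cpu_nr);
    int old = *tail;

    *tail = curr;
    nodes[cpu_nr].next = NO_CPU;
    nodes[cpu_nr].prev = old;

    /* A pointer is only needed when there is a predecessor to update. */
    if (old != NO_CPU)
        decode_cpu(old)->next = curr;
    return old;
}
```

In this model the uncontended path (old tail is `NO_CPU`) never turns a CPU number into a pointer to another node, which is the kind of saving the suggestion is after.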
Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like
#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)
and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. Those games cause silly code to be generated, like this:
subl $1, %eax #, cpu_nr
...
cltq
addq __per_cpu_offset(,%rax,8), %rcx
which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.
At that point, I think gcc might be able to just generate
addq __per_cpu_offset-8(,%rax,8), %rcx
but honestly, I think it would be nicer to just have decode_cpu() do
unsigned int cpu_nr = encoded_cpu_val;
return per_cpu_ptr(&osq_node, cpu_nr);
and not have the -1/+1 at all.
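Put together, the encoding without the +1 / -1 could look roughly like the userspace sketch below. `OSQ_UNLOCKED_VAL`, `IS_OSQ_UNLOCKED()`, `encode_cpu()` and `decode_cpu()` follow the names used above; the array `nodes_model`, the type `node_model` and `NR_CPUS_MODEL` are hypothetical stand-ins for the per-CPU storage and `per_cpu_ptr()`.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4

/* Negative means unlocked, so valid CPU numbers need no offset. */
#define OSQ_UNLOCKED_VAL (-1)
#define IS_OSQ_UNLOCKED(x) ((int)(x) < 0)

/* Hypothetical stand-in for the per-CPU node storage. */
struct node_model {
    int cpu;
};

static struct node_model nodes_model[NR_CPUS_MODEL];

/* With -1 as the unlocked value, encoding is the identity. */
static int encode_cpu(int cpu_nr)
{
    return cpu_nr;
}

/* Treating the value as unsigned lets the compiler zero-extend the
 * index instead of sign-extending it; the array lookup stands in for
 * per_cpu_ptr(&osq_node, cpu_nr). */
static struct node_model *decode_cpu(int encoded_cpu_val)
{
    unsigned int cpu_nr = encoded_cpu_val;

    assert(cpu_nr < (unsigned int)NR_CPUS_MODEL);
    return &nodes_model[cpu_nr];
}
```

Callers would be expected to check `IS_OSQ_UNLOCKED()` before calling `decode_cpu()`; in this sketch an unlocked value trips the assertion because -1 converts to a very large unsigned index.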
Hmm?
UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.
Linus
View attachment "patch.diff" of type "text/x-patch" (1083 bytes)