linux-kernel - Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wjbWTbRKDP=Yb9VWBGjSBEGB3dJ0=--+7-4oA2n1=1FKw@mail.gmail.com>
Date: Sat, 30 Dec 2023 12:41:12 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: David Laight <David.Laight@...lab.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"peterz@...radead.org" <peterz@...radead.org>, "longman@...hat.com" <longman@...hat.com>, 
	"mingo@...hat.com" <mingo@...hat.com>, "will@...nel.org" <will@...nel.org>, 
	"boqun.feng@...il.com" <boqun.feng@...il.com>, 
	"xinhui.pan@...ux.vnet.ibm.com" <xinhui.pan@...ux.vnet.ibm.com>, 
	"virtualization@...ts.linux-foundation.org" <virtualization@...ts.linux-foundation.org>, 
	Zeng Heng <zengheng4@...wei.com>
Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@...lab.com> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-84).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -       struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +       struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

>         struct optimistic_spin_node *prev, *next;
>         int old;
>
> -       if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -               node->cpu = encode_cpu(smp_processor_id());
> +       if (unlikely(!node)) {
> +               int cpu = encode_cpu(smp_processor_id());
> +               node = decode_cpu(cpu);
> +               node->self = node;
> +               node->cpu = cpu;
> +       }

The proper fix here is to not do that silly

        node = this_cpu_ptr(&osq_node);
        ..
        node->next = NULL;

dance at all, but to simply do

        this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.

Yes, you'll eventually end up doing that

        node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

        WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

        prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

  #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

        subl    $1, %eax        #, cpu_nr
...
        cltq
        addq    __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

        addq    __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

        unsigned int cpu_nr = encoded_cpu_val;

        return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

                 Linus

View attachment "patch.diff" of type "text/x-patch" (1083 bytes)