Message-ID: <CAG_fn=WKA_4Vzsh6yH+=5TFv-EMrX87pr+2c_ANoH_MOVpbe1w@mail.gmail.com>
Date: Fri, 29 Mar 2019 16:54:52 +0100
From: Alexander Potapenko <glider@...gle.com>
To: Paul McKenney <paulmck@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Dmitriy Vyukov <dvyukov@...gle.com>,
James Y Knight <jyknight@...gle.com>
Subject: Re: Potentially missing "memory" clobbers in bitops.h for x86
On Thu, Mar 28, 2019 at 5:22 PM Paul E. McKenney <paulmck@...ux.ibm.com> wrote:
>
> On Thu, Mar 28, 2019 at 03:14:12PM +0100, Alexander Potapenko wrote:
> > Hello,
> >
> > arch/x86/include/asm/bitops.h defines clear_bit(nr, addr) for
> > non-constant |nr| values as follows:
> >
> > void clear_bit(long nr, volatile unsigned long *addr) {
> >         asm volatile("lock; btr %1,%0"
> >                      : "+m" (*(volatile long *)addr)
> >                      : "Ir" (nr));
> > }
> > (https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/bitops.h#L111)
> >
> > According to the comments in the file, |nr| may be arbitrarily large.
> > However the assembly constraints only imply that the first unsigned
> > long value at |addr| is written to.
> > This may result in the compiler ignoring the effect of the asm directive.
> >
> > Consider the following example (https://godbolt.org/z/naTmjn):
> >
> > #include <stdio.h>
> >
> > void clear_bit(long nr, volatile unsigned long *addr) {
> >         asm volatile("lock; btr %1,%0"
> >                      : "+m" (*(volatile long *)addr)
> >                      : "Ir" (nr));
> > }
> >
> > unsigned long foo() {
> >         unsigned long addr[2] = {1, 2};
> >         clear_bit(65, addr);
> >         return addr[0] + addr[1];
> > }
> >
> > int main() {
> >         printf("foo: %lu\n", foo());
> > }
> >
> > Depending on the optimization level, the program may print either 1
> > (for -O0 and -O1) or 3 (for -O2 and -O3).
> > This is because on higher optimization levels GCC assumes that addr[1]
> > is unchanged and directly propagates the constant to the result.
> >
> > I suspect the definitions of clear_bit() and similar functions are
> > lacking the "memory" clobber.
> > But the whole file tends to be very picky about whether this clobber
> > is needed in each case, so if the clobber turns out to carry a
> > performance penalty we may need to consider alternative approaches
> > to fixing this code.
>
> Is there a way of indicating a clobber only on the specific location
> affected?
We're unaware of such a way.
> I suppose that one approach would be to calculate in C code
> a pointer to the specific element of the addr[] array, which would
> put the specific clobbered memory location onto the outputs list.
> Or keep the current calculation, but also add "addr[nr / sizeof(long)]"
> to the output list, thus telling the compiler exactly what is being
> clobbered. Assuming that actually works...
That works, but sometimes GCC emits the code calculating addr[nr/64]
before the BTR instruction for no apparent reason.
Adding this output to clear_bit() changes the sizes of around 120
kernel functions. I only checked a handful, but some contained code
like:

        cmovns %rdi, %rax
        sarq $6, %rax
        leaq (%rsi,%rax,8), %rax
Another idea we've come up with is to cast the pointer in BITOP_ADDR()
to a very big array: https://godbolt.org/z/apHvDc
Doing so prevents the compiler from caching the assembly inputs, but
has its own drawbacks.
First, it may lead the compiler to incorrect assumptions about the
size of the object passed into the assembly (which may in turn cause
it to e.g. drop some size checks).
Second, for smaller arrays these casts may result in excessive
clobbers (this happens in misc_register(), where GCC decides that the
input parameter |misc| may alias with the global |misc_minors|
bitmap).
Applying the cast to clear_bit() only changes the sizes of 3 kernel
functions: hub_port_reset(), misc_register(), and
tick_broadcast_setup_oneshot().
None of them shows visible semantic changes.
> Of course, this would force the compiler to actually compute the
> offset, which would slow things down. I have no idea whether this
> would be better or worse than just using the "memory" clobber.
Just adding the "memory" clobber to clear_bit() changes the sizes of 5
kernel functions (the three mentioned above, plus hub_activate() and
native_send_call_func_ipi()) by a small margin.
This probably means the performance impact of the clobber is
negligible in this case.
> Thanx, Paul
>
--
Alexander Potapenko
Software Engineer
Google Germany GmbH
Erika-Mann-Straße, 33
80636 München
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg