[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFULd4YsPnCfw-NV_0ck1_za=WPc-FxYKV1bb99UcAwMJ=34YA@mail.gmail.com>
Date: Mon, 2 Oct 2023 14:13:04 +0200
From: Uros Bizjak <ubizjak@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
Andy Lutomirski <luto@...nel.org>,
Ingo Molnar <mingo@...nel.org>, Nadav Amit <namit@...are.com>,
Brian Gerst <brgerst@...il.com>,
Denys Vlasenko <dvlasenk@...hat.com>,
"H . Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Borislav Petkov <bp@...en8.de>,
Josh Poimboeuf <jpoimboe@...hat.com>
Subject: Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
On Sun, Oct 1, 2023 at 11:47 PM Uros Bizjak <ubizjak@...il.com> wrote:
>
> On Sun, Oct 1, 2023 at 10:21 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > On Sun, 1 Oct 2023 at 12:53, Uros Bizjak <ubizjak@...il.com> wrote:
> > >
> > > Regarding x86 target specific code, the same functionality used for
> > > explicit address space is used internally to handle __thread
> > > qualifier.
> >
> > Ok, that's interesting, in that __thread is certainly widely used so
> > it will have seen testing.
> >
> > > Even *if* there are some issues with aliasing, the kernel
> > > is immune to them due to
> > >
> > > KBUILD_CFLAGS += -fno-strict-aliasing
> >
> > It's not aliasing I'd worry about. It's correctness.
> >
> > And indeed, the *very* first thing I tried shows that this is all very
> > very buggy in gcc.
> >
> > What did I try? A simple memory copy with a structure assignment.
> >
> > Try to compile this:
> >
> > #include <string.h>
> > struct a { long arr[30]; };
> >
> > __seg_fs struct a m;
> > void foo(struct a *dst) { *dst = m; }
> >
> > using the kernel compiler options (it's the "don't use sse/avx" ones
> > that matter):
> >
> > gcc -mno-avx -mno-sse -O2 -S t.c
> >
> > and look at the end result. It's complete and utter sh*t:
> >
> > foo:
> > xorl %eax, %eax
> > cmpq $240, %rax
> > jnb .L5
> > .L2:
> > movzbl %fs:m(%rax), %edx
> > movb %dl, (%rdi,%rax)
> > addq $1, %rax
> > cmpq $240, %rax
> > jb .L2
> > .L5:
> > ret
> >
> > to the point that I can only go "WTF"?
> >
> > I mean, it's not just that it does the copy one byte at a time. It
> > literally compares %rax to $240 just after it has cleared it. I look
> > at that code, and I go "a five-year old with a crayon could have done
> > better".
> >
> > In other words, no, we're not using this thing that generates that
> > kind of garbage.
> >
> > Somebody needs to open a bugzilla entry for this kind of code generation.
>
> Huh, this testcase triggers known issue with IVopts. I opened
> PR111657, but the issue with IVopts is already reported in PR79649
> [2].
Actually (or luckily), my preliminary analysis was wrong. This is just
the case of missing optimization, where target dependent code chose
the nonoptimal (but still correct) copy algorithm, under very unusual
circumstances. In GCC, the stringop can be implemented using several
(8) algorithms:
DEF_ALG (libcall, libcall)
DEF_ALG (rep_prefix_1_byte, rep_byte)
DEF_ALG (rep_prefix_4_byte, rep_4byte)
DEF_ALG (rep_prefix_8_byte, rep_8byte)
DEF_ALG (loop_1_byte, byte_loop)
DEF_ALG (loop, loop)
DEF_ALG (unrolled_loop, unrolled_loop)
DEF_ALG (vector_loop, vector_loop)
but some of them (rep_prefix ones) can not be used with non-default
address spaces. Obviously, vector_loop can not be used without SSE/AVX
instructions, so what remains is a severely limited selection of
algorithms. Target processors (-mtune=...) select their own selection
of algorithms, based on maximum copied block size. The generic tuning
selects:
static stringop_algs generic_memcpy[2] = {
{libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
{-1, libcall, false}}},
{libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
{-1, libcall, false}}}};
Now, rep_prefix_8_byte is not available with non-default address
space, so the algorithm falls back to libcall (the one after
unavailable algo). However, we can't call into the libc here (library
function also handles only default address space), so the target
independent part of the compiler emits "the-most-generic" one-byte
copy loop.
The "unusual circumstances" here are following:
- rep_prefix instructions are disabled, these are otherwise used by most targets
- vector loops are disabled
- libcall algo still produces correct code, with non-optimal loop instructions.
Very few users look into produced assembly to find the difference
between one-byte loop (admittedly with unwanted compare) and 8-byte
loop. Since the compiled code works as expected, there were no
bugreports for this, I would say minor issue.
As shown in the bugreport [1], the fix is to select the "loop"
fallback algorithm instead of libcall (a patch is effectively a couple
of lines) when non-default address space is used. The generated
assembly for the structure copy now reads:
xorl %eax, %eax
.L2:
movl %eax, %edx
addl $8, %eax
movq %gs:m(%rdx), %rcx
movq %rcx, (%rdi,%rdx)
cmpl $240, %eax
jb .L2
ret
The same assembly can be generated with -mstringop-strategy=loop
compiler argument. This argument will affect both, __seg_gs and
__thread copy loops, so (almost) the same code will be generated for
both loops (because, as claimed in previous post, the same compiler
code handles named and implicit name spaces).
> > Clang isn't much better, but at least it doesn't generate bad code. It
> > just crashes with an internal compiler error on the above trivial
> > test-case:
> >
> > fatal error: error in backend: cannot lower memory intrinsic in
> > address space 257
> >
> > which at least tells the user that they can't copy memory from that
> > address space. But once again shows that no, this feature is not ready
> > for prime-time.
> >
> > If literally the *first* thing I thought to test was this broken, what
> > else is broken in this model?
> >
> > And no, the kernel doesn't currently do the above kinds of things.
> > That's not the point. The point was "how well is this compiler support
> > tested". The answer is "not at all".
I don't agree with the above claims. The generated code was the
product of a too limited selection of available copy algorithms in
unusual circumstances, but even in the case of generic fallback code,
the generated code was *correct*. As said in the previous post, and
re-confirmed by the patch in the PR, the same code in GCC handles
implicit (__thread) and named address spaces. At the end of the day,
the problematic code was merely a missing-optimization (the bug with
the lowest severity in GCC).
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657
> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79649
Uros.
Powered by blists - more mailing lists