linux-kernel - Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFULd4YWjxoSTyCtMN0OzKgHtshMQOuMH1Z0n_OaWKVnUjy2iA@mail.gmail.com>
Date:   Sun, 8 Oct 2023 21:17:48 +0200
From:   Uros Bizjak <ubizjak@...il.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Andy Lutomirski <luto@...nel.org>,
        Ingo Molnar <mingo@...nel.org>, Nadav Amit <namit@...are.com>,
        Brian Gerst <brgerst@...il.com>,
        Denys Vlasenko <dvlasenk@...hat.com>,
        "H . Peter Anvin" <hpa@...or.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Borislav Petkov <bp@...en8.de>,
        Josh Poimboeuf <jpoimboe@...hat.com>
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Sun, Oct 8, 2023 at 8:00 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Wed, 4 Oct 2023 at 07:51, Uros Bizjak <ubizjak@...il.com> wrote:
> >
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows to use C code instead, which enables the compiler to perform
> > various optimizations (e.g. propagation of memory arguments). Convert
> > percpu read and write accessors to C code, so the memory argument can
> > be propagated to the instruction that uses this argument.
>
> So apparently this causes boot failures.
>
> It might be worth testing a version where this:
>
> > +#define raw_cpu_read_1(pcp)            __raw_cpu_read(, pcp)
> > +#define raw_cpu_read_2(pcp)            __raw_cpu_read(, pcp)
> > +#define raw_cpu_read_4(pcp)            __raw_cpu_read(, pcp)
> > +#define raw_cpu_write_1(pcp, val)      __raw_cpu_write(, pcp, val)
> > +#define raw_cpu_write_2(pcp, val)      __raw_cpu_write(, pcp, val)
> > +#define raw_cpu_write_4(pcp, val)      __raw_cpu_write(, pcp, val)
>
> and this
>
> > +#ifdef CONFIG_X86_64
> > +#define raw_cpu_read_8(pcp)            __raw_cpu_read(, pcp)
> > +#define raw_cpu_write_8(pcp, val)      __raw_cpu_write(, pcp, val)
>
> was all using 'volatile' in the qualifier argument and see if that
> makes the boot failure go away.
>
> Because while the old code wasn't "asm volatile", even just a *plain*
> asm() is certainly a lot more serialized than a normal access.
>
> For example, the asm() version of raw_cpu_write() used "+m" for the
> destination modifier, which means that if you did multiple percpu
> writes to the same variable, gcc would output multiple asm calls,
> because it would see the subsequent ones as reading the old value
> (even if they don't *actually* do so).
>
> That's admittedly really just because it uses a common macro for
> raw_cpu_write() and the updates (like the percpu_add() code), so the
> fact that it uses "+m" instead of "=m" is just a random odd artifact
> of the inline asm version, but maybe we have code that ends up working
> just by accident.
>
> Also, I'm not sure gcc re-orders asms wrt each other - even when they
> aren't marked volatile.
>
> So it might be worth at least a trivial "make everything volatile"
> test to see if that affects anything.

I have managed to reproduce the bug, and I'm trying the following path:

Scrap the last patch and just add:

#define __raw_cpu_read_new(size, qual, pcp)                \
({                                    \
    *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));        \
})

#define __raw_cpu_read(size, qual, _var)                \
({                                    \
    __pcpu_type_##size pfo_val__;                    \
    asm qual (__pcpu_op2_##size("mov", __percpu_arg([var]), "%[val]") \
        : [val] __pcpu_reg_##size("=", pfo_val__)            \
        : [var] "m" (__my_cpu_var(_var)));                \
    (typeof(_var))(unsigned long) pfo_val__;            \
})

Then changing *only*

#define raw_cpu_read_8(pcp)            __raw_cpu_read_new(8, , pcp)

brings the boot process far enough to report:

[    0.463711][    T0] Dentry cache hash table entries: 1048576
(order: 11, 8388608 bytes, linear)
[    0.465859][    T0] Inode-cache hash table entries: 524288 (order:
10, 4194304 bytes, linear)
PANIC: early exception 0x0d IP 10:ffffffff810c4cb9 error 0 cr2
0xffff8881ab1ff000
[    0.469084][    T0] CPU: 0 PID: 0 Comm: swapper Not tainted
6.5.0-11417-gca4256348660-dirty #7
[    0.470756][    T0] RIP: 0010:cpu_init_exception_handling+0x179/0x740
[    0.472045][    T0] Code: be 0f 00 00 00 4a 03 04 ed 40 19 15 85 48
89 c7 e8 9c bb ff ff 48 c7 c0 10 73 02 00 48 ba 00 00 00 00 00 fc ff
df 48 c1 e8 03 <80> 3c 10
00 0f 85 21 05 00 00 65 48 8b 05 45 26 f6 7e 48 8d 7b 24
[    0.475784][    T0] RSP: 0000:ffffffff85207e38 EFLAGS: 00010002
ORIG_RAX: 0000000000000000
[    0.477384][    T0] RAX: 0000000000004e62 RBX: ffff88817700a000
RCX: 0000000000000010
[    0.479093][    T0] RDX: dffffc0000000000 RSI: ffffffff85207e60
RDI: ffff88817700f078
[    0.481178][    T0] RBP: 000000000000f000 R08: 0040f50000000000
R09: 0040f50000000000
[    0.482655][    T0] R10: ffff8881ab02a000 R11: 0000000000000000
R12: 1ffffffff0a40fc8
[    0.484128][    T0] R13: 0000000000000000 R14: 0000000000000000
R15: ffffffff85151940
[    0.485604][    T0] FS:  0000000000000000(0000)
GS:ffff888177000000(0000) knlGS:0000000000000000
[    0.487246][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.488515][    T0] CR2: ffff8881ab1ff000 CR3: 00000000052d7000
CR4: 00000000000000b0
[    0.490002][    T0] Call Trace:
[    0.490600][    T0]  <TASK>
[    0.491145][    T0]  ? early_fixup_exception+0x10e/0x280
[    0.492176][    T0]  ? early_idt_handler_common+0x2f/0x40
[    0.493222][    T0]  ? cpu_init_exception_handling+0x179/0x740
[    0.494348][    T0]  ? cpu_init_exception_handling+0x164/0x740
[    0.495472][    T0]  ? syscall_init+0x1c0/0x1c0
[    0.496351][    T0]  ? per_cpu_ptr_to_phys+0x1ca/0x2c0
[    0.497336][    T0]  ? setup_cpu_entry_areas+0x138/0x980
[    0.498365][    T0]  trap_init+0xa/0x40

Let me see what happens here. I have changed *only* raw_cpu_read_8,
but the GP fault is reported in cpu_init_exception_handling, which
uses this_cpu_ptr. Please note that all per-cpu initializations go
through existing {this|raw}_cpu_write.

void cpu_init_exception_handling(void)
{
struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
int cpu = raw_smp_processor_id();
...

I have tried the trick with all reads volatile (and writes as they
were before the patch), but it didn't make a difference. Also, the
kernel from the git/tip branch works OK for default config, so I think
there is some config option that somehow disturbs the named-as enabled
kernel.

Uros.