lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2b857e20-5e3a-13ec-a0b0-1f69d2d047a5@suse.cz>
Date:   Wed, 18 Jan 2023 14:35:17 +0100
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Feng Tang <feng.tang@...el.com>
Cc:     "Sang, Oliver" <oliver.sang@...el.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        "oe-lkp@...ts.linux.dev" <oe-lkp@...ts.linux.dev>,
        lkp <lkp@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>,
        "Song, Youquan" <youquan.song@...el.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Jan Kara <jack@...e.cz>, John Hubbard <jhubbard@...dia.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Matthew Wilcox <willy@...radead.org>,
        Michal Hocko <mhocko@...nel.org>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Hyeonggon Yoo <42.hyeyoo@...il.com>,
        "Yin, Fengwei" <fengwei.yin@...el.com>, hongjiu.lu@...el.com
Subject: Re: [linus:master] [hugetlb] 7118fc2906:
 kernel_BUG_at_lib/list_debug.c

On 1/17/23 19:25, Linus Torvalds wrote:
> On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@...el.com> wrote:
>>
>> With the following patch to use 'O1' instead 'O2' gcc optoin for
>> page_alloc.c, the list corruption issue can't be reproduced for
>> commit 7118fc2906 in 1000 runs.
> 
> Ugh.
> 
> It would be lovely if you could just narrow it down with
> 
>   #pragma GCC optimize ("O1")
>  ...
>   #pragma GCC optimize ("O2")
> 
> around just that prep_compound_page(), but when I tried it myself I
> get some function attribute mismatch errors.
> 
> 
>> As is can't be reproduced with X86_64 build, it could be i386
>> compiling related.
> 
> Your particular config causes a huge amount of nasty 64-bit arithmetic
> according to the objdump code, with sequences like
> 
>   c13b3cbb:       83 05 d0 28 6c c5 01    addl   $0x1,0xc56c28d0
>   c13b3cc2:       83 15 d4 28 6c c5 00    adcl   $0x0,0xc56c28d4
> 
> which seems to be just from some coverage profiling being on
> (CONFIG_GCOV?), or something. It makes it very hard to read the code.

I think whatever this is, it's also responsible for the bug.
Now from all the dumps I've seen so far, the struct page corruptions looked
like prep_compound_page() is called more times than it should. Which didn't
make sense as the code is very simple, a loop of 1 << order - 1 iterations.

If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit),
I can work out that edi holds the 1 << order, ebx is initialized as 1, and
there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>".

Now the disassembly of the "bad" commit is much harder to follow when it
comes to the iteration control. There are xors and or's involved so I didn't
even try to work out the meaning. Clearly it can't be deterministically
wrong or it would crash always and fast. But crucially I think it uses
memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could
explain the bug.

Before the loop iteration itself, it initializes several registers from
these addreses, and the values end up modifying also registers used for the
loop iteration control (ebx/esi) And crucially in the loop iteration it also
writes some values to the same addresses, and there seems to be no
synchronization whatsoever. So it seems to me, even if these addresses are
supposed to be "private" to prep_compound_page(), running two
prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did
reproduce this with -smp 1 as well) can result in a race ultimately
affecting the number of iterations taken.

Attaching the disasm I annotated for more details. Everything register whose
value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as
"crap" there (sorry).

> You also have UBSAN enabled, which - again - makes for some really
> grotty asm that hides any actual logic.
> 
> Finally, your objdump version also does some horrendous decoding, like
> 
>   c13b3e29:       8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
> 
> which is just a 7-byte 'nop' instruction, but again, it makes it
> rather hard to actually read the code.
> 
> With the i386 defconfig, gcc generates a function that is just ~30
> instructions for me, so this makes a huge difference in the legibility
> of the code.
> 
> I wonder if you can recreate the issue with a much more
> straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB
> debugging on, but without the things like UBSAN and GCOV.
> 
>> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and
>> its parent commit 48b8d744ea84, which have big difference than the
>> simple 'set_page_count()' change, but I can't tell which part is
>> abnormal, so attach them for further check.
> 
> Yeah, I can't make heads or tails of them either, see above on how
> illegible the objdump files are. And that's despite not even having
> all of prep_compound_page() in them (it's missing
> prep_compound_page.cold, which is probably just UBSAN fixup code, but
> who knows..)
> 
> That said, with the i386 defconfig, the only change from adding
> set_page_count() to the loop seems to be exactly that:
> 
>  .L589:
> -       movl    $1024, 12(%eax)
> +       movl    $0, 28(%eax)
>         addl    $32, %eax
> +       movl    $1024, -20(%eax)
>         movl    %esi, -28(%eax)
>         movl    $0, -12(%eax)
>         cmpl    %edx, %eax
> 
> (don't ask me why gcc does *one* access using the pre-incremented
> pointer, and then the rest to the post-incremented ones, but whatever
> - it means that it's not just "add a mov $0", it's also changing how
> the
> 
>         p->mapping = TAIL_MAPPING;
> 
> instruction is done, which is that
> 
> -       movl    $1024, 12(%eax)
> +       movl    $1024, -20(%eax)
> 
> part of the change)
> 
>              Linus

View attachment "objdump_annotated.txt" of type "text/plain" (6776 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ