lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y8gLHnUwnu5DkH0B@feng-clx>
Date:   Wed, 18 Jan 2023 23:07:10 +0800
From:   Feng Tang <feng.tang@...el.com>
To:     Vlastimil Babka <vbabka@...e.cz>
CC:     Linus Torvalds <torvalds@...ux-foundation.org>,
        "Sang, Oliver" <oliver.sang@...el.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        "oe-lkp@...ts.linux.dev" <oe-lkp@...ts.linux.dev>,
        lkp <lkp@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>,
        "Song, Youquan" <youquan.song@...el.com>,
        "Andrea Arcangeli" <aarcange@...hat.com>, Jan Kara <jack@...e.cz>,
        John Hubbard <jhubbard@...dia.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        "Matthew Wilcox" <willy@...radead.org>,
        Michal Hocko <mhocko@...nel.org>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Hyeonggon Yoo <42.hyeyoo@...il.com>,
        "Yin, Fengwei" <fengwei.yin@...el.com>, <hongjiu.lu@...el.com>
Subject: Re: [linus:master] [hugetlb] 7118fc2906:
 kernel_BUG_at_lib/list_debug.c

On Wed, Jan 18, 2023 at 02:35:17PM +0100, Vlastimil Babka wrote:
> On 1/17/23 19:25, Linus Torvalds wrote:
> > On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@...el.com> wrote:
> >>
> >> With the following patch to use 'O1' instead 'O2' gcc optoin for
> >> page_alloc.c, the list corruption issue can't be reproduced for
> >> commit 7118fc2906 in 1000 runs.
> > 
> > Ugh.
> > 
> > It would be lovely if you could just narrow it down with
> > 
> >   #pragma GCC optimize ("O1")
> >  ...
> >   #pragma GCC optimize ("O2")
> > 
> > around just that prep_compound_page(), but when I tried it myself I
> > get some function attribute mismatch errors.
> > 
> > 
> >> As is can't be reproduced with X86_64 build, it could be i386
> >> compiling related.
> > 
> > Your particular config causes a huge amount of nasty 64-bit arithmetic
> > according to the objdump code, with sequences like
> > 
> >   c13b3cbb:       83 05 d0 28 6c c5 01    addl   $0x1,0xc56c28d0
> >   c13b3cc2:       83 15 d4 28 6c c5 00    adcl   $0x0,0xc56c28d4
> > 
> > which seems to be just from some coverage profiling being on
> > (CONFIG_GCOV?), or something. It makes it very hard to read the code.
> 
> I think whatever this is, it's also responsible for the bug.
> Now from all the dumps I've seen so far, the struct page corruptions looked
> like prep_compound_page() is called more times than it should. Which didn't
> make sense as the code is very simple, a loop of 1 << order - 1 iterations.
> 
> If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit),
> I can work out that edi holds the 1 << order, ebx is initialized as 1, and
> there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>".
> 
> Now the disassembly of the "bad" commit is much harder to follow when it
> comes to the iteration control. There are xors and or's involved so I didn't
> even try to work out the meaning. Clearly it can't be deterministically
> wrong or it would crash always and fast. But crucially I think it uses
> memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could
> explain the bug.
> 
> Before the loop iteration itself, it initializes several registers from
> these addreses, and the values end up modifying also registers used for the
> loop iteration control (ebx/esi) And crucially in the loop iteration it also
> writes some values to the same addresses, and there seems to be no
> synchronization whatsoever. So it seems to me, even if these addresses are
> supposed to be "private" to prep_compound_page(), running two
> prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did
> reproduce this with -smp 1 as well) can result in a race ultimately
> affecting the number of iterations taken.

Thanks for the decoding!

For the strange memory address 0xc56c28xx in the assembly, they are
GCOV related  memory for prep_compound_page(), as from the System.map:

	c56c28a0 b __gcov0.free_compound_page
	c56c28c0 b __gcov0.prep_compound_page
	c56c2918 b __gcov0.early_debug_pagealloc

And your analysis of racing of these region makes sense, that some
pointer could be changed when a race happens. From the memory dump
in previous thread, it was always one off issue, that page[2] for
order 1 and page[8] for order 3 were wrongly written.

Thanks,
Feng

> Attaching the disasm I annotated for more details. Everything register whose
> value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as
> "crap" there (sorry).
> 
> > You also have UBSAN enabled, which - again - makes for some really
> > grotty asm that hides any actual logic.
> > 
> > Finally, your objdump version also does some horrendous decoding, like
> > 
> >   c13b3e29:       8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
> > 
> > which is just a 7-byte 'nop' instruction, but again, it makes it
> > rather hard to actually read the code.
> > 
> > With the i386 defconfig, gcc generates a function that is just ~30
> > instructions for me, so this makes a huge difference in the legibility
> > of the code.
> > 
> > I wonder if you can recreate the issue with a much more
> > straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB
> > debugging on, but without the things like UBSAN and GCOV.
> > 
> >> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and
> >> its parent commit 48b8d744ea84, which have big difference than the
> >> simple 'set_page_count()' change, but I can't tell which part is
> >> abnormal, so attach them for further check.
> > 
> > Yeah, I can't make heads or tails of them either, see above on how
> > illegible the objdump files are. And that's despite not even having
> > all of prep_compound_page() in them (it's missing
> > prep_compound_page.cold, which is probably just UBSAN fixup code, but
> > who knows..)
> > 
> > That said, with the i386 defconfig, the only change from adding
> > set_page_count() to the loop seems to be exactly that:
> > 
> >  .L589:
> > -       movl    $1024, 12(%eax)
> > +       movl    $0, 28(%eax)
> >         addl    $32, %eax
> > +       movl    $1024, -20(%eax)
> >         movl    %esi, -28(%eax)
> >         movl    $0, -12(%eax)
> >         cmpl    %edx, %eax
> > 
> > (don't ask me why gcc does *one* access using the pre-incremented
> > pointer, and then the rest to the post-incremented ones, but whatever
> > - it means that it's not just "add a mov $0", it's also changing how
> > the
> > 
> >         p->mapping = TAIL_MAPPING;
> > 
> > instruction is done, which is that
> > 
> > -       movl    $1024, 12(%eax)
> > +       movl    $1024, -20(%eax)
> > 
> > part of the change)
> > 
> >              Linus

> inputs:
> eax: page
> edx: order
> 
> stack below ebp
> -0x04 saved edi
> -0x08 saved esi
> -0x0c saved ebp
> (sub    $0x14,%esp)
> -0x10  crap
> -0x14  crap
> -0x18  page|1 (compound_head value)
> -0x1c  copy of order
> -0x20  page
> -0x24
> esp
> 
> eax = page|1 (compound_head value) / crap
> edx = page / crap
> edi = page / crap
> ebx = 1 << order (=2) / 1 << order -2 (=0) / crap!
> ecx = page + 1 (first tail page)
> esi = 0 / crap
> 
> c13b3b90 <prep_compound_page>:
> c13b3b90:	55                   	push   %ebp
> c13b3b91:	89 e5                	mov    %esp,%ebp
> c13b3b93:	57                   	push   %edi
> c13b3b94:	89 c7                	mov    %eax,%edi                            edi = page
> c13b3b96:	56                   	push   %esi
> c13b3b97:	53                   	push   %ebx
> c13b3b98:	83 ec 14             	sub    $0x14,%esp
> c13b3b9b:	83 fa 1f             	cmp    $0x1f,%edx                           
> c13b3b9e:	89 55 e4             	mov    %edx,-0x1c(%ebp)                    
> c13b3ba1:	0f 87 33 31 ed 01    	ja     c3286cda <prep_compound_page.cold>   only when order > 31 (?) 
> c13b3ba7:	0f b6 4d e4          	movzbl -0x1c(%ebp),%ecx                     ecx = order
> c13b3bab:	bb 01 00 00 00       	mov    $0x1,%ebx                            ebx = 1
> c13b3bb0:	d3 e3                	shl    %cl,%ebx                             ebx = 1 << order  (=2)
> c13b3bb2:	83 3f ff             	cmpl   $0xffffffff,(%edi)
> c13b3bb5:	0f 84 65 02 00 00    	je     c13b3e20 <prep_compound_page+0x290>  only when page flags == 0xffffffff (?)
> c13b3bbb:	83 05 d0 28 6c c5 01 	addl   $0x1,0xc56c28d0
> c13b3bc2:	83 15 d4 28 6c c5 00 	adcl   $0x0,0xc56c28d4
> c13b3bc9:	0f ba 2f 10          	btsl   $0x10,(%edi)                         sets bit 0x10 (PG_head) in page flags
> c13b3bcd:	83 05 f0 28 6c c5 01 	addl   $0x1,0xc56c28f0
> c13b3bd4:	83 15 f4 28 6c c5 00 	adcl   $0x0,0xc56c28f4
> c13b3bdb:	83 fb 01             	cmp    $0x1,%ebx
> c13b3bde:	0f 8e 80 00 00 00    	jle    c13b3c64 <prep_compound_page+0xd4>   nr_pages <= 1 -> skip over the whole loop
> c13b3be4:	8d 47 01             	lea    0x1(%edi),%eax                       eax = page|1 (compound_head value)
> c13b3be7:	8b 15 fc 28 6c c5    	mov    0xc56c28fc,%edx                      edx now some crap, we don't know how this was initialized                                          !!!
> c13b3bed:	89 45 e8             	mov    %eax,-0x18(%ebp)                     saves the page|1                                                                                   
> c13b3bf0:	a1 f8 28 6c c5       	mov    0xc56c28f8,%eax                      eax now some crap, again we don't know how this was initialized                                    !!!
> c13b3bf5:	8d 4f 28             	lea    0x28(%edi),%ecx                      ecx = edi + 0x28 - first tail page
> c13b3bf8:	89 7d e0             	mov    %edi,-0x20(%ebp)                     saves page pointer
> c13b3bfb:	83 c0 01             	add    $0x1,%eax                            crap + 1 
> c13b3bfe:	89 45 ec             	mov    %eax,-0x14(%ebp)                     saves the crap
> c13b3c01:	83 d2 00             	adc    $0x0,%edx                            crap in crap out
> c13b3c04:	a1 f8 28 6c c5       	mov    0xc56c28f8,%eax                      reset as crap
> c13b3c09:	89 55 f0             	mov    %edx,-0x10(%ebp)                     save more crap
> c13b3c0c:	8b 15 fc 28 6c c5    	mov    0xc56c28fc,%edx                      reset as crap
> c13b3c12:	83 eb 02             	sub    $0x2,%ebx                            ebx is nr_pages - 2 (for order-1 was 2, now 0) - remaining tail pages after first iteration?
> c13b3c15:	31 f6                	xor    %esi,%esi                            esi is 0
> c13b3c17:	83 c0 02             	add    $0x2,%eax                            crap + 2
> c13b3c1a:	83 d2 00             	adc    $0x0,%edx                            crap
> c13b3c1d:	01 c3                	add    %eax,%ebx                            ebx is now modified by crap !!!                                                                    !!!
> c13b3c1f:	8b 45 ec             	mov    -0x14(%ebp),%eax                     reset as crap
> c13b3c22:	11 d6                	adc    %edx,%esi                            esi now also crap
> c13b3c24:	8b 55 f0             	mov    -0x10(%ebp),%edx                     reset as crap
> c13b3c27:	89 f7                	mov    %esi,%edi                            edi now also crap, oh my
> c13b3c29:	89 de                	mov    %ebx,%esi                            supposed to be the remaining tail pages, byt modified by crap
> c13b3c2b:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi                elaborate nop
> c13b3c2f:	90                   	nop                                         nop
> 
> // this should be the loop for prep_compound_tail
> c13b3c30:	a3 f8 28 6c c5       	mov    %eax,0xc56c28f8                      uh now we saved some value to the memory we read previously assuming some known good value?        !!!
> c13b3c35:	8b 5d e8             	mov    -0x18(%ebp),%ebx                     ebx = page|1 (compound_head value)
> c13b3c38:	83 c0 01             	add    $0x1,%eax                            still crap
> c13b3c3b:	89 15 fc 28 6c c5    	mov    %edx,0xc56c28fc                      and now we wrote to the other address as well?                                                     !!!
> c13b3c41:	83 d2 00             	adc    $0x0,%edx                            still crap
> c13b3c44:	83 c1 28             	add    $0x28,%ecx                           ecx pointed to first tail pages, now points to second tail page
> c13b3c47:	c7 41 e4 00 04 00 00 	movl   $0x400,-0x1c(%ecx)                   (first tail page)->mapping = TAIL_MAPPING;
> c13b3c4e:	89 59 dc             	mov    %ebx,-0x24(%ecx)                     (first tail page)->compound_head is now set
> c13b3c51:	89 fb                	mov    %edi,%ebx                            ebx now crap
> c13b3c53:	31 d3                	xor    %edx,%ebx                            ebx xored with more crap
> c13b3c55:	89 5d ec             	mov    %ebx,-0x14(%ebp)                     we might need that crap later
> c13b3c58:	89 f3                	mov    %esi,%ebx                            supposed to be the remaining tail pages, byt modified by crap
> c13b3c5a:	31 c3                	xor    %eax,%ebx                            yeah why not xor it with more crap
> c13b3c5c:	0b 5d ec             	or     -0x14(%ebp),%ebx                     and "or" it with the previously saved crap
> c13b3c5f:	75 cf                	jne    c13b3c30 <prep_compound_page+0xa0>   and that's how we decided if we should do another iteration :(
> 
> c13b3c61:	8b 7d e0             	mov    -0x20(%ebp),%edi                     restore edi = page
> 
> // here we land if we skip everything at c13b3bde
> c13b3c64:	c6 47 30 01          	movb   $0x1,0x30(%edi)
> // the rest shouldn't be important

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ