Date:   Tue, 31 May 2022 18:56:34 +0000
From:   Liam Howlett <liam.howlett@...cle.com>
To:     Guenter Roeck <linux@...ck-us.net>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Sven Schnelle <svens@...ux.ibm.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>
Subject: Re: [PATCH] mapletree-vs-khugepaged

* Liam R. Howlett <Liam.Howlett@...cle.com> [220530 13:38]:
> * Guenter Roeck <linux@...ck-us.net> [220519 17:42]:
> > On 5/19/22 07:35, Liam Howlett wrote:
> > > * Guenter Roeck <linux@...ck-us.net> [220517 10:32]:
> > > 
> > > ...
> > > > 
> > > > Another bisect result, boot failures with nommu targets (arm:mps2-an385,
> > > > m68k:mcf5208evb). Bisect log is the same for both.
> > > ...
> > > > # first bad commit: [bd773a78705fb58eeadd80e5b31739df4c83c559] nommu: remove uses of VMA linked list
> > > 
> > > I cannot reproduce this on my side, even with that specific commit.  Can
> > > you point me to the failure log, config file, etc?  Do you still see
> > > this with the fixes I've sent recently?
> > > 
> > 
> > This was in linux-next; most recently with next-20220517.
> > I don't know if that was up-to-date with your patches.
> > The problem seems to be memory allocation failures.
> > A sample log is at
> > https://kerneltests.org/builders/qemu-m68k-next/builds/1065/steps/qemubuildcommand/logs/stdio
> > The log history at
> > https://kerneltests.org/builders/qemu-m68k-next?numbuilds=30
> > will give you a variety of logs.
> > 
> > The configuration is derived from m5208evb_defconfig, with initrd
> > and command line embedded in the image. You can see the detailed
> > configuration updates at
> > https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/run-qemu-m68k.sh
> > 
> > Qemu command line is
> > 
> > qemu-system-m68k -M mcf5208evb -kernel vmlinux \
> >     -cpu m5208 -no-reboot -nographic -monitor none \
> >     -append "rdinit=/sbin/init console=ttyS0,115200"
> > 
> > with initrd from
> > https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/rootfs-5208.cpio.gz
> > 
> > I use qemu v6.2, but any recent qemu version should work.
> 
> I have qemu 7.0, which seems to change the default memory size from
> 32MB to 128MB.  This can be seen in your log here:
> 
> Memory: 27928K/32768K available (2827K kernel code, 160K rwdata, 432K rodata, 1016K init, 66K bss, 4840K reserved, 0K cma-reserved)
> 
> With 128MB the kernel boots.  With 64MB it also boots.  With 32MB it
> fails with an OOM.  Looking into it more, I see that the OOM is caused
> by a contiguous page allocation of 1MB (order 7 with 8K pages).  This
> can be seen in the log as well:
> 
> Running sysctl: echo: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL), nodemask=(null)
> ...
> nommu: Allocation of length 884736 from process 63 (echo) failed
> 
> This last log message above comes from the code path that uses
> alloc_pages_exact().
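> 
> As a quick sanity check on that order (illustrative arithmetic only,
> not kernel code), the numbers work out as follows:
> 
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         unsigned long page_size = 8192;  /* m68k 8K pages */
>         unsigned long len = 884736;      /* from the failure log */
>         unsigned long pages = (len + page_size - 1) / page_size;
>         unsigned int order = 0;
> 
>         while ((1UL << order) < pages)
>             order++;
> 
>         /* prints: 108 pages -> order 7 (1024 KB) */
>         printf("%lu pages -> order %u (%lu KB)\n", pages, order,
>                ((1UL << order) * page_size) >> 10);
>         return 0;
>     }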
> 
> I don't see why my 256-byte nodes (an order-0 allocation yields 32
> nodes) would fragment the memory beyond use on boot.  I have checked
> for some sort of massive leak by adding a static node counter to the
> code and have only ever hit ~12 nodes.  Consulting the OOM log from the
> above link again:
> 
> DMA: 0*8kB 1*16kB (U) 9*32kB (U) 7*64kB (U) 21*128kB (U) 7*256kB (U) 6*512kB (U) 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 8304kB
> 
> So to get to the point of breaking up a 1MB block, we'd need an obscene
> number of nodes.
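> 
> For scale, an illustrative calculation (again, not kernel code) of how
> little memory those ~12 nodes could possibly pin:
> 
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         unsigned long page_size = 8192, node_size = 256;
>         unsigned long nodes = 12;  /* peak node count observed above */
> 
>         /* prints: nodes per page: 32 */
>         printf("nodes per page: %lu\n", page_size / node_size);
>         /* prints: pages pinned: 1 -- and a single pinned page can
>          * keep at most one 1MB-aligned buddy block from coalescing. */
>         printf("pages pinned: %lu\n",
>                (nodes * node_size + page_size - 1) / page_size);
>         return 0;
>     }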
> 
> Furthermore, the OOM on boot does not always happen.  When boot
> succeeds without an OOM, slabinfo shows that maple_node has 32 active
> objects, which corresponds to a single order-0 allocation.  Boot does
> usually end in an OOM, though.  It is worth noting that slabinfo is
> lazy about counting active objects, so the real number is most likely
> lower than this value.
> 
> Does anyone have any idea why nommu would be getting this fragmented?

Answer: Why, yes.  Matthew does.  Using alloc_pages_exact() means we
allocate a huge chunk of memory and then free the leftovers immediately.
Those freed leftover pages are handed out on the next request - which
happens to come from the maple tree.
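
A rough userspace model of that reuse pattern (purely illustrative: the
LIFO free list below is a simplification, not the buddy allocator):

    /* An "alloc_pages_exact(884736)" takes an order-7 block (128
     * pages), keeps 108 and immediately frees the 20-page tail.  With
     * a last-in-first-out free list, the next single-page request -
     * think maple node slab - lands right inside what used to be the
     * contiguous 1MB block. */
    #include <stdio.h>

    #define NPAGES 4096

    static int freelist[NPAGES];
    static int top;

    static void free_page_nr(int pfn) { freelist[top++] = pfn; }
    static int alloc_page_nr(void)    { return freelist[--top]; }

    int main(void)
    {
        int i, first;

        for (i = NPAGES - 1; i >= 0; i--)
            free_page_nr(i);              /* pages 0..4095 start free */

        first = alloc_page_nr();          /* start of the 128-page block */
        for (i = 1; i < 128; i++)
            alloc_page_nr();              /* rest of the order-7 block */
        for (i = 108; i < 128; i++)
            free_page_nr(first + i);      /* free the unused tail */

        /* prints: next page handed out: 127 (inside the old 1MB block) */
        printf("next page handed out: %d (inside the old 1MB block)\n",
               alloc_page_nr());
        return 0;
    }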

It seems nommu is already so close to OOMing that this makes a
difference.  Attached is a patch which _almost_ solves the issue by
making it less likely that those pages are reused, but whether this
OOMs is still a matter of timing.  It reduces the likelihood by a large
margin: roughly 1 in 10 boots fail instead of 4 in 5.  This patch is
probably worth taking on its own, as it reduces memory fragmentation
for short-lived allocations that use alloc_pages_exact().

I changed the nommu code a bit to reduce memory usage as well.  During
a split event, I no longer delete and then re-add the VMA, and I
preallocate only once for the two writes associated with a split.  I
also moved my preallocation ahead of the call path that does
alloc_pages_exact().  This all but ensures we won't fragment the larger
chunks of memory, since a single page yields enough nodes to get at
least through boot.  However, the failure rate remained at 1 in 10 with
this change.

I had accepted that this all just worked before, but my setup is
different from Guenter's.  I am using buildroot-2022.02.1 and qemu 7.0
for my testing.  My configuration OOMs 12 out of 13 times without the
maple tree, so I think we actually lowered the memory pressure on boot
with these changes.  Obviously there is an element of timing that
causes variation in the testing, so exact numbers are not possible.

Thanks,
Liam


View attachment "0001-mm-page_alloc-Reduce-potential-fragmentation-in-make.patch" of type "text/x-diff" (1664 bytes)
