Date: Tue, 4 Jun 2024 11:53:39 -0600
From: Yu Zhao <yuzhao@...gle.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Erhard Furtner <erhard_f@...lbox.org>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	linuxppc-dev@...ts.ozlabs.org, Johannes Weiner <hannes@...xchg.org>, 
	Nhat Pham <nphamcs@...il.com>, Chengming Zhou <chengming.zhou@...ux.dev>, 
	Sergey Senozhatsky <senozhatsky@...omium.org>, Minchan Kim <minchan@...nel.org>
Subject: Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC),
 nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

On Tue, Jun 4, 2024 at 11:34 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <erhard_f@...lbox.org> wrote:
> > > >
> > > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > > Yosry Ahmed <yosryahmed@...gle.com> wrote:
> > > >
> > > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > > have a very limited area of memory to allocate kernel memory from. One
> > > > > possible reason why that commit can cause an issue is because we will
> > > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > > >
> > > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > > >
> > > > > Also, could you check if the attached patch helps? It makes sure that
> > > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > > cache of each type.
> > > >
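
[Editor's note: the following is not the attached patch itself, just a rough sketch of the idea it describes, with illustrative names -- create the two kmem_caches once, globally, and let every zsmalloc pool share them instead of creating a per-pool pair.]

#include <linux/slab.h>
#include <linux/errno.h>

/* Illustrative: one global pair of caches shared by all zsmalloc pools. */
static struct kmem_cache *shared_handle_cache;
static struct kmem_cache *shared_zspage_cache;

static int zs_shared_caches_init(unsigned int handle_size,
				 unsigned int zspage_size)
{
	shared_handle_cache = kmem_cache_create("zs_handle", handle_size,
						0, 0, NULL);
	shared_zspage_cache = kmem_cache_create("zspage", zspage_size,
						0, 0, NULL);
	if (!shared_handle_cache || !shared_zspage_cache) {
		kmem_cache_destroy(shared_handle_cache);
		kmem_cache_destroy(shared_zspage_cache);
		return -ENOMEM;
	}
	return 0;
}

/*
 * Per-pool cache setup/teardown then just references these shared caches
 * instead of calling kmem_cache_create() once per pool.
 */
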
> > > > Thanks for looking into this! I got 'cat /proc/slabinfo' output for you from a good HEAD, from a bad HEAD, and from the bad HEAD with your patch applied.
> > > >
> > > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17, both taken from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happened. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB of RAM were exhausted and got the slabinfo then.
> > > >
> > > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference; I still got the kswapd0: page allocation failure.
> > >
> > > Thanks for trying this out. The patch reduces the amount of wasted
> > > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > > magnitude, but it was a small number to begin with (~250K).
> > >
> > > I cannot think of other reasons why having multiple zsmalloc pools
> > > would end up using more memory in the 0.25 GB zone that kernel
> > > allocations can be made from.
> > >
> > > The number of zpools can be made configurable or determined at runtime
> > > by the size of the machine, but I don't want to do this without
> > > understanding the problem here first. Adding other zswap and zsmalloc
> > > folks in case they have any ideas.
> >
> > Hi Erhard,
> >
> > If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> > on kernels before and after the bad commit? It'd be great if you could
> > run the grep command right before the OOM kill happens.
> >
> > The overall internal fragmentation of multiple zsmalloc pools might be
> > higher than that of a single pool. I suspect this might be the cause.
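
[Editor's note: for context, a rough sketch of how a multiple-zpool scheme spreads entries; the constant and helper names below are assumptions, not the exact code from the bisected commit. Each entry is hashed to one of N pools, so every pool keeps its own partially filled zspages per size class, and that per-pool slack is what can add up to higher overall internal fragmentation.]

#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/zpool.h>

#define NR_ZPOOLS 32	/* assumed pool count, for illustration only */

static struct zpool *pools[NR_ZPOOLS];

/* Spread entries across the pools by hashing the entry pointer. */
static struct zpool *pick_zpool(void *entry)
{
	return pools[hash_ptr(entry, ilog2(NR_ZPOOLS))];
}
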
>
> I thought about the internal fragmentation of pools, but zsmalloc
> should have access to highmem, and if I understand correctly the
> problem here is that we are running out of space in the DMA zone when
> making kernel allocations.
>
> Do you suspect zsmalloc is allocating memory from the DMA zone
> initially, even though it has access to highmem?

There was a lot of user memory in the DMA zone. So at some point the
highmem zone was full and allocation fallback happened.

The problem with zone fallback is that recent allocations go into
lower zones, meaning they are further back on the LRU list. This
applies to both user memory and zsmalloc memory -- the latter has a
writeback LRU. On top of this, neither the zswap shrinker nor the
zsmalloc shrinker (compaction) is zone aware. So page reclaim might
have trouble hitting the right target zone.
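
[Editor's note: to make the "not zone aware" point concrete, here is a minimal sketch of what a shrinker callback sees; the callback body is a placeholder, only the interface is the point. The shrink_control passed in carries a gfp mask, a NUMA node id and a scan budget, but nothing names a zone, so reclaim that is short on ZONE_DMA pages cannot direct the zswap or zsmalloc shrinkers at that zone specifically.]

#include <linux/shrinker.h>

static unsigned long demo_count_objects(struct shrinker *shrinker,
					 struct shrink_control *sc)
{
	/*
	 * sc->gfp_mask, sc->nid and sc->nr_to_scan are available here,
	 * but there is no zone in the interface -- the shrinker cannot
	 * tell whether the shortage is in ZONE_DMA, ZONE_NORMAL, etc.
	 */
	return 0;	/* placeholder count for this sketch */
}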

We can't really tell how zspages are distributed across zones, but the
overall number might be helpful. It'd be great if someone could make
nr_zspages per zone :)
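
[Editor's note: one possible shape for per-zone accounting, sketched under the assumption of a dedicated per-zone counter slot; NR_ZSPAGES_PER_ZONE below is a hypothetical zone_stat_item name, not an existing one. The idea is to bump a per-zone counter whenever zsmalloc acquires or releases a backing page, so the spread across DMA/NORMAL/HIGHMEM would show up in /proc/zoneinfo.]

#include <linux/mm.h>
#include <linux/vmstat.h>

/* NR_ZSPAGES_PER_ZONE is hypothetical; it would need a zone_stat_item slot. */
static void zs_account_backing_page(struct page *page, long delta)
{
	mod_zone_page_state(page_zone(page), NR_ZSPAGES_PER_ZONE, delta);
}

/* Called with +1 when zsmalloc allocates a backing page, -1 when it frees one. */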
