Message-ID: <20250331155517.GB2110528@cmpxchg.org>
Date: Mon, 31 Mar 2025 11:55:17 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Brendan Jackman <jackmanb@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>, Zi Yan <ziy@...dia.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/5] mm: page_alloc: defrag_mode

Hi Brendan,

On Sun, Mar 23, 2025 at 07:04:29PM +0100, Brendan Jackman wrote:
> On Sun Mar 23, 2025 at 4:46 AM CET, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> > > On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > > > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > > > + if (defrag_mode) {
> > > > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > > > + goto retry;
> > > > > > + }
> > > > >
> > > > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
> > >
> > > Please ignore my previous email, this is actually a much more severe
> > > issue than I thought at first. The screwed up clearing is bad, but
> > > this will also not check the flag before retrying, which means the
> > > thread will retry reclaim/compaction and never reach OOM.
> > >
> > > This code has had weeks of load testing, with workloads fine-tuned to
> > > *avoid* OOM. A blatant OOM test shows this problem immediately.
> > >
> > > A simple fix, but I'll put it through the wringer before sending it.
> >
> > Ok, here is the patch. I verified this with intentional OOMing 100
> > times in a loop; this would previously lock up on first try in
> > defrag_mode, but kills and recovers reliably with this applied.
> >
> > I also re-ran the full THP benchmarks, to verify that erroneous
> > looping here did not accidentally contribute to fragmentation
> > avoidance and thus THP success & latency rates. It did not;
> > the improvements claimed for defrag_mode are unchanged with this fix:
>
> Sounds good :)
>
> Off topic, but could you share some details about the
> tests/benchmarks you're running here? Do you have any links e.g. to
> the scripts you're using to run them?

Sure! The numbers I quoted here are from a dual workload of kernel
build and THP allocation bursts. The kernel build is an x86_64
defconfig, -j16 on 8 cores (no ht). I boot this machine with mem=1800M
to make sure there is some memory pressure, but not hopeless
thrashing. Filesystem and conventional swap on an older SATA SSD.

While the kernel builds, every 20s another worker mmaps 80M, madvises
for THP, measures the time to memset-fault the range in, and unmaps.
THP policy is upstream defaults: enabled=always, defrag=madvise. So
the kernel build itself will also optimistically consume THPs, but
only the burst allocations will direct reclaim/compact for them.

Aside from that - and this is a lot less scientific - I just run the
patches on the machines I use every day, looking for interactivity
problems, kswapd or kcompactd going crazy, and generally paying
attention to how well they cope under pressure compared to upstream.
My desktop is an 8G ARM machine (with zswap), so it's almost always
under some form of memory pressure. It's also using 16k pages and
order-11 pageblocks (32M THPs), which adds extra spice.