linux-kernel - Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks

Message-ID: <efa6eb69-cff3-421d-94c7-e37a9a1e26f8@suse.cz>
Date: Fri, 11 Apr 2025 18:51:51 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Mel Gorman <mgorman@...hsingularity.net>, Zi Yan <ziy@...dia.com>,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd
 watermarks

On 4/11/25 17:39, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
>> On 3/13/25 22:05, Johannes Weiner wrote:
>> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
>> > which helps, but produces only one block at a time. Allocation stalls
>> > and THP failure rates are still higher than they could be.
>> > 
>> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
>> > the watermarking for kswapd & kcompactd: instead of targeting the high
>> > watermark in order-0 pages and checking for one suitable block, simply
>> > require that the high watermark is entirely met in pageblocks.
>> 
>> Hrm.
> 
> Hrm!
> 
>> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>> >  	if (!pageblock_aligned(cc->migrate_pfn))
>> >  		return COMPACT_CONTINUE;
>> >  
>> > +	/*
>> > +	 * When defrag_mode is enabled, make kcompactd target
>> > +	 * watermarks in whole pageblocks. Because they can be stolen
>> > +	 * without polluting, no further fallback checks are needed.
>> > +	 */
>> > +	if (defrag_mode && !cc->direct_compaction) {
>> > +		if (__zone_watermark_ok(cc->zone, cc->order,
>> > +					high_wmark_pages(cc->zone),
>> > +					cc->highest_zoneidx, cc->alloc_flags,
>> > +					zone_page_state(cc->zone,
>> > +							NR_FREE_PAGES_BLOCKS)))
>> > +			return COMPACT_SUCCESS;
>> > +
>> > +		return COMPACT_CONTINUE;
>> > +	}
>> 
>> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
>> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
>> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
>> put us below high wmark quickly and then we never satisfy this?
> 
> The high watermark is not aligned, but why does it have to be? It's a
> binary condition: met or not met. Compaction continues until it's met.

What I mean is, kswapd will reclaim until the high watermark, which would be
32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
there's concurrent freeing pushing it above the high wmark)

> NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
> it'll really work until align_up(highmark, pageblock_nr_pages), as
> that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
> that seems reasonable, no?

How can it snap if it doesn't have enough free pages? Unlike kswapd,
kcompactd doesn't create them, only defragments.
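
To put numbers on the rounding concern, here is a minimal sketch in plain
userspace C. The helper names are hypothetical and are not kernel functions.
It assumes 512 pages per pageblock (which is what 16746 pages = 32.7 blocks
implies), treats "watermark met" as counter >= mark, and ignores lowmem
reserves and the per-order checks that __zone_watermark_ok() also performs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model only; these helpers are hypothetical and do not
 * exist in the kernel.
 */

/* Upper bound for a block-granular free counter: whole blocks only. */
static unsigned long max_free_block_pages(unsigned long free_pages,
					  unsigned long block_pages)
{
	assert(block_pages > 0);
	return (free_pages / block_pages) * block_pages;
}

/* Free pages needed before a block-granular counter can reach the mark. */
static unsigned long free_pages_needed(unsigned long wmark,
				       unsigned long block_pages)
{
	assert(block_pages > 0);
	return ((wmark + block_pages - 1) / block_pages) * block_pages;
}

/*
 * Can pure defragmentation (no additional freeing) ever satisfy the
 * mark? Simplification: "met" means counter >= mark.
 */
static bool kcompactd_can_finish(unsigned long free_pages,
				 unsigned long wmark,
				 unsigned long block_pages)
{
	return max_free_block_pages(free_pages, block_pages) >= wmark;
}
```

With the DMA32 numbers quoted in this thread, free memory sitting exactly at
the high watermark (16746 pages) yields at most 32 whole blocks, i.e. 16384
pages, so kcompactd_can_finish(16746, 16746, 512) is false. The counter can
only reach the mark once 33 blocks, i.e. 16896 pages, are free, and those
extra 150 pages have to come from reclaim or concurrent freeing.
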

> The allocator side is using low/min, so we have the conventional
> hysteresis between consumer and producer.

Sure but we cap kswapd at high wmark and the hunk quoted above also uses
high wmark so there's no hysteresis happening between kswapd and kcompactd?

> For illustration, on my 2G test box, the watermarks in DMA32 look like
> this:
> 
>   pages free     212057
>         boost    0
>         min      11164		(21.8 blocks)
>         low      13955		(27.3 blocks)
>         high     16746		(32.7 blocks)
>         promo    19537
>         spanned  456704
>         present  455680
>         managed  431617		(843.1 blocks)
> 
> So there are several blocks between the kblahds wakeup and sleep. The
> first allocation to cut into a whole free block will decrease
> NR_FREE_PAGES_BLOCKS by a whole block. But subsequent allocs that fill
> the remaining space won't change that counter. So the distance between
> the watermarks didn't fundamentally change (modulo block rounding).
> 
>> Doesn't it then happen that, with defrag_mode, kcompactd in practice
>> basically always runs until the scanners meet?
> 
> Tracing kcompactd calls to compaction_finished() with defrag_mode:
> 
> @[COMPACT_CONTINUE]: 6955
> @[COMPACT_COMPLETE]: 19
> @[COMPACT_PARTIAL_SKIPPED]: 1
> @[COMPACT_SUCCESS]: 17
> @wakeuprequests: 3

OK that doesn't look that bad.

> Of course, similar to kswapd, it might not reach the watermarks and
> keep running if there is a continuous stream of allocations consuming
> the blocks it's making. Hence the ratio between wakeups & continues.
> 
> But when demand stops, it'll balance the high mark and quit.

Again, since kcompactd can only defragment free space and not create it, it
may be trying in vain?

[*] Now, while checking the code for the handover between kswapd and
kcompactd, I think I found another problem?

we have:
kswapd_try_to_sleep()
  prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
   pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
    - with defrag_mode we want the high watermark met in NR_FREE_PAGES_BLOCKS,
      but so far we have only been reclaiming and haven't woken up kcompactd,
      so this check actually prevents the wakeup?
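
The dependency described above can be modeled with a small userspace C
sketch. It is a simplified model under stated assumptions, not the kernel
code: struct zone_model and both functions are hypothetical, a single zone
stands in for the whole node, and "balanced" is reduced to one counter
compared against the high watermark.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone state that matters here. */
struct zone_model {
	unsigned long free_pages;	/* models NR_FREE_PAGES */
	unsigned long free_block_pages;	/* models NR_FREE_PAGES_BLOCKS */
	unsigned long high_wmark;	/* models high_wmark_pages() */
};

/*
 * Simplified pgdat_balanced(): with defrag_mode the mark is compared
 * against free pages held in whole free pageblocks, otherwise against
 * all free pages.
 */
static bool balanced_model(const struct zone_model *z, bool defrag_mode)
{
	unsigned long free = defrag_mode ? z->free_block_pages
					 : z->free_pages;

	return free >= z->high_wmark;
}

/*
 * Simplified kswapd_try_to_sleep(): kcompactd is only woken when the
 * sleep preparation, and therefore the balance check, succeeds.
 */
static bool kcompactd_woken_model(const struct zone_model *z,
				  bool defrag_mode)
{
	return balanced_model(z, defrag_mode);
}
```

Take a zone that kswapd has reclaimed up to the high watermark (16746 free
pages) with only 8192 of those pages in whole free blocks. Without
defrag_mode the model wakes kcompactd. With defrag_mode the balance check
fails, and kcompactd, the only thread that could turn the free pages into
whole blocks, is never woken on this path.
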