Message-ID: <20260122014034.223163-1-realwujing@gmail.com>
Date: Wed, 21 Jan 2026 20:40:10 -0500
From: Qiliang Yuan <realwujing@...il.com>
To: akpm@...ux-foundation.org
Cc: david@...nel.org,
	mhocko@...e.com,
	vbabka@...e.cz,
	willy@...radead.org,
	lance.yang@...ux.dev,
	hannes@...xchg.org,
	surenb@...gle.com,
	jackmanb@...gle.com,
	ziy@...dia.com,
	weixugc@...gle.com,
	rppt@...nel.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org,
	edumazet@...gle.com
Subject: Re: [PATCH v5] mm/page_alloc: boost watermarks on atomic allocation failure

On Wed, 21 Jan 2026 12:56:03 -0800 Andrew Morton <akpm@...ux-foundation.org> wrote:

> This seems sensible to me - dynamically boost reserves in response to
> sustained GFP_ATOMIC allocation failures.  It's very much a networking
> thing and I expect the networking people have been looking at these
> issues for years.  So let's start by cc'ing them!

Thank you for the feedback and for cc'ing the networking folks! I appreciate
your continued engagement throughout this patch series (v1-v5).

> Obvious question, which I think was asked before: what about gradually
> decreasing those reserves when the packet storm has subsided?
> 
> > v4:
> > - Introduced watermark_scale_boost and gradual decay via balance_pgdat.
> 
> And there it is, but v5 removed this.  Why?  Or perhaps I'm misreading
> the implementation.

You're absolutely right - v4 did include a gradual decay mechanism. The
evolution from v1 to v5 was driven by community feedback, and I'd like to
explain the rationale for each major change:

**v1 → v2**: Following your and Matthew Wilcox's feedback on v1, I:
- Reduced the boost from doubling (100%) to 50% increase
- Added a decay mechanism (5% every 5 minutes)
- Added debounce logic
- v1: https://lore.kernel.org/all/tencent_9DB6637676D639B4B7AEA09CC6A6F9E49D0A@qq.com/
- v2: https://lore.kernel.org/all/tencent_6FE67BA7BE8376AB038A71ACAD4FF8A90006@qq.com/

**v2 → v3**: Following Michal Hocko's suggestion to use watermark_scale_factor
instead of min_free_kbytes, I switched to the watermark_boost infrastructure.
This was a significant simplification that reused existing MM subsystem patterns.
- v3: https://lore.kernel.org/all/tencent_44B556221480D8371FBC534ACCF3CE2C8707@qq.com/

**v3 → v4**: Added watermark_scale_boost and gradual decay via balance_pgdat()
to provide more fine-grained control over the reclaim aggressiveness.
- v4: https://lore.kernel.org/all/tencent_D23BFCB69EA088C55AFAF89F926036743E0A@qq.com/

**v4 → v5**: Removed watermark_scale_boost for the following reasons:
- v5: https://lore.kernel.org/all/20260121065740.35616-1-realwujing@gmail.com/

1. **Natural decay exists**: The existing watermark_boost infrastructure already
   has a built-in decay path. When kswapd successfully reclaims memory and the
   zone becomes balanced, kswapd_shrink_node() automatically resets
   watermark_boost to 0. This happens organically without custom decay logic
   (see the sketch after this list).

2. **Simplicity**: The v4 approach added custom watermark_scale_boost tracking
   and manual decay in balance_pgdat(). This added complexity that duplicated
   functionality already present in the kswapd reclaim path.

3. **Production validation**: In our production environment (high-throughput
   networking workloads), the natural decay via kswapd proved sufficient. Once
   memory pressure subsides and kswapd successfully reclaims to the high
   watermark, the boost is cleared automatically within seconds.
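
A simplified illustration of the natural decay path described in point 1
(based on the behavior described above, not the actual mm/vmscan.c code;
the helper name is made up for this sketch):

```c
/*
 * Hypothetical sketch of the existing decay behavior described in
 * point 1 above, not the actual kswapd code: once the zone has been
 * reclaimed back above its (boosted) high watermark, the temporary
 * boost is dropped and the watermarks return to normal.
 */
static void sketch_decay_watermark_boost(struct zone *zone)
{
	if (!zone->watermark_boost)
		return;

	if (zone_watermark_ok_safe(zone, 0, high_wmark_pages(zone), 0))
		zone->watermark_boost = 0;
}
```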

However, I recognize this is a trade-off. The v4 gradual decay provided more
explicit control over the decay rate. If you or the networking maintainers feel
that explicit decay control is important for packet storm scenarios, I'm happy
to reintroduce the v4 approach or explore alternative decay strategies (e.g.,
time-based decay independent of kswapd success).
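
For illustration only, such a time-based decay could look roughly like the
sketch below (not part of any posted version; "last_boost_jiffies" is an
assumed per-zone field added purely for this example):

```c
/*
 * Purely illustrative time-based decay, independent of kswapd success.
 * Not part of v1-v5. "last_boost_jiffies" is an assumed per-zone
 * timestamp recorded whenever the boost is raised; this would be
 * called periodically, e.g. from kswapd or a deferred timer.
 */
static void sketch_time_decay_boost(struct zone *zone)
{
	if (!zone->watermark_boost)
		return;

	/* No atomic-failure trigger for over a minute: decay by ~6%. */
	if (time_after(jiffies, zone->last_boost_jiffies + 60 * HZ))
		zone->watermark_boost -= zone->watermark_boost >> 4;
}
```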

> > +	zone->watermark_boost = min(zone->watermark_boost +
> > +		max(pageblock_nr_pages, zone_managed_pages(zone) >> 10),
> 
> ">> 10" is a magic number.  What is the reasoning behind choosing this
> value?

Good catch. The ">> 10" (divide by 1024) was chosen to provide a
zone-proportional boost that scales with zone size:

- For a 1GB zone: ~1MB boost per trigger
- For a 16GB zone: ~16MB boost per trigger

The rationale:
1. **Proportionality**: Larger zones experiencing atomic allocation pressure
   likely need proportionally larger safety buffers. A fixed pageblock_nr_pages
   (typically 2MB) might be insufficient for large zones under heavy load.
   
2. **Conservative scaling**: 1/1024 (~0.1%) is aggressive enough to help during
   sustained pressure but conservative enough to avoid over-reclaim. This was
   empirically tuned based on our production workload.
   
3. **Production results**: In our high-throughput networking environment
   (100Gbps+ traffic bursts), this value reduced GFP_ATOMIC failures by ~95%
   without causing excessive kswapd activity or impacting normal allocations.

I should document this better. I propose adding a #define:

```c
/*
 * Boost watermarks by ~0.1% of zone size on atomic allocation pressure.
 * This provides zone-proportional safety buffers: ~1MB per 1GB of zone size.
 */
#define ATOMIC_BOOST_SCALE_SHIFT 10
```
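
With that define in place, the amount added in the hunk quoted above boils
down to something like the helper below (a sketch only, not the literal v5
diff; the min() cap from the original hunk, which is truncated in the quote,
would still be applied by the caller):

```c
/*
 * Sketch of the zone-proportional boost amount from the quoted hunk,
 * rewritten with the proposed define. Not the literal v5 code.
 */
static inline unsigned long atomic_boost_amount(struct zone *zone)
{
	/* At least one pageblock, otherwise ~0.1% of the zone's managed pages. */
	return max(pageblock_nr_pages,
		   zone_managed_pages(zone) >> ATOMIC_BOOST_SCALE_SHIFT);
}
```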

Best regards,
Qiliang Yuan
