linux-kernel - Re: [DISCUSSION] proposed mctl() API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250604132121.GC1431@cmpxchg.org>
Date: Wed, 4 Jun 2025 09:21:21 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Matthew Wilcox <willy@...radead.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	"Liam R . Howlett" <Liam.Howlett@...cle.com>,
	David Hildenbrand <david@...hat.com>,
	Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
	Arnd Bergmann <arnd@...db.de>,
	Christian Brauner <brauner@...nel.org>,
	SeongJae Park <sj@...nel.org>, Usama Arif <usamaarif642@...il.com>,
	Mike Rapoport <rppt@...nel.org>, Barry Song <21cnbao@...il.com>,
	linux-mm@...ck.org, linux-arch@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
	Pedro Falcato <pfalcato@...e.de>
Subject: Re: [DISCUSSION] proposed mctl() API

On Mon, Jun 02, 2025 at 07:01:56PM +0100, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 05:14:23PM -0400, Johannes Weiner wrote:
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > > Barry's problem is that we're all nervous about possibly regressing
> > > performance on some unknown workloads.  Just try Barry's proposal, see
> > > if anyone actually compains or if we're just afraid of our own shadows.
> > 
> > I actually explained why I think this is a terrible idea. But okay, I
> > tried the patch anyway.
> 
> Sorry, I must've missed that one ;-(

Apologies for my tone. The discussion is spread out over too many
threads...

> > This is 'git log' on a hot kernel repo after a large IO stream:
> > 
> >                                      VANILLA                      BARRY
> > Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> > User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> > System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> > pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> > workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> > 
> > Clearly we can't generally ignore page cache hits just because the
> > mmaps() are intermittent.
> > 
> > The whole point is to cache across processes and their various
> > apertures into a common, long-lived filesystem space.
> > 
> > Barry knows something about the relationship between certain processes
> > and certain files that he could exploit with MADV_COLD-on-exit
> > semantics. But that's not something the kernel can safely assume. Not
> > without defeating the page cache for an entire class of file accesses.
> 
> So what about distinguishing between exited-normally processes (ie git
> log) vs killed-by-oom processes (ie Barry's usecase)?  Update the
> referenced bit in the first case and not the second?

In cloud environments, it's common to restart a workload immediately
after an OOM kill.

The hosts tend to handle a fairly dynamic mix of batch jobs and
semi-predictable user request load, all while also trying to target
decent average host utilization. Adapting to external load peaks is
laggy (spawning new workers, rebalancing).

In such setups, OOM conditions are generally assumed to be highly
transient. And quick restarts are important to avoid cascading
failures in the worker pool during load peaks.

So I don't think OOM is a good universal signal that the workload is
gone and the memory is cold.