linux-kernel - Re: [PATCH] bcachefs: Switch to memalloc_flags

Message-ID: <czqac5lskwgsqoeba54omj5cfjouklnkgti6sl5a5n4kr7r7jv@bts5jicfq2dy>
Date: Sat, 31 Aug 2024 11:46:17 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Theodore Ts'o <tytso@....edu>
Cc: Dave Chinner <david@...morbit.com>, Michal Hocko <mhocko@...e.com>, 
	Matthew Wilcox <willy@...radead.org>, linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Dave Chinner <dchinner@...hat.com>
Subject: Re: [PATCH] bcachefs: Switch to memalloc_flags_do() for vmalloc
 allocations

On Thu, Aug 29, 2024 at 11:39:05PM GMT, Theodore Ts'o wrote:
> On Fri, Aug 30, 2024 at 12:27:11AM +1000, Dave Chinner wrote:
> > 
> > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > now. This was the default Irix kernel allocator behaviour (it had a
> > forwards progress guarantee and would never fail allocation unless
> > told it could do so). We've been using the same "guaranteed not to
> > fail" semantics on Linux since the original port started 25 years
> > ago via open-coded loops.
> 
> Ext3/ext4 doesn't have quite the history as XFS --- it's only been
> around for 23 years --- but we've also used __GFP_NOFAIL or its
> moral equivalent, e.g.:
> 
> > 	do {
> > 		p = kmalloc(size);
> > 	} while (!p);
> 
> For the entire existence of ext3.
> 
> > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > can fail due to external scoped memory allocation contexts.  This
> > will force us to revert all __GFP_NOFAIL allocations back to
> > open-coded will-not-fail loops.
> 
> The same will be true for ext4.  And as Dave has said, the MM
> developers want to have visibility to when file systems have basically
> said, "if you can't allow us to allocate memory, our only alternative
> is to cause user data loss, crash the kernel, or loop forever; we will
> choose the latter".  The MM developers tried to make __GFP_NOFAIL go
> away several years ago, and ext4 put the retry loop back.  As a result,
> the compromise was that the MM developers restored __GFP_NOFAIL, and
> the file systems developers have done their best to reduce the use of
> __GFP_NOFAIL as much as possible.
> 
> So if you try to break the __GFP_NOFAIL promise, both xfs and ext4 will
> go back to the retry loop.  And the MM devs will be sad, and they will
> forcibly revert your change to *their* code, even if that means
> breaking bcachefs.  Because otherwise, you will be breaking ext4 and
> xfs, and so we will go back to using a retry loop, which will be worse
> for Linux users.

__GFP_NOFAIL may be better than the retry loop, but it's still not good.

Consider what happens when you have a __GFP_NOFAIL allocation in a
critical IO path and the system is almost out of memory; yes, that
allocation will succeed _eventually_, but without any latency bounds.
When you're thrashing or being fork bombed, that allocation is
contending with everything else.

Much like a lock in a critical path, where the work done under the lock
grows as the system gets loaded, it's a contention point subject to
catastrophic failure.

Much better to preallocate, e.g. with a mempool, or have some other kind
of fallback.
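
To make the preallocation idea concrete, here is a minimal userspace
sketch of a reserve pool with a fallback path. It is not the kernel
mempool API and not code from bcachefs; struct reserve_pool, pool_init(),
pool_alloc(), pool_free(), pool_exit(), failing_alloc() and POOL_MIN are
all hypothetical names made up for illustration. The assumption is that
the reserve is filled at init time, while memory is still plentiful, so
the critical path never has to wait on the general allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of a reserve pool; NOT the kernel API. */
#define POOL_MIN 4

typedef void *(*alloc_fn)(size_t);

struct reserve_pool {
	void     *elems[POOL_MIN]; /* preallocated reserve elements */
	size_t    nr;              /* elements currently held in reserve */
	size_t    elem_size;
	alloc_fn  alloc;           /* normal allocator; allowed to fail */
};

/* Fill the reserve up front with malloc(). Returns 0 on success, -1 on failure. */
int pool_init(struct reserve_pool *p, size_t elem_size, alloc_fn alloc)
{
	p->nr = 0;
	p->elem_size = elem_size;
	p->alloc = alloc;

	while (p->nr < POOL_MIN) {
		void *e = malloc(elem_size);

		if (!e) {
			while (p->nr)
				free(p->elems[--p->nr]);
			return -1;
		}
		p->elems[p->nr++] = e;
	}
	return 0;
}

/*
 * Try the normal allocator first and fall back to the reserve if it fails.
 * Returns NULL only when the reserve is empty as well; this sketch leaves
 * it to the caller to wait for a pool_free() and retry in that case.
 */
void *pool_alloc(struct reserve_pool *p)
{
	void *e = p->alloc(p->elem_size);

	if (e)
		return e;
	return p->nr ? p->elems[--p->nr] : NULL;
}

/* Refill the reserve before handing memory back to the system. */
void pool_free(struct reserve_pool *p, void *e)
{
	assert(e != NULL);
	if (p->nr < POOL_MIN)
		p->elems[p->nr++] = e;
	else
		free(e);
}

/* Release whatever is still sitting in the reserve. */
void pool_exit(struct reserve_pool *p)
{
	while (p->nr)
		free(p->elems[--p->nr]);
}

/* Allocator that always fails, to simulate memory exhaustion. */
void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

With failing_alloc() standing in for an exhausted system, the first
POOL_MIN calls to pool_alloc() still succeed out of the reserve, and any
further allocation is bounded by how quickly in-flight elements come back
through pool_free() rather than by global memory pressure.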

It might work to do __GFP_NOFAIL|__GFP_HIGH in critical paths, but I've
never seen that investigated or tried.

And this is an area filesystem people really need to be thinking about.
Block layer gets this right, filesystems do not, and I suspect this is a
key contributor to our performance and behaviour sucking when we're
thrashing.

bcachefs puts a lot of effort into making sure we can run in bounded
memory, because I put a lot of emphasis on consistent performance and
bounded latency, not just winning benchmarks. There are only two
__GFP_NOFAIL allocations in bcachefs, and I'll likely remove both of
them when I get around to it.