Message-ID: <xjtcom43unuubdtzj7pudew3m5yk34jdrhim5nynvoalk3bgbu@4aohsslg5c5m>
Date: Wed, 4 Sep 2024 12:05:56 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Michal Hocko <mhocko@...e.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	Christoph Hellwig <hch@....de>, Yafang Shao <laoar.shao@...il.com>, jack@...e.cz, 
	Vlastimil Babka <vbabka@...e.cz>, Dave Chinner <dchinner@...hat.com>, 
	Christian Brauner <brauner@...nel.org>, Alexander Viro <viro@...iv.linux.org.uk>, 
	Paul Moore <paul@...l-moore.com>, James Morris <jmorris@...ei.org>, 
	"Serge E. Hallyn" <serge@...lyn.com>, linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, 
	linux-bcachefs@...r.kernel.org, linux-security-module@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM

On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:
> On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
> [...]
> > However, if we agreed that GFP_NOFAIL meant "only fail if it is not
> > possible to satisfy this allocation" (and I have been arguing that that
> > is the only sane meaning) - then that could lead to a lot of error paths
> > getting simpler.
> >
> > Because there are a lot of places where there's essentially no good
> > reason to bubble up an -ENOMEM to userspace; if we're actually out of
> > memory the current allocation is just one out of many and not
> > particularly special, better to let the oom killer handle it...
> 
> This is exactly the GFP_KERNEL semantic for low order allocations, or
> kvmalloc for that matter. They simply never fail except in a couple of
> corner cases - e.g. the allocating task is an oom victim and all of the
> oom memory reserves have been consumed. This is what we call "not
> possible to allocate".

*nod*

Which does raise the question of why GFP_NOFAIL exists.

> > So the error paths would be more along the lines of "there's a bug, or
> > userspace has requested something crazy, just shut down gracefully".
> 
> How do you expect that to be done? Who is going to go over all those
> GFP_NOFAIL users? And what kind of guidelines should they follow? It is
> clear that they believe they cannot handle the failure gracefully,
> therefore they have requested GFP_NOFAIL. Many of them do not even have
> a return value to return.

They can't handle the allocation failure and continue normal operation,
but that's entirely different from not being able to handle the
allocation failure at all - it's not hard to do an emergency shutdown;
that's a normal thing for filesystems to do.

And if you scan for GFP_NOFAIL uses in the kernel, a decent number
already do just that.

> So really, what do you expect proper GFP_NOFAIL users to do, and what
> should happen to those that are requesting an unsupported size or
> allocation mode?

Emergency shutdown.

And I'm not saying we have to rush to fix all the existing callers;
they're clearly in existing, well tested code, so there's not much point
to that.

Additionally, most of them are deep in the guts of filesystem transaction
code, where the call paths to that site are limited - they're not library
code that gets called from just anywhere.

But as a matter of policy going forward, yes we should be saying that
even GFP_NOFAIL allocations should be checking for -ENOMEM.
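
Concretely, the pattern I have in mind looks something like this - purely
illustrative, the foo_* names are made up and not from any real
filesystem, and it assumes the "only fail if it's truly impossible to
satisfy" semantics argued for above:

#include <linux/gfp.h>
#include <linux/slab.h>

/* foo_fs, foo_txn and foo_fs_emergency_shutdown() are hypothetical */

static int foo_txn_preallocate(struct foo_fs *fs, struct foo_txn *txn,
			       size_t bytes)
{
	/*
	 * Even with __GFP_NOFAIL, check the result: under the semantics
	 * above, NULL means the request could never be satisfied (a bug,
	 * or a crazy userspace request), so shut down gracefully instead
	 * of bubbling -ENOMEM out of a random syscall or looping forever.
	 */
	txn->buf = kvmalloc(bytes, GFP_KERNEL | __GFP_NOFAIL);
	if (!txn->buf) {
		foo_fs_emergency_shutdown(fs);
		return -ENOMEM;
	}

	return 0;
}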

> Yes, we need to define some reasonable maximum supported sizes. For the
> page allocator this has been order > 1, and considering we have had a
> warning about those requests for years without a single report, we can
> assume we do not have such abusers. For kvmalloc the story is
> different. The current INT_MAX is just not any practical limit. Past
> experience says that anything based on the amount of memory just doesn't
> work (e.g. hash table sizes that used to use that scaling, and there are
> other examples). So we should be practical here, look at existing
> users, see what they really need, and put a cap above that.

Not following what you're saying about hash tables? Hash tables scale
roughly with the amount of system memory/workingset.

But it seems to me that the limit should be lower if you're on e.g. a 2
GB machine (not failing with a warning, just failing immediately rather
than oom killing a bunch of stuff first) - and it's going to need to be
raised above INT_MAX as large memory machines keep growing; I keep
hitting it in the bcachefs fsck code.
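
Roughly what I mean by scaling the cap with machine size - a sketch only,
the 1/16 fraction and the floor/ceiling below are numbers I made up for
illustration, not a concrete proposal:

#include <linux/minmax.h>
#include <linux/mm.h>		/* totalram_pages() */
#include <linux/sizes.h>

/* Hypothetical helper: cap on a single kvmalloc request, scaled with RAM */
static u64 kvmalloc_sane_max(void)
{
	u64 ram = (u64)totalram_pages() << PAGE_SHIFT;

	/*
	 * Fail requests bigger than ~1/16 of RAM immediately, clamped to
	 * [64M, 16G] so that tiny and huge machines both stay sane.
	 */
	return clamp_t(u64, ram / 16, SZ_64M, 16ULL << 30);
}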
