linux-ext4 - Re: [PATCH 2/6] MM: improve documentation for __GFP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YWQmsESyyiea0zle@dhcp22.suse.cz>
Date:   Mon, 11 Oct 2021 13:57:36 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Dave Chinner <david@...morbit.com>
Cc:     NeilBrown <neilb@...e.de>, Vlastimil Babka <vbabka@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>, Jonathan Corbet <corbet@....net>,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-nfs@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL

On Sat 09-10-21 09:36:49, Dave Chinner wrote:
> On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote:
> > __GFP_NOFAIL usage is a risk on its own. It is a hard requirement that
> > the allocator cannot back off.
[...]
> 
> No, "allocator cannot back off" isn't a hard requirement for most
> GFP_NOFAIL uses. *Not failing the allocation* is the hard
> requirement.

We are talking about the same thing here I belive. By cannot back off I
really mean cannot fail. Just for the clarification.

> How long it takes for the allocation to actually succeed is
> irrelevant to most callers, and given that we are replacing loops
> that do
> 
> 	while (!(p = kmalloc(sizeof(*p), GFP_KERNEL))
> 
> with __GFP_NOFAIL largely indicates that allocation *latency* and/or
> deadlocks are not an issue here.

Agreed. 

> Indeed, if we deadlock in XFS because there is no memory available,
> that is *not a problem kmalloc() should be trying to solve*. THe
> problem is the caller being unable to handle allocation failure, so
> if allocation cannot make progress, that needs to be fixed by the
> caller getting rid of the unfailable allocation.
> 
> The fact is that we've had these loops in production code for a
> couple of decades and these subsystems just aren't failing or
> deadlocking with such loops. IOWs, we don't need __GFP_NOFAIL to dig
> deep into reserves or drive the system to OOM killing - we just need
> to it keep retrying the same allocation until it succeeds.
> 
> Put simply, we want "retry forever" semantics to match what
> production kernels have been doing for the past couple of decades,
> but all we've been given are "never fail" semantics that also do
> something different and potentially much more problematic.
> 
> Do you see the difference here? __GFP_NOFAIL is not what we
> need in the vast majority of cases where it is used. We don't want
> the failing allocations to drive the machine hard into critical
> reserves, we just want the allocation to -eventually succeed- and if
> it doesn't, that's our problem to handle, not kmalloc()....

I can see your point. I do have a recollection that there were some
instance involved where an emergency access to memory reserves helped
in OOM situations.

Anway as I've tried to explain earlier that this all is an
implementation detail users of the flag shouldn't really care about. If
this heuristic is not doing any good then it should be removed.

[...]
> > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > > > > restriction w.r.t. gfp flags just makes no sense at all.
> > > > 
> > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> > > > the vmalloc. Hence it is not supported. If you use the scope API then
> > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> > > > define these conditions in a more sensible way. Special case NOFS if the
> > > > scope api is in use? Why do you want an explicit NOFS then?
> 
> Exactly my point - this is clumsy and a total mess. I'm not asking
> for an explicit GFP_NOFS, just pointing out that the documented
> restrictions that "vmalloc can only do GFP_KERNEL allocations" is
> completely wrong.
> 
> vmalloc()
> {
> 	if (!(gfp_flags &  __GFP_FS))
> 		memalloc_nofs_save();
> 	p = __vmalloc(gfp_flags | GFP_KERNEL)
> 	if (!(gfp_flags &  __GFP_FS))
> 		memalloc_nofs_restore();
> }
> 
> Yup, that's how simple it is to support GFP_NOFS support in
> vmalloc().

Yes, this would work from the functionality POV but it defeats the
philosophy behind the scope API. Why would you even need this if the
scope was defined by the caller of the allocator? The initial hope was
to get rid of the NOFS abuse that can be seen in many filesystems. All
allocations from the scope would simply inherit the NOFS semantic so
an explicit NOFS shouldn't be really necessary, right?

> This goes along with the argument that "it's impossible to do
> GFP_NOFAIL with vmalloc" as I addressed above. These things are not
> impossible, but we hide behind "we don't want people to use vmalloc"
> as an excuse for having shitty behaviour whilst ignoring that
> vmalloc is *heavily used* by core subsystems like filesystems
> because they cannot rely on high order allocations succeeding....

I do not think there is any reason to discourage anybody from using
vmalloc these days. 32b is dying out and vmalloc space is no longer a
very scarce resource.

> It also points out that the scope API is highly deficient.
> We can do GFP_NOFS via the scope API, but we can't
> do anything else because *there is no scope API for other GFP
> flags*.
> 
> Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API?

NO{FS,IO} where first flags to start this approach. And I have to admit
the experiment was much less successful then I hoped for. There are
still thousands of direct NOFS users so for some reason defining scopes
is not an easy thing to do.

I am not against NOFAIL scopes in principle but seeing the nofs
"success" I am worried this will not go really well either and it is
much more tricky as NOFAIL has much stronger requirements than NOFS.
Just imagine how tricky this can be if you just call a library code
that is not under your control within a NOFAIL scope. What if that
library code decides to allocate (e.g. printk that would attempt to do
an optimistic NOWAIT allocation).

> That
> would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM?
> I'd really like to turn that off for allocations in the XFS
> transaction commit path (as noted already in this thread) because
> direct reclaim that can make no progress is actively harmful (as
> noted already in this thread)

As always if you have reasonable usecases then it is best to bring them
up on the MM list and we can discuss them.

> Like I said - this is more than just bad documentation - the problem
> is that the whole allocation API is an inconsistent mess of control
> mechanisms to begin with...

I am not going to disagree. There is a lot of historical baggage and
it doesn't help that any change is really hard to review because this
interface is used throughout the kernel. I have tried to change some
most obvious inconsistencies and I can tell this has always been a
frustrating experience with a very small "reward" in the end because
there are so many other problems.

That being said, I would more than love to have a consistent and well
defined interface and if you want to spend a lot of time on that then be
my guest.

> > > It would seem to make sense for kvmalloc to WARN_ON if it is passed
> > > flags that does not allow it to use vmalloc.
> > 
> > vmalloc is certainly not the hottest path in the kernel so I wouldn't be
> > opposed.
> 
> kvmalloc is most certainly becoming one of the hottest paths in XFS.
> IOWs, arguments that "vmalloc is not a hot path" are simply invalid
> these days because they are simply untrue. e.g. the profiles I
> posted in this thread...

Is it such a hot path that a check for compatible flags would be visible
in profiles though?
-- 
Michal Hocko
SUSE Labs