Message-ID: <Z-28YCNy08rwJZhR@dread.disaster.area>
Date: Thu, 3 Apr 2025 09:38:24 +1100
From: Dave Chinner <david@...morbit.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Michal Hocko <mhocko@...e.com>, Yafang Shao <laoar.shao@...il.com>,
	Harry Yoo <harry.yoo@...cle.com>, Kees Cook <kees@...nel.org>,
	joel.granados@...nel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Josef Bacik <josef@...icpanda.com>,
	linux-mm@...ck.org, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] proc: Avoid costly high-order page allocations when
 reading proc files

On Wed, Apr 02, 2025 at 06:24:10PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > > >+    /*
> > > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > > >+     * allocations.
> > > > > > >+     */
> > > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > > >
> > > > > > Why not move this check into kvmalloc family?
> > > > >
> > > > > Hmm should this check really be in kvmalloc family?
> > > > 
> > > > Modifying the existing kvmalloc functions risks performance regressions.
> > > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > > vmalloc over kmalloc) or kvmalloc_costless()?
> > > 
> > > We should fix kvmalloc() instead of continuing to force
> > > subsystems to work around the limitations of kvmalloc().
> > 
> > Agreed!
> > 
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > Hence if there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> 
> ... but if vmalloc fails, it goes around again!  This is exactly why
> we don't want filesystems implementing workarounds for MM problems.
> What a mess.

That's because we need __GFP_NOFAIL semantics for the overall
operation, and we can't pass that to kvmalloc() because it doesn't
support __GFP_NOFAIL. And when this code was written, vmalloc didn't
support __GFP_NOFAIL, either. We *had* to open code nofail
semantics, because the mm infrastructure did not provide it.
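
A minimal userspace sketch of that open-coded pattern, for readers who have not looked at the XFS code: one cheap attempt on the fast path, then the fallback, repeated until something succeeds. It is not the XFS implementation. The names alloc_fn, alloc_nofail, fast_always_fails and fallback_flaky are hypothetical, and malloc() stands in for the kernel allocators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator callback: returns NULL on failure. */
typedef void *(*alloc_fn)(size_t size);

/*
 * Userspace model of an open-coded "must not fail" allocation.
 * The fast path is tried once per iteration; if it fails, the
 * fallback is tried; the loop repeats until one of them succeeds.
 */
static void *alloc_nofail(size_t size, alloc_fn fast, alloc_fn fallback)
{
	void *p;

	assert(fast && fallback);
	do {
		p = fast(size);             /* models a no-reclaim, no-retry kmalloc() */
		if (!p)
			p = fallback(size); /* models vmalloc() */
	} while (!p);

	return p;
}

/* Demonstration stand-ins only: the fallback fails twice, then succeeds. */
static int fallback_failures_left = 2;

static void *fast_always_fails(size_t size)
{
	(void)size;
	return NULL;
}

static void *fallback_flaky(size_t size)
{
	if (fallback_failures_left > 0) {
		fallback_failures_left--;
		return NULL;
	}
	return malloc(size);
}
```

Calling alloc_nofail(64, fast_always_fails, fallback_flaky) goes around the loop three times before returning a buffer, which is the "goes around again" behaviour being objected to above.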

Yes, we can fix this now that __vmalloc(__GFP_NOFAIL) is a thing.
We still need to open code the kmalloc() side of the operation right
now because....

> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> >  			flags |= __GFP_NORETRY;

.... this is a built-in catch-22.

If we use kvmalloc(__GFP_NOFAIL), this code results in kmalloc
with __GFP_NORETRY | __GFP_NOFAIL flags set. i.e. we are telling
the allocation that it must not retry but it also must retry until
it succeeds.

To work around this, the caller then has to use __GFP_RETRY_MAYFAIL
| __GFP_NOFAIL, which is telling the allocation that it is allowed
to fail but it also must not fail. Again, this makes no sense at
all, and on top of that it doesn't give us the fast-fail semantics
we want from the kmalloc side of kvmalloc.
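
The two flag combinations described above can be checked with a small userspace model of the quoted hunk. This is only a sketch: the MODEL_* bit values are made up for illustration and are not the kernel's definitions, a 4KiB page size is assumed, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for this model; NOT the kernel's definitions. */
#define MODEL_GFP_NOWARN         (1u << 0)
#define MODEL_GFP_NORETRY        (1u << 1)
#define MODEL_GFP_RETRY_MAYFAIL  (1u << 2)
#define MODEL_GFP_NOFAIL         (1u << 3)
#define MODEL_PAGE_SIZE          4096u	/* assumed 4KiB pages */

/* Mirrors the quoted flag adjustment applied before the kmalloc attempt. */
static unsigned int model_kmalloc_flags(unsigned int flags, size_t size)
{
	if (size > MODEL_PAGE_SIZE) {
		flags |= MODEL_GFP_NOWARN;
		if (!(flags & MODEL_GFP_RETRY_MAYFAIL))
			flags |= MODEL_GFP_NORETRY;
	}
	return flags;
}

/* True when the flags say "never retry" and "never fail" at once. */
static bool model_noretry_and_nofail(unsigned int flags)
{
	return (flags & MODEL_GFP_NORETRY) && (flags & MODEL_GFP_NOFAIL);
}

/* True when the flags say "may fail" and "never fail" at once. */
static bool model_mayfail_and_nofail(unsigned int flags)
{
	return (flags & MODEL_GFP_RETRY_MAYFAIL) && (flags & MODEL_GFP_NOFAIL);
}

/* Self-check of the two cases: plain NOFAIL, then the workaround. */
static void model_self_check(void)
{
	unsigned int plain = model_kmalloc_flags(MODEL_GFP_NOFAIL, 8192);
	unsigned int workaround = model_kmalloc_flags(
		MODEL_GFP_NOFAIL | MODEL_GFP_RETRY_MAYFAIL, 8192);

	assert(model_noretry_and_nofail(plain));
	assert(!model_noretry_and_nofail(workaround));
	assert(model_mayfail_and_nofail(workaround));
}
```

Either way the caller ends up passing a contradictory pair of flags; the workaround only changes which pair it is.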

i.e. high order page allocation from kmalloc() is an optimisation,
not a requirement for kvmalloc(). If high order page allocation is
frequently more expensive than simply falling back to vmalloc(),
then we've made the wrong optimisation choices for the kvmalloc()
implementation...

> I think it might be better to do this:
> 
> 		flags |= __GFP_NOWARN;
> 
> 		if (!(flags & __GFP_RETRY_MAYFAIL))
> 			flags |= __GFP_NORETRY;
> +		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +			flags &= ~__GFP_DIRECT_RECLAIM;
> 
> I think it's entirely appropriate for a call to kvmalloc() to do
> direct reclaim if it's asking for, say, 16KiB and we don't have any of
> those available.

I disagree - we have background compaction to address the lack of
high order folios in the allocator reserves. Let that do the work of
resolving the internal resource shortage instead of slowing down
allocations that *do not require high order pages to be allocated*.
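
For scale, here is a userspace model of the proposed hunk as quoted, to show which request sizes would still enter direct reclaim under it. It is a sketch under assumptions: the PROP_* bit values are invented for the model, the page size is assumed to be 4KiB, the costly order is taken as 3, and the enclosing size > PAGE_SIZE check is assumed to be the one from the earlier quoted context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for this model; NOT the kernel's definitions. */
#define PROP_GFP_NOWARN          (1u << 0)
#define PROP_GFP_NORETRY         (1u << 1)
#define PROP_GFP_RETRY_MAYFAIL   (1u << 2)
#define PROP_GFP_DIRECT_RECLAIM  (1u << 3)
#define PROP_PAGE_SIZE           4096u	/* assumed 4KiB pages */
#define PROP_COSTLY_ORDER        3	/* assumed PAGE_ALLOC_COSTLY_ORDER */
#define PROP_COSTLY_BYTES        ((size_t)PROP_PAGE_SIZE << PROP_COSTLY_ORDER)

/* Mirrors the quoted proposal, with the outer size check assumed. */
static unsigned int prop_kmalloc_flags(unsigned int flags, size_t size)
{
	if (size > PROP_PAGE_SIZE) {
		flags |= PROP_GFP_NOWARN;
		if (!(flags & PROP_GFP_RETRY_MAYFAIL))
			flags |= PROP_GFP_NORETRY;
		else if (size > PROP_COSTLY_BYTES)
			flags &= ~PROP_GFP_DIRECT_RECLAIM;
	}
	return flags;
}

/* Self-check: the cut-off is 32KiB under the assumptions above. */
static void prop_self_check(void)
{
	unsigned int base = PROP_GFP_DIRECT_RECLAIM | PROP_GFP_RETRY_MAYFAIL;

	assert(PROP_COSTLY_BYTES == 32768);
	/* A 16KiB request keeps direct reclaim enabled. */
	assert(prop_kmalloc_flags(base, 16384) & PROP_GFP_DIRECT_RECLAIM);
	/* A 64KiB request has direct reclaim cleared. */
	assert(!(prop_kmalloc_flags(base, 65536) & PROP_GFP_DIRECT_RECLAIM));
}
```

Under this model a 16KiB kvmalloc() would still do direct reclaim before falling back, which is the behaviour I am arguing against for allocations that do not need physically contiguous memory.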

> Better than exacerbating the fragmentation problem by
> allocating 4x4KiB pages, each from different groupings.

We have no evidence that this allocation behaviour in XFS causes or
exacerbates memory fragmentation. We have been running it in
production systems for a few years now....

-Dave.
-- 
Dave Chinner
david@...morbit.com
