[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z-0sjd8SEtldbxB1@tiehlicka>
Date: Wed, 2 Apr 2025 14:24:45 +0200
From: Michal Hocko <mhocko@...e.com>
To: Dave Chinner <david@...morbit.com>
Cc: Yafang Shao <laoar.shao@...il.com>, Harry Yoo <harry.yoo@...cle.com>,
Kees Cook <kees@...nel.org>, joel.granados@...nel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Josef Bacik <josef@...icpanda.com>, linux-mm@...ck.org,
Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] proc: Avoid costly high-order page allocations when
reading proc files
On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@...cle.com> wrote:
> > >
> > > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > > >
> > > >
> > > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@...il.com> wrote:
> > > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > > >test case:
> > > > >
> > > > > fd = open(PROC_FILE, O_RDONLY);
> > > > > size = read(fd, buff, 256KB);
> > > > > close(fd);
> > > > >
> > > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > > >we should also enhance the kernel to prevent these expensive high-order
> > > > >allocations.
> > > > >
> > > > >Signed-off-by: Yafang Shao <laoar.shao@...il.com>
> > > > >Cc: Josef Bacik <josef@...icpanda.com>
> > > > >---
> > > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > > >--- a/fs/proc/proc_sysctl.c
> > > > >+++ b/fs/proc/proc_sysctl.c
> > > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > > > error = -ENOMEM;
> > > > > if (count >= KMALLOC_MAX_SIZE)
> > > > > goto out;
> > > > >- kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > >+
> > > > >+ /*
> > > > >+ * Use vmalloc if the count is too large to avoid costly high-order page
> > > > >+ * allocations.
> > > > >+ */
> > > > >+ if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > >+ kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >
> > > > Why not move this check into kvmalloc family?
> > >
> > > Hmm should this check really be in kvmalloc family?
> >
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
>
> We should fix kvmalloc() instead of continuing to force
> subsystems to work around the limitations of kvmalloc().
Agreed!
> Have a look at xlog_kvmalloc() in XFS. It implements a basic
> fast-fail, no retry high order kmalloc before it falls back to
> vmalloc by turning off direct reclaim for the kmalloc() call.
> Hence if the there isn't a high-order page on the free lists ready
> to allocate, it falls back to vmalloc() immediately.
>
> For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> overhead by around 80% when compared to a standard kvmalloc()
> call. Numbers and profiles were documented in the commit message
> (reproduced in whole below)...
Btw. it would be really great to have such concerns to be posted to the
linux-mm ML so that we are aware of that.
kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
to express - I prefer SLAB allocator over vmalloc. I think we could make
the default kvmalloc slab path weaker by default as those who really
want slab already have means to achieve that. There is a risk of long
term fragmentation but I think this is worth trying
diff --git a/mm/util.c b/mm/util.c
index 60aa40f612b8..8386f6976d7d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -601,14 +601,18 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
* We want to attempt a large physically contiguous block first because
* it is less likely to fragment multiple larger blocks and therefore
* contribute to a long term fragmentation less than vmalloc fallback.
- * However make sure that larger requests are not too disruptive - no
- * OOM killer and no allocation failure warnings as we have a fallback.
+ * However make sure that larger requests are not too disruptive - i.e.
+ * do not direct reclaim unless physically continuous memory is preferred
+ * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
+ * working in the background but the allocation itself.
*/
if (size > PAGE_SIZE) {
flags |= __GFP_NOWARN;
if (!(flags & __GFP_RETRY_MAYFAIL))
flags |= __GFP_NORETRY;
+ else
+ flags &= ~__GFP_DIRECT_RECLAIM;
/* nofail semantic is implemented by the vmalloc fallback */
flags &= ~__GFP_NOFAIL;
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists