Message-ID: <20200814180141.GP4295@paulmck-ThinkPad-P72>
Date: Fri, 14 Aug 2020 11:01:41 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Michal Hocko <mhocko@...e.com>
Cc: Uladzislau Rezki <urezki@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
LKML <linux-kernel@...r.kernel.org>, RCU <rcu@...r.kernel.org>,
linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Matthew Wilcox <willy@...radead.org>,
"Theodore Y . Ts'o" <tytso@....edu>,
Joel Fernandes <joel@...lfernandes.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Oleksiy Avramchenko <oleksiy.avramchenko@...ymobile.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag
On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote:
> On Fri 14-08-20 06:34:50, Paul E. McKenney wrote:
> > On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote:
> > > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > > > > Michal Hocko <mhocko@...e.com> writes:
> > > > > [...]
> > > > > > > Why should we limit the functionality of the allocator for something
> > > > > > > that is not a real problem?
> > > > > >
> > > > > > We'd limit the allocator for exactly ONE new user which was aware of
> > > > > > this problem _before_ the code hit mainline. And that ONE user is
> > > > > > prepared to handle the fail.
> > > > >
> > > > > If we are to limit the functionality to this one particular user then
> > > > > I would consider a dedicated gfp flag a huge overkill. It would be much
> > > > > > easier to have a preallocated pool of pages and use those and
> > > > > completely avoid the core allocator. That would certainly only shift the
> > > > > complexity to the caller but if it is expected there would be only that
> > > > > single user then it would be probably better than opening a can of worms
> > > > > like allocator usable from raw spin locks.
> > > > >
> > > > > Vlastimil raised the same question earlier; I answered, but let me answer again:
> > > >
> > > > > It is hard to achieve because the logic is not tied to any fixed test
> > > > > case, i.e. it depends on how heavily kfree_rcu(single/double) is used.
> > > > > That load determines how many pages accumulate before the drain/reclaimer
> > > > > thread frees them.
> > >
> > > > How many pages are we talking about - ball park? 100s, 1000s?
> >
> > Under normal operation, a couple of pages per CPU, which would make
> > preallocation entirely reasonable. Except that if someone does something
> > > that floods RCU callbacks (close(open()) in a tight userspace loop, for but
> > one example), then 2000 per CPU might not be enough, which on a 64-CPU
> > system comes to about 500MB. This is beyond excessive for preallocation
> > on the systems I am familiar with.
> >
> > And the flooding case is where you most want the reclamation to be
> > efficient, and thus where you want the pages.
>
> I am not sure the page allocator would help you with this scenario
> unless you are on very large machines. Pagesets scale with the available
> memory and percpu_pagelist_fraction sysctl (have a look at
> pageset_set_high_and_batch). It is roughly 1000th of the zone size for
> each zone. You can check that in /proc/vmstat (my 8G machine)
Small systems might have ~64G. The medium-sized systems might have
~250G. There are a few big ones that might have 1.5T. None of the
/proc/vmstat files from those machines contain anything resembling
the list below, though.
> Node 0, zone DMA
> Not interesting at all
> Node 0, zone DMA32
> pagesets
> cpu: 0
> count: 242
> high: 378
> batch: 63
> cpu: 1
> count: 355
> high: 378
> batch: 63
> cpu: 2
> count: 359
> high: 378
> batch: 63
> cpu: 3
> count: 366
> high: 378
> batch: 63
> Node 0, zone Normal
> pagesets
> cpu: 0
> count: 359
> high: 378
> batch: 63
> cpu: 1
> count: 241
> high: 378
> batch: 63
> cpu: 2
> count: 297
> high: 378
> batch: 63
> cpu: 3
> count: 227
> high: 378
> batch: 63
>
> Besides that, do you need to be per-cpu? Having 1000 pages available and
> managed under your raw spinlock should be good enough already, no?
It needs to be almost entirely per-CPU for performance reasons. Plus
a user could do a tight close(open()) loop on each CPU.
> > This of course raises the question of how much memory the lockless caches
> > contain, but fortunately these RCU callback flooding scenarios also
> > involve process-context allocation of the memory that is being passed
> > to kfree_rcu(). That allocation should keep the lockless caches from
> > going empty in the common case, correct?
>
> Yes, those are refilled both on the allocation/free paths. But you
> cannot really rely on that to happen early enough.
So the really ugly scenarios with the tight loops normally allocate
something and immediately either call_rcu() or kfree_rcu() it.
But you are right, someone doing "rm -rf" on a large file tree
with lots of small files might not be doing that many allocations.
> Do you happen to have any numbers that would show the typical usage
> and how often the slow path has to be taken because pcp lists are
> depleted? In other words, even if we provide a completely lockless way
> to allocate memory, how useful would that be?
Not yet, but let's see what we can do.
Thanx, Paul