Message-ID: <20200423180249.GT17661@paulmck-ThinkPad-P72>
Date: Thu, 23 Apr 2020 11:02:49 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Joel Fernandes <joel@...lfernandes.org>,
Uladzislau Rezki <urezki@...il.com>,
linux-kernel@...r.kernel.org,
Josh Triplett <josh@...htriplett.org>,
Lai Jiangshan <jiangshanlai@...il.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
rcu@...r.kernel.org, Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation
On Thu, Apr 23, 2020 at 01:48:31PM -0400, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 08:35:03AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 22, 2020 at 10:57:52AM -0400, Johannes Weiner wrote:
> > > On Thu, Apr 16, 2020 at 11:01:00AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Apr 16, 2020 at 09:17:45AM -0400, Joel Fernandes wrote:
> > > > > On Thu, Apr 16, 2020 at 12:30:07PM +0200, Uladzislau Rezki wrote:
> > > > > > > I have a question about dynamic attaching of the rcu_head. Do you think
> > > > > > > that we should drop it? We have it because it requires only 8 + sizeof(struct rcu_head)
> > > > > > > bytes and is used when we cannot allocate one page, which is much more, for the array.
> > > > > > > Dynamic attaching can therefore succeed, because it uses SLAB and requests much
> > > > > > > less memory than one page. There will be a higher chance of bypassing synchronize_rcu()
> > > > > > and inlining freeing on a stack.
> > > > > >
> > > > > > > I agree that we should not use GFP_* flags; instead we could go with GFP_NOWAIT |
> > > > > > > __GFP_NOWARN, for head attaching only. I would also drop GFP_ATOMIC to keep
> > > > > > > the atomic reserves for others.
> > > >
> > > > I must defer to people who understand the GFP flags better than I do.
> > > > The suggestion of __GFP_RETRY_MAYFAIL for no memory pressure (or maybe
> > > > when the CPU's reserve is not yet full) and __GFP_NORETRY otherwise came
> > > > from one of these people. ;-)
> > >
> > > The exact flags we want here depend somewhat on the rate and size of
> > > kfree_rcu() bursts we can expect. We may want to start with one set
> > > and instrument allocation success rates.
> > >
> > > Memory tends to be fully consumed by the filesystem cache, so some
> > > form of light reclaim is necessary for almost all allocations.
> > >
> > > GFP_NOWAIT won't do any reclaim by itself, but it'll wake kswapd.
> > > Kswapd maintains a small pool of free pages so that even allocations
> > > that are allowed to enter reclaim usually don't have to. It would be
> > > safe for RCU to dip into that.
> > >
> > > However, there are some cons to using it:
> > >
> > > - Depending on kfree_rcu() burst size, this pool could be exhausted (it's
> > > usually about half a percent of memory, but is affected by sysctls),
> > > and then NOWAIT allocations would fail until kswapd has caught up.
> > >
> > > - This pool is shared by all GFP_NOWAIT users, and many (most? all?)
> > > of them cannot actually sleep. Often they would have to drop locks,
> > > restart list iterations, or suffer some other form of deterioration to
> > > work around failing allocations.
> > >
> > > Since rcu wouldn't have anything better to do than sleep at this
> > > juncture, it may as well join the reclaim effort.
> > >
> > > Using __GFP_NORETRY or __GFP_RETRY_MAYFAIL would allow them that
> > > without exerting too much pressure on the VM.
> >
> > Thank you for looking this over and for the feedback!
> >
> > Good point on the sleeping. My assumption has been that sleeping waiting
> > for a grace period was highly likely to allow memory to eventually be
> > freed, and that there is a point of diminishing returns beyond which
> > adding additional tasks to the reclaim effort does not help much.
>
> There is when the VM is struggling, but not necessarily when there is
> simply a high, concurrent rate of short-lived file cache allocations.
>
> Kswapd can easily reclaim gigabytes of clean page cache each second,
> but there might be enough allocation concurrency from other threads to
> starve a kfree_rcu() that only makes a very cursory attempt at getting
> memory out of being able to snap up some of those returns.
>
> In that scenario it makes sense to be a bit more persistent, or even
> help scale out the concurrency of reclaim to that of allocations.
>
> > Here are some strategies right offhand when sleeping is required:
> >
> > 1. Always sleep in synchronize_rcu() in order to (with high
> > probability) free the memory. This might mean that the reclaim
> > effort goes slower than would be good.
> >
> > 2. Always sleep in the memory allocator in order to help reclaim
> > along. (This is a strawman version of what I expect your
> > proposal really is, but putting it here for completeness, please
> > see below.)
> >
> > 3. Always sleep in the memory allocator in order to help reclaim
> > along, but return failure at some point. Then the caller
> > invokes synchronize_rcu(). When to return failure?
> >
> > o After some substantial but limited amount of effort has
> > been spent on reclaim.
> >
> > o When it becomes likely that further reclaim effort
> > is not going to free up additional memory.
> >
> > I am guessing that you are thinking in terms of specifying GFP flags to
> > result in some variant of #3.
>
> Yes, although I would add
>
> o After making more than one attempt at the freelist to
> prevent merely losing races when the allocator/reclaim
> subsystem is mobbed by a high concurrency of requests.
>
> __GFP_NORETRY (despite its name) accomplishes this.
>
> __GFP_RETRY_MAYFAIL is yet more persistent, but may retry for quite a
> while if the allocation keeps losing the race for a page. This
> increases the chance of the allocation eventually succeeding, but also
> the risk of 1) trying to get memory for longer than a
> synchronize_rcu() might have taken and 2) exerting more temporary
> memory pressure on the workload* than might be productive.
>
> So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> further decisions based on instrumentation of the success rates of
> these opportunistic allocations.
>
> * Reclaim and OOM handling will be fine since no reserves are tapped
Thank you for the explanation! Makes sense to me!!!
Joel, Vlad, does this seem reasonable?
Thanx, Paul
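
For readers following the thread, the approach Johannes describes (make more than one attempt, but a bounded number; return failure to the caller; instrument the success rate) can be sketched in userspace C. This is an illustrative simulation only, not kernel code and not the patch under discussion. alloc_opportunistic(), flaky_alloc(), struct alloc_stats and success_rate_percent() are hypothetical names, and the attempt count merely stands in for the retry behavior that the GFP flags select inside the page allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counters for instrumenting opportunistic allocations. */
struct alloc_stats {
	unsigned long fast_ok;   /* allocation succeeded within the attempt budget */
	unsigned long fallback;  /* budget exhausted, caller must take the slow path */
};

/* Allocator callback, so that failures can be injected for demonstration. */
typedef void *(*try_alloc_fn)(size_t size, void *ctx);

/*
 * Try the allocator at most max_attempts times, then give up and return
 * NULL. More than one attempt avoids failing merely because a race was
 * lost; the bound keeps the effort limited.
 */
static void *alloc_opportunistic(try_alloc_fn try_alloc, void *ctx,
				 size_t size, unsigned int max_attempts,
				 struct alloc_stats *stats)
{
	for (unsigned int i = 0; i < max_attempts; i++) {
		void *p = try_alloc(size, ctx);

		if (p) {
			stats->fast_ok++;
			return p;
		}
	}
	stats->fallback++;
	return NULL;
}

/* Integer percentage of requests served without falling back. */
static unsigned long success_rate_percent(const struct alloc_stats *stats)
{
	unsigned long total = stats->fast_ok + stats->fallback;

	return total ? stats->fast_ok * 100 / total : 0;
}

/*
 * Demo allocator: ctx points to a count of failures still to inject.
 * It returns NULL until that count reaches zero, then uses malloc().
 */
static void *flaky_alloc(size_t size, void *ctx)
{
	unsigned int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return NULL;
	}
	return malloc(size);
}
```

A caller that gets NULL back would take the slow path, which in this thread means synchronize_rcu() followed by a direct free. The counters are the kind of data one would want before deciding whether a more persistent setting is worth its cost.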