linux-kernel - Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200423195955.GA476@pc636>
Date:   Thu, 23 Apr 2020 21:59:55 +0200
From:   Uladzislau Rezki <urezki@...il.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>
Cc:     Uladzislau Rezki <urezki@...il.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        linux-kernel@...r.kernel.org,
        Josh Triplett <josh@...htriplett.org>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        rcu@...r.kernel.org, Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder
 for array allocation

> > > > > 
> > > > > Thank you for looking this over and for the feedback!
> > > > > 
> > > > > Good point on the sleeping.  My assumption has been that sleeping waiting
> > > > > for a grace period was highly likely to allow memory to eventually be
> > > > > freed, and that there is a point of diminishing returns beyond which
> > > > > adding additional tasks to the reclaim effort does not help much.
> > > > 
> > > > There is when the VM is struggling, but not necessarily when there is
> > > > simply a high, concurrent rate of short-lived file cache allocations.
> > > > 
> > > > Kswapd can easily reclaim gigabytes of clean page cache each second,
> > > > but there might be enough allocation concurrency from other threads to
> > > > starve a kfree_rcu() that only makes a very cursory attempt at getting
> > > > memory out of being able to snap up some of those returns.
> > > > 
> > > > In that scenario it makes sense to be a bit more persistent, or even
> > > > help scale out the concurrency of reclaim to that of allocations.
> > > > 
> > > > > Here are some strategies right offhand when sleeping is required:
> > > > > 
> > > > > 1.	Always sleep in synchronize_rcu() in order to (with high
> > > > > 	probability) free the memory.  This might mean that the reclaim
> > > > > 	effort goes slower than would be good.
> > > > > 
> > > > > 2.	Always sleep in the memory allocator in order to help reclaim
> > > > > 	along.	(This is a strawman version of what I expect your
> > > > > 	proposal really is, but putting it here for completeness, please
> > > > > 	see below.)
> > > > > 
> > > > > 3.	Always sleep in the memory allocator in order to help reclaim
> > > > > 	along, but return failure at some point.  Then the caller
> > > > > 	invokes synchronize_rcu().  When to return failure?
> > > > > 
> > > > > 	o	After some substantial but limited amount of effort has
> > > > > 		been spent on reclaim.
> > > > > 
> > > > > 	o	When it becomes likely that further reclaim effort
> > > > > 		is not going to free up additional memory.
> > > > > 
> > > > > I am guessing that you are thinking in terms of specifying GFP flags to
> > > > > result in some variant of #3.
> > > > 
> > > > Yes, although I would add
> > > > 
> > > > 	o	After making more than one attempt at the freelist to
> > > > 		prevent merely losing races when the allocator/reclaim
> > > > 		subsystem is mobbed by a high concurrency of requests.
> > > > 
> > > > __GFP_NORETRY (despite its name) accomplishes this.
> > > > 
> > > > __GFP_RETRY_MAYFAIL is yet more persistent, but may retry for quite a
> > > > while if the allocation keeps losing the race for a page. This
> > > > increases the chance of the allocation eventually suceeding, but also
> > > > the risk of 1) trying to get memory for longer than a
> > > > synchronize_rcu() might have taken and 2) exerting more temporary
> > > > memory pressure on the workload* than might be productive.
> > > > 
> > > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > > further decisions based on instrumentation of the success rates of
> > > > these opportunistic allocations.
> > > > 
> > > > * Reclaim and OOM handling will be fine since no reserves are tapped
> > > 
> > > Thank you for the explanation!  Makes sense to me!!!
> > > 
> > > Joel, Vlad, does this seem reasonable?
> > >
> > To me that makes sense. I think such strategy does fit to what we do,
> > i mean we need to release memory asap. Doing it without initiating of
> > long process of memory reclaim and do it only lightly(what __GFP_NORETRY does)
> > is a good approach. We have an option to fallback to synchronize_rcu().
> > 
> > But that is for sleepable context.
> > 
> > I have a question about non-sleeping context as well and how we allocate one
> > page:
> > 
> > <snip>
> >    bnode = (struct kvfree_rcu_bulk_data *)
> >        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > 
> > <snip>
> > 
> > Johannes, i saw you mentioned earlier that waking up a kswapd is not a
> > good idea, what actually GFP_NOWAIT does. Do you recommend to exclude
> > it? Also to replace by what? __GFP_HIGH|__GFP_ATOMIC?
> 
> This is best-effort, correct?
> 
a) Single argument(headless)
In this case we can make use an allocator with sleepable flags,
because we document that headleass variant must follow might_sleep()
annotation. For example __GFP_NORETRY | __GFP_NOWARN. __GFP_NORETRY
can do some light direct reclaim, thus the caller can call schedule().
To do such allocation we just drop our local spinlock.

If an allocation gets failed, we directly fall into synchronize_rcu()
i.e. inline freeing.

I also call it sleepable case, that is (a).

b) Double argument(with rcu_head)
This case we consider as it gets called from atomic context even though
it can be not. Why we consider such case as atomic: we just assume that.
The reason is to keep it simple, because it is not possible to detect whether
a current context is attomic or not(for all type of kernels), i mean the one
that calls kfree_rcu().

In this case we do not have synchronize_rcu() option. Instead we have an
object with rcu_head inside. If an allocation gets failed we just make
use of rcu_head inside the object, regular queuing.

In this case we do not need to hard in order to obtain memory. Therefore
my question was to Johannes what is best way here. Since we decided to
minimize reclaiming, whereas GFP_NOWAIT wakes up kswapd if no memory.
GFP_ATOMIC also is not good, because for (b) we do not need to waste
it.

>
> Upon memory-allocation failure, the single-argument kfree_rcu() can leak
> the memory (it has presumably already splatted) and the double-argument
> kfree_rcu() can make use of the rcu_head structure that was provided.
>
For single argument we inline freeing into current context after
synchronize_rcu() because it follows might_sleep() annotation.

Sorry for long email, i hope i covered everything :)

--
Vlad Rezki