[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200730231205.GA11265@paulmck-ThinkPad-P72>
Date: Thu, 30 Jul 2020 16:12:05 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: cl@...ux.com, penberg@...nel.org, rientjes@...gle.com,
iamjoonsoo.kim@....com, akpm@...ux-foundation.org
Cc: hannes@...xchg.org, willy@...radead.org, urezki@...il.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Raw spinlocks and memory allocation
Hello!
We have an interesting issue involving interactions between RCU,
memory allocation, and "raw atomic" contexts. The most attractive
solution to this issue requires adding a new GFP_ flag. Perhaps this
is a big ask, but on the other hand, the benefit is a large reduction
in linked-list-induced cache misses when invoking RCU callbacks.
For more details, please read on!
Examples of raw atomic contexts include disabled hardware interrupts
(that is, a hardware irq handler rather than a threaded irq handler),
code holding a raw_spinlock_t, and code with preemption disabled (but
only in cases where -rt cannot safely map it to disabled migration).
It turns out that call_rcu() is already invoked from raw atomic contexts,
and we therefore anticipate that kfree_rcu() will also be at some point.
This matters due to recent work to fix a weakness in both call_rcu()
and kfree_rcu() that was pointed out long ago by Christoph Lameter,
among others. The weakness is that RCU traverses linked callback lists
when invoking those callbacks. Because the just-ended grace period will
have rendered these lists cache-cold, this results in an expensive cache
miss on each and every callback invocation. Uladzislau Rezki (CCed) has
recently produced patches for kfree_rcu() that instead store pointers
to callbacks in arrays, so that callback invocation can step through
the array using the kfree_bulk() interface. This greatly reducing the
number of cache misses. The benefits are not subtle:
https://lore.kernel.org/lkml/20191231122241.5702-1-urezki@gmail.com/
Of course, the arrays have to come from somewhere, and that somewhere
is the memory allocator. Yes, memory allocation can fail, but in that
rare case, kfree_rcu() just falls back to the old approach, taking a
few extra cache misses, but making good (if expensive) forward progress.
This works well until someone invokes kfree_rcu() with a raw spinlock
held. Even that works fine unless the memory allocator has exhausted
its caches, at which point it will acquire a normal spinlock. In kernels
built with CONFIG_PROVE_RAW_LOCK_NESTING=y this will result in a lockdep
splat. Worse yet, in -rt kernels, this can result in scheduling while
atomic.
So, may we add a GFP_ flag that will cause kmalloc() and friends to return
NULL when they would otherwise need to acquire their non-raw spinlock?
This avoids adding any overhead to the slab-allocator fastpaths, but
allows callback invocation to reduce cache misses without having to
restructure some existing callers of call_rcu() and potential future
callers of kfree_rcu().
Thoughts?
Thanx, Paul
PS. Other avenues investigated:
o Just don't invoke kmalloc() when kfree_rcu() is invoked
from raw atomic contexts. The problem with this is that
there is no way to detect raw atomic contexts in production
kernels built with CONFIG_PREEMPT=n. Adding means to detect
this would increase overhead on numerous fastpaths.
o Just say "no" to invoking call_rcu() and kfree_rcu() from
raw atomic contexts. This would require that the affected
call_rcu() and kfree_rcu() invocations be deferred. This is
in theory simple, but can get quite messy, and often requires
fallbacks such as timers that can degrade energy efficiency and
realtime response.
o Provide a different non-allocating API such as kfree_rcu_raw()
and call_rcu_raw() that are used from raw atomic contexts and also
on memory-allocation failure from kfree_rcu() and call_rcu().
This results in unconditional callback-invocation cache misses
for calls from raw contexts, including for common code that is
only occasionally invoked from raw atomic contexts. This approach
also forces developers to worry about two more RCU API members.
o Move the memory allocator's spinlocks to raw_spinlock_t.
This would be bad for realtime response, and would likely require
even more conversions when the allocator invokes other subsystems
that also use non-raw spinlocks.
Powered by blists - more mailing lists