Message-ID: <20200930015327.GX29330@paulmck-ThinkPad-P72>
Date:   Tue, 29 Sep 2020 18:53:27 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Michal Hocko <mhocko@...e.com>
Cc:     rcu@...r.kernel.org, linux-kernel@...r.kernel.org,
        kernel-team@...com, mingo@...nel.org, jiangshanlai@...il.com,
        akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
        josh@...htriplett.org, tglx@...utronix.de, peterz@...radead.org,
        rostedt@...dmis.org, dhowells@...hat.com, edumazet@...gle.com,
        fweisbec@...il.com, oleg@...hat.com, joel@...lfernandes.org,
        mgorman@...hsingularity.net, torvalds@...ux-foundation.org,
        "Uladzislau Rezki (Sony)" <urezki@...il.com>
Subject: Re: [PATCH tip/core/rcu 14/15] rcu/tree: Allocate a page when caller
 is preemptible

On Tue, Sep 29, 2020 at 02:07:56PM +0200, Michal Hocko wrote:
> On Mon 28-09-20 16:31:01, paulmck@...nel.org wrote:
> [...]

Apologies for the delay, but today has not been boring.

> > This commit therefore uses preemptible() to determine whether allocation
> > is possible at all for double-argument kvfree_rcu().
> 
> This deserves a comment, because GFP_ATOMIC is possible for many
> !preemptible() contexts. It is the raw_spin_lock, NMIs, and likely a few
> others that are a problem. You are taking a conservative approach, which
> is fine, but it would be good to articulate that explicitly.
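
To make the quoted point concrete, the problematic pattern looks roughly
like the sketch below; the lock and function names are hypothetical and
purely for illustration, not code from the patch:

/*
 * Illustration only: under a raw spinlock, even a GFP_ATOMIC allocation
 * may acquire non-raw spinlocks inside the allocator, which
 * CONFIG_PROVE_RAW_LOCK_NESTING reports as "BUG: Invalid wait context"
 * and which become sleeping locks on CONFIG_PREEMPT_RT.
 */
static DEFINE_RAW_SPINLOCK(example_raw_lock);	/* hypothetical lock */

static void example_bad_pattern(void)
{
	unsigned long flags;
	void *p;

	raw_spin_lock_irqsave(&example_raw_lock, flags);
	p = kmalloc(16, GFP_ATOMIC);	/* still problematic in this context */
	raw_spin_unlock_irqrestore(&example_raw_lock, flags);
	kfree(p);
}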

Good point, and so I have added the following as a header comment to
the add_ptr_to_bulk_krc_lock() function:

// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
// state specified by flags.  If can_sleep is true, the caller must
// be schedulable and not be holding any locks or mutexes that might be
// acquired by the memory allocator or anything that it might invoke.
// If !can_sleep, then if !preemptible() no allocation will be undertaken,
// otherwise the allocation will use GFP_ATOMIC to avoid the remainder of
// the aforementioned deadlock possibilities.  Returns true iff ptr was
// successfully recorded, else the caller must use a fallback.
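
In the patch below, that comment maps onto the following lines of
add_ptr_to_bulk_krc_lock(), excerpted (and lightly reflowed) here for
convenience:

	bool can_alloc_page = preemptible();
	gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL
			       : GFP_ATOMIC) | __GFP_NOWARN;
	/* ... */
	bnode = get_cached_bnode(*krcp);
	if (!bnode && can_alloc_page) {
		/* Drop the raw per-CPU lock across the allocation... */
		krc_this_cpu_unlock(*krcp, *flags);
		bnode = kmalloc(PAGE_SIZE, gfp);
		/* ...and pick it back up before touching krcp state. */
		*krcp = krc_this_cpu_lock(flags);
	}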

The updated patch is shown below.

Of course, to Vlastimil's point, lockless access to the allocator
would significantly reduce the level of confusion posed by this code.
But that is a separate issue at the moment.

> > If !preemptible(),
> > then allocation is not possible, and kvfree_rcu() falls back to using
> > the less cache-friendly rcu_head approach.  Even when preemptible(),
> > the caller might be involved in reclaim, so the GFP_ flags used by
> > double-argument kvfree_rcu() must avoid invoking reclaim processing.
> 
> Could you be more specific? Is this about being called directly in the
> reclaim context and you want to prevent a recursion? If that is the
> case, do you really need to special case this in any way? Any memory
> reclaim will set PF_MEMALLOC so allocations called from that context
> will not perform reclaim. So if you are called from the reclaim directly
> then you might want to do GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN.
> That should handle both from-the-reclaim and outside-of-reclaim contexts
> just fine (assuming you don't allocate from a !preemptible() context).
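
Spelled out, the suggestion above would look something like the following
sketch; this is Michal's proposed flag combination, not what the patch
below ends up using:

	/* Sketch of the suggested flags, not the patch's actual gfp mask: */
	gfp_t gfp = GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN;

	/*
	 * Inside reclaim, PF_MEMALLOC is already set, so this allocation
	 * will not recurse into reclaim; __GFP_NOMEMALLOC keeps it away
	 * from the emergency reserves and __GFP_NOWARN suppresses the
	 * allocation-failure splat.
	 */
	bnode = kmalloc(PAGE_SIZE, gfp);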

My thought was to point you at Daniel Vetter, but you already got in
touch with him, so thank you for that!

> > Note that single-argument kvfree_rcu() must be invoked in sleepable
> > contexts, and that its fallback is the relatively high latency
> > synchronize_rcu().  Single-argument kvfree_rcu() therefore uses
> > GFP_KERNEL|__GFP_RETRY_MAYFAIL to allow limited sleeping within the
> > memory allocator.
> [...]
> >  static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > +	unsigned long *flags, void *ptr, bool can_sleep)
> >  {
> >  	struct kvfree_rcu_bulk_data *bnode;
> > +	bool can_alloc_page = preemptible();
> > +	gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL : GFP_ATOMIC) | __GFP_NOWARN;
> 
> This is quite confusing IMHO, at least without a further explanation.
> can_sleep is not as much about sleeping as it is about the reclaim
> recursion AFAIU your changelog, right?

No argument on it being confusing, and I hope that the added header
comment helps.  But specifically, can_sleep==true is a promise by the
caller to be schedulable and not to be holding any lock/mutex/whatever
that might possibly be acquired by the memory allocator or by anything
else that the memory allocator might invoke, including, to your point,
the reclaim logic as but one example.

The only way that can_sleep==true is if this function was invoked due
to a call to single-argument kvfree_rcu(), which must be schedulable
because its fallback is to invoke synchronize_rcu().
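
For readers following the thread, the two forms look like this; the
structure and pointer here are hypothetical, only the last line is taken
from the patch below:

	/* Hypothetical example structure and pointer, not from the patch: */
	struct foo {
		struct rcu_head rh;	/* fallback storage for the 2-arg form */
		int data;
	};
	struct foo *obj;

	/* Double-argument form: usable from atomic context; "rh" names the
	 * rcu_head field embedded in *obj, used if no page is available. */
	kvfree_rcu(obj, rh);

	/* Single-argument (head-less) form: the caller must be schedulable,
	 * because the fallback is synchronize_rcu() plus an inline kvfree(). */
	kvfree_rcu(obj);

	/* In kvfree_call_rcu() in the patch below, that distinction becomes
	 * the can_sleep argument: head is NULL only for the head-less form. */
	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);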

> >  	int idx;
> >  
> > -	if (unlikely(!krcp->initialized))
> > +	*krcp = krc_this_cpu_lock(flags);
> > +	if (unlikely(!(*krcp)->initialized))
> >  		return false;
> >  
> > -	lockdep_assert_held(&krcp->lock);
> >  	idx = !!is_vmalloc_addr(ptr);
> >  
> >  	/* Check if a new block is required. */
> > -	if (!krcp->bkvhead[idx] ||
> > -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > -		bnode = get_cached_bnode(krcp);
> > -		if (!bnode) {
> > -			/*
> > -			 * To keep this path working on raw non-preemptible
> > -			 * sections, prevent the optional entry into the
> > -			 * allocator as it uses sleeping locks. In fact, even
> > -			 * if the caller of kfree_rcu() is preemptible, this
> > -			 * path still is not, as krcp->lock is a raw spinlock.
> > -			 * With additional page pre-allocation in the works,
> > -			 * hitting this return is going to be much less likely.
> > -			 */
> > -			if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > -				return false;
> > -
> > -			/*
> > -			 * NOTE: For one argument of kvfree_rcu() we can
> > -			 * drop the lock and get the page in sleepable
> > -			 * context. That would allow to maintain an array
> > -			 * for the CONFIG_PREEMPT_RT as well if no cached
> > -			 * pages are available.
> > -			 */
> > -			bnode = (struct kvfree_rcu_bulk_data *)
> > -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > +	if (!(*krcp)->bkvhead[idx] ||
> > +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > +		bnode = get_cached_bnode(*krcp);
> > +		if (!bnode && can_alloc_page) {
> > +			krc_this_cpu_unlock(*krcp, *flags);
> > +			bnode = kmalloc(PAGE_SIZE, gfp);
> 
> What is the point of calling kmalloc for a PAGE_SIZE object? Wouldn't
> using the page allocator directly be better?

Well, you guys gave me considerable heat about abusing internal allocator
interfaces, and kmalloc() and kfree() seem to be about as non-internal
as you can get and still be invoking the allocator.  ;-)
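
For comparison, the two spellings in question, the first being what the
patch below uses and the second being what the old code used:

	/* Patch below: slab front end (typically the kmalloc-4k cache). */
	bnode = kmalloc(PAGE_SIZE, gfp);
	/* ...later, when the block is drained: */
	kfree(bnode);

	/* Old code: page allocator directly. */
	bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
	/* ...later: */
	free_page((unsigned long) bnode);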

Again, please see below for the updated patch.  The only change from
the earlier version is the addition of the header comment mentioned above.

							Thanx, Paul

------------------------------------------------------------------------

commit 3a8e10ea2a2ecf59a33d7d80ad6c4372c0c27f78
Author: Uladzislau Rezki (Sony) <urezki@...il.com>
Date:   Tue Sep 22 21:06:22 2020 +0200

    rcu/tree: Allocate a page when caller is preemptible
    
    The current memory-allocation interface poses the following challenges:
    
    a)      In kernels built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep
            complains ("BUG: Invalid wait context").  This complaint is due
            to the memory allocator acquiring non-raw spinlocks while a raw
            spinlock is held.  This problem can also arise if kvfree_rcu()
            is invoked while holding a raw spinlock.
    
    b)      In -rt kernels built with CONFIG_PREEMPT_RT, the situation
            described in (a) above results in an attempt to acquire a
            sleeplock while holding a spinlock, which is of course forbidden.
            This can lead to "BUG: scheduling while atomic".
    
    c)      Please note that call_rcu() is invoked from raw atomic context,
            so kfree_rcu() and kvfree_rcu() are also expected to be
            callable from raw atomic context.
    
    However given that CONFIG_PREEMPT_COUNT is unconditionally enabled
    by the earlier commits in this series, the preemptible() macro now
    properly detects preempt-disable code regions even in kernels built
    with CONFIG_PREEMPT_NONE.
    
    This commit therefore uses preemptible() to determine whether allocation
    is possible at all for double-argument kvfree_rcu().  If !preemptible(),
    then allocation is not possible, and kvfree_rcu() falls back to using
    the less cache-friendly rcu_head approach.  Even when preemptible(),
    the caller might be involved in reclaim, so the GFP_ flags used by
    double-argument kvfree_rcu() must avoid invoking reclaim processing.
    
    Note that single-argument kvfree_rcu() must be invoked in sleepable
    contexts, and that its fallback is the relatively high latency
    synchronize_rcu().  Single-argument kvfree_rcu() therefore uses
    GFP_KERNEL|__GFP_RETRY_MAYFAIL to allow limited sleeping within the
    memory allocator.
    
    Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
    Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
    Reported-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
    [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@...il.com>
    Signed-off-by: Paul E. McKenney <paulmck@...nel.org>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8ce77d9..39ac930 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3166,7 +3166,7 @@ static void kfree_rcu_work(struct work_struct *work)
 			krc_this_cpu_unlock(krcp, flags);
 
 			if (bkvhead[i])
-				free_page((unsigned long) bkvhead[i]);
+				kfree(bkvhead[i]);
 
 			cond_resched_tasks_rcu_qs();
 		}
@@ -3290,44 +3290,37 @@ static void kfree_rcu_monitor(struct work_struct *work)
 		raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
 
+// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
+// state specified by flags.  If can_sleep is true, the caller must
+// be schedulable and not be holding any locks or mutexes that might be
+// acquired by the memory allocator or anything that it might invoke.
+// If !can_sleep, then if !preemptible() no allocation will be undertaken,
+// otherwise the allocation will use GFP_ATOMIC to avoid the remainder of
+// the aforementioned deadlock possibilities.  Returns true iff ptr was
+// successfully recorded, else the caller must use a fallback.
 static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
+	unsigned long *flags, void *ptr, bool can_sleep)
 {
 	struct kvfree_rcu_bulk_data *bnode;
+	bool can_alloc_page = preemptible();
+	gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL : GFP_ATOMIC) | __GFP_NOWARN;
 	int idx;
 
-	if (unlikely(!krcp->initialized))
+	*krcp = krc_this_cpu_lock(flags);
+	if (unlikely(!(*krcp)->initialized))
 		return false;
 
-	lockdep_assert_held(&krcp->lock);
 	idx = !!is_vmalloc_addr(ptr);
 
 	/* Check if a new block is required. */
-	if (!krcp->bkvhead[idx] ||
-			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
-		bnode = get_cached_bnode(krcp);
-		if (!bnode) {
-			/*
-			 * To keep this path working on raw non-preemptible
-			 * sections, prevent the optional entry into the
-			 * allocator as it uses sleeping locks. In fact, even
-			 * if the caller of kfree_rcu() is preemptible, this
-			 * path still is not, as krcp->lock is a raw spinlock.
-			 * With additional page pre-allocation in the works,
-			 * hitting this return is going to be much less likely.
-			 */
-			if (IS_ENABLED(CONFIG_PREEMPT_RT))
-				return false;
-
-			/*
-			 * NOTE: For one argument of kvfree_rcu() we can
-			 * drop the lock and get the page in sleepable
-			 * context. That would allow to maintain an array
-			 * for the CONFIG_PREEMPT_RT as well if no cached
-			 * pages are available.
-			 */
-			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+	if (!(*krcp)->bkvhead[idx] ||
+			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+		bnode = get_cached_bnode(*krcp);
+		if (!bnode && can_alloc_page) {
+			krc_this_cpu_unlock(*krcp, *flags);
+			bnode = kmalloc(PAGE_SIZE, gfp);
+			*krcp = krc_this_cpu_lock(flags);
 		}
 
 		/* Switch to emergency path. */
@@ -3336,15 +3329,15 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 
 		/* Initialize the new block. */
 		bnode->nr_records = 0;
-		bnode->next = krcp->bkvhead[idx];
+		bnode->next = (*krcp)->bkvhead[idx];
 
 		/* Attach it to the head. */
-		krcp->bkvhead[idx] = bnode;
+		(*krcp)->bkvhead[idx] = bnode;
 	}
 
 	/* Finally insert. */
-	krcp->bkvhead[idx]->records
-		[krcp->bkvhead[idx]->nr_records++] = ptr;
+	(*krcp)->bkvhead[idx]->records
+		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
 
 	return true;
 }
@@ -3382,24 +3375,20 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		ptr = (unsigned long *) func;
 	}
 
-	krcp = krc_this_cpu_lock(&flags);
-
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
 		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
 			  __func__, head);
 
-		// Mark as success and leave.
-		success = true;
-		goto unlock_return;
+		return;
 	}
 
 	/*
 	 * Under high memory pressure GFP_NOWAIT can fail,
 	 * in that case the emergency path is maintained.
 	 */
-	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
 	if (!success) {
 		if (head == NULL)
 			// Inline if kvfree_rcu(one_arg) call.
@@ -4394,23 +4383,12 @@ static void __init kfree_rcu_batch_init(void)
 
 	for_each_possible_cpu(cpu) {
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
-		struct kvfree_rcu_bulk_data *bnode;
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
 			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
 		}
 
-		for (i = 0; i < rcu_min_cached_objs; i++) {
-			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-
-			if (bnode)
-				put_cached_bnode(krcp, bnode);
-			else
-				pr_err("Failed to preallocate for %d CPU!\n", cpu);
-		}
-
 		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
 		krcp->initialized = true;
 	}
