linux-kernel - Re: [RFC PATCH] jbd2: avoid __GFP_ZERO with SLAB_TYPESAFE_BY

Message-ID: <20220210054357.GH4285@paulmck-ThinkPad-P17-Gen-1>
Date:   Wed, 9 Feb 2022 21:43:57 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Theodore Ts'o <tytso@....edu>
Cc:     Jan Kara <jack@...e.cz>, Qian Cai <quic_qiancai@...cinc.com>,
        Jan Kara <jack@...e.com>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Boqun Feng <boqun.feng@...il.com>, linux-ext4@...r.kernel.org,
        rcu@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] jbd2: avoid __GFP_ZERO with SLAB_TYPESAFE_BY_RCU

On Thu, Feb 10, 2022 at 12:07:33AM -0500, Theodore Ts'o wrote:
> On Wed, Feb 09, 2022 at 12:11:37PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 09, 2022 at 07:10:10PM +0100, Jan Kara wrote:
> > > 
> > > No, the performance impact of this would be just horrible. Can you
> > > elaborate a bit why SLAB_TYPESAFE_BY_RCU + __GFP_ZERO is a problem and why
> > > synchronize_rcu() would be needed here before the memset() please? I mean
> > > how is zeroing here any different from the memory just being used?
> > 
> > Suppose a reader picks up a pointer to a memory block, then that memory
> > is freed.  No problem, given that this is a SLAB_TYPESAFE_BY_RCU slab,
> > so the memory won't be freed while the reader is accessing it.  But while
> > the reader is in the process of validating the block, it is zeroed.
> > 
> > How does the validation step handle this in all cases?
> > 
> > If you have a way of handling this, I will of course drop the patch.
> > And learn something new, which is always a good thing.  ;-)
> 
> I must be missing something.  The change is on the allocation path,
> and why would kmem_cache_[z]alloc() return a memory chunk which could
> still be in use by a reader?  Shouldn't the allocator _not_ return a
> particular chunk until it is sure there aren't any readers left that
> would be discombobulated by that memory being used for some new use
> case?

From the allocator's viewpoint, yes, but ...

> Otherwise we would have to add synchronize_rcu(); after every single
> kmem_cache allocation which might be using RCU, and that would be
> terrible, no?

... if ext4 is not freeing memory blocks that might still be referenced
by RCU readers, then the SLAB_TYPESAFE_BY_RCU should be removed.
This "might still be referenced" is from the viewpoint of the code using
the allocator, not from that of the allocator itself.

So the typical RCU approach (not involving SLAB_TYPESAFE_BY_RCU)
is to take the grace period at the time of the free.  This can be
done synchronously using synchronize_rcu(), but is often instead done
asynchronously using call_rcu() or kfree_rcu().  So in this case,
you don't need synchronize_rcu() on allocation because the required
grace period already happened at *free() time.
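
To make the "grace period at free time" idea concrete, here is a toy,
single-threaded userspace model.  It is only a sketch: every toy_* name
is hypothetical, nothing here is kernel API, and the "grace period" is
modeled conservatively as "no reader is inside a read-side section at
all", which is stricter than what real RCU waits for.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: number of readers currently inside a read-side section. */
static int active_readers;

/* One queued "free this later" request. */
struct deferred {
	struct deferred *next;
	void *ptr;
};

static struct deferred *pending;	/* blocks waiting for the grace period */
static int freed_count;			/* how many blocks were really freed */

static void toy_read_lock(void)
{
	active_readers++;
}

/* Free everything that was queued; called once no reader remains. */
static void toy_run_callbacks(void)
{
	while (pending) {
		struct deferred *d = pending;

		pending = d->next;
		free(d->ptr);
		free(d);
		freed_count++;
	}
}

static void toy_read_unlock(void)
{
	assert(active_readers > 0);
	if (--active_readers == 0)
		toy_run_callbacks();	/* toy "grace period" has elapsed */
}

/*
 * Asynchronous free, similar in spirit to call_rcu()/kfree_rcu():
 * the block is queued now and freed only after readers are gone.
 * Returns 0 on success, -1 if the request could not be queued.
 */
static int toy_free_deferred(void *ptr)
{
	struct deferred *d = malloc(sizeof(*d));

	if (!d)
		return -1;
	d->ptr = ptr;
	d->next = pending;
	pending = d;
	if (active_readers == 0)
		toy_run_callbacks();
	return 0;
}
```

A reader that calls toy_read_lock() before the free can keep using the
block until its toy_read_unlock(), which is why nothing extra is needed
on the allocation side in this scheme.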

But there are a few situations where it makes sense to free blocks that
readers might still be referencing.  Readers must then add validity
checks to detect this case, and must also prevent freeing while they
are using the block, for example, by means of a per-block spinlock.
A reader might acquire the spinlock in the block to prevent changes,
recheck the lookup key, and if the key does not match, release the lock
and pretend not to have found the block.  If the key does match, anything
attempting to delete and free the block will be spinning on that same
spinlock.
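
The lock-then-recheck pattern just described can be sketched in
userspace C as follows.  This is a minimal model under stated
assumptions, not jbd2 or slab code: struct block, block_read_validated()
and block_recycle() are hypothetical names, and a C11 atomic_flag stands
in for the per-block spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block; its memory may be reused for another key. */
struct block {
	atomic_flag lock;	/* stand-in for a per-block spinlock */
	int key;		/* lookup key, changes when the block is reused */
	int value;
};

static void block_lock(struct block *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock,
						 memory_order_acquire))
		;	/* spin until the current holder releases the lock */
}

static void block_unlock(struct block *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/*
 * Reader side.  'b' came from a lockless lookup for 'key' and might have
 * been freed and reused since then.  Take the lock, recheck the key, and
 * report "not found" on a mismatch.
 */
static bool block_read_validated(struct block *b, int key, int *value)
{
	bool ok;

	assert(b != NULL && value != NULL);
	block_lock(b);
	ok = (b->key == key);
	if (ok)
		*value = b->value;
	block_unlock(b);
	return ok;
}

/*
 * Updater side: models "delete, free, and reuse for a new key".  It takes
 * the same lock, so it cannot change the block in the middle of a
 * validated read, and it leaves the lock itself untouched.
 */
static void block_recycle(struct block *b, int new_key, int new_value)
{
	assert(b != NULL);
	block_lock(b);
	b->key = new_key;
	b->value = new_value;
	block_unlock(b);
}
```

After block_recycle(&b, 2, 20), a reader that originally looked up key 1
gets false from block_read_validated(&b, 1, &v) and treats the block as
not found, while a lookup for key 2 succeeds.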

And so if you specify SLAB_TYPESAFE_BY_RCU, the slab allocator is
guaranteeing type safety to RCU readers instead of the usual existence
guarantee.  A memory block might be freed out from under an RCU reader,
but its type will remain the same.  This means that the grace period
happens internally to the slab allocator when a slab is returned to
the system.

So either the validation checks are quite novel, the kmem_cache_zalloc()
calls should be replaced by kmem_cache_alloc() plus validation checks,
or the SLAB_TYPESAFE_BY_RCU should be removed.
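
As an illustration of one way that zeroing on allocation could defeat
such validation checks, here is a small single-threaded userspace model.
It is a sketch under assumptions, not a description of what jbd2 does:
struct blk and the blk_* helpers are hypothetical, and an atomic_int
stands in for a lock embedded in the block.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical block with an embedded lock word: 0 = free, 1 = held. */
struct blk {
	atomic_int locked;
	int key;
};

/* Try to take the lock; returns true only if it was free. */
static bool blk_trylock(struct blk *b)
{
	assert(b != NULL);
	return atomic_exchange(&b->locked, 1) == 0;
}

static void blk_unlock(struct blk *b)
{
	assert(b != NULL);
	atomic_store(&b->locked, 0);
}

/*
 * Deliberately unsafe reuse: wipes the whole block, lock word included,
 * the way a zeroing allocation would, with no regard for any reader
 * that still holds the lock.
 */
static void blk_reuse_zeroing(struct blk *b, int new_key)
{
	assert(b != NULL);
	memset(b, 0, sizeof(*b));
	b->key = new_key;
}

/*
 * Reuse that respects readers: the lock word is never rewritten, and the
 * key changes only while the lock is held.  Returns false if a reader
 * currently holds the lock.
 */
static bool blk_reuse_preserving(struct blk *b, int new_key)
{
	if (!blk_trylock(b))
		return false;
	b->key = new_key;
	blk_unlock(b);
	return true;
}
```

With a reader holding the lock, blk_reuse_preserving() backs off, but
blk_reuse_zeroing() silently clears the lock word, so a second
blk_trylock() succeeds while the first holder still believes it owns
the block.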

Just out of curiosity, what is your mental model of SLAB_TYPESAFE_BY_RCU?

And yes, I did just up the visibility of this topic in my upcoming
presentation...

							Thanx, Paul

> > > > ---
> > > >  fs/jbd2/journal.c | 9 ++++++---
> > > >  1 file changed, 6 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > > > index c2cf74b01ddb..323112de5921 100644
> > > > --- a/fs/jbd2/journal.c
> > > > +++ b/fs/jbd2/journal.c
> > > > @@ -2861,15 +2861,18 @@ static struct journal_head *journal_alloc_journal_head(void)
> > > >  #ifdef CONFIG_JBD2_DEBUG
> > > >  	atomic_inc(&nr_journal_heads);
> > > >  #endif
> > > > -	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
> > > > +	ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS);
> > > >  	if (!ret) {
> > > >  		jbd_debug(1, "out of memory for journal_head\n");
> > > >  		pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
> > > > -		ret = kmem_cache_zalloc(jbd2_journal_head_cache,
> > > > +		ret = kmem_cache_alloc(jbd2_journal_head_cache,
> > > >  				GFP_NOFS | __GFP_NOFAIL);
> > > >  	}
> > > > -	if (ret)
> > > > +	if (ret) {
> > > > +		synchronize_rcu();
> > > > +		memset(ret, 0, sizeof(*ret));
> > > >  		spin_lock_init(&ret->b_state_lock);
> > > > +	}
> > > >  	return ret;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.30.2
> > > > 
> > > -- 
> > > Jan Kara <jack@...e.com>
> > > SUSE Labs, CR