lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <505dba1e-4950-4968-a789-434c52a3e1c7@paulmck-laptop>
Date:   Wed, 13 Dec 2023 10:05:52 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Josef Bacik <josef@...icpanda.com>
Cc:     David Sterba <dsterba@...e.cz>, kernel test robot <lkp@...el.com>,
        llvm@...ts.linux.dev, oe-kbuild-all@...ts.linux.dev,
        linux-kernel@...r.kernel.org, julia.lawall@...ia.fr, clm@...com,
        dsterba@...e.com, baptiste.lepers@...il.com
Subject: Re: [paulmck-rcu:frederic.2023.12.08a 29/37]
 fs/btrfs/transaction.c:496:6: error: call to '__compiletime_assert_329'
 declared with 'error' attribute: Need native word sized stores/loads for
 atomicity.

On Wed, Dec 13, 2023 at 10:54:40AM -0500, Josef Bacik wrote:
> On Wed, Dec 13, 2023 at 06:56:37AM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 13, 2023 at 01:53:58PM +0100, David Sterba wrote:
> > > On Sat, Dec 09, 2023 at 07:51:30AM -0800, Paul E. McKenney wrote:
> > > > On Sat, Dec 09, 2023 at 06:20:37PM +0800, kernel test robot wrote:
> > > > > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git frederic.2023.12.08a
> > > > > head:   37843b5f561a08ae899fb791eeeb5abd992eabe2
> > > > > commit: 7dd87072d40809e26503f04b79d63290288dbbac [29/37] btrfs: Adjust ->last_trans ordering in btrfs_record_root_in_trans()
> > > > > config: riscv-rv32_defconfig (https://download.01.org/0day-ci/archive/20231209/202312091837.cKaPw0Tf-lkp@intel.com/config)
> > > > > compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
> > > > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231209/202312091837.cKaPw0Tf-lkp@intel.com/reproduce)
> > > > > 
> > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > > the same patch/commit), kindly add following tags
> > > > > | Reported-by: kernel test robot <lkp@...el.com>
> > > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202312091837.cKaPw0Tf-lkp@intel.com/
> > > > > 
> > > > > All errors (new ones prefixed by >>):
> > > > > 
> > > > >    warning: unknown warning option '-Wpacked-not-aligned'; did you mean '-Wpacked-non-pod'? [-Wunknown-warning-option]
> > > > >    warning: unknown warning option '-Wstringop-truncation'; did you mean '-Wstring-concatenation'? [-Wunknown-warning-option]
> > > > >    warning: unknown warning option '-Wmaybe-uninitialized'; did you mean '-Wuninitialized'? [-Wunknown-warning-option]
> > > > > >> fs/btrfs/transaction.c:496:6: error: call to '__compiletime_assert_329' declared with 'error' attribute: Need native word sized stores/loads for atomicity.
> > > > >      496 |         if (smp_load_acquire(&root->last_trans) == trans->transid && /* ^^^ */
> > > > >          |             ^
> > > > 
> > > > Ooooh!!!  :-/
> > > > 
> > > > From what I can see, the current code can tear this load on 32-bit
> > > > systems, which can result in bad comparisons and then in failure to wait
> > > > for a partially complete transaction.
> > > > 
> > > > So is btrfs actually supported on 32-bit systems?  If not, would the
> > > > following patch be appropriate?
> > > 
> > > There are limitations on 32bit systems, eg. due to shorter inode numbers
> > > (ino_t is unsigned long) and that radix-tree/xarray does support only
> > > unsigned long keys, while we have 64bit identifiers for inodes or tree
> > > roots.
> > > 
> > > So far we support that and dropping it completely is I think a big deal,
> > > like with any possibly used feature. What I've seen there are NAS boxes
> > > with low power ARM that are 32bit.
> > > 
> > > > If btrfs is to be supported on 32-bit systems, from what I can see some
> > > > major surgery is required, even if a 32-bit counter is wrap-safe for
> > > > this particular type of transaction.  (But SSDs?  In-memory btrfs
> > > > filesystems?)
> > > 
> > > We won't probably do any major surgery to support 32bit systems.
> > 
> > Got it, and thank you for the background!  My takeaway is that 32-bit
> > BTRFS must work in the common case, but might have issues on some
> > workloads, for example, running out of inode numbers or load tearing.
> > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> > > > index 4fb925e8c981..4d56158c34f9 100644
> > > > --- a/fs/btrfs/Kconfig
> > > > +++ b/fs/btrfs/Kconfig
> > > > @@ -19,6 +19,7 @@ config BTRFS_FS
> > > >  	select RAID6_PQ
> > > >  	select XOR_BLOCKS
> > > >  	depends on PAGE_SIZE_LESS_THAN_256KB
> > > > +	depends on 64BIT
> > > 
> > > Can we keep the current inefficient smp_* barriers instead of dropping
> > > the whole 32bit support as an alternative. If the smp_load_acquire are
> > > better but not strictly necessary for the correctness (from the barriers
> > > POV) I'd suggest to leave it as-is. We can put comments in case somebody
> > > wants to optimize it in the future again.
> > 
> > We still have the barrier placement issue, given that smp_rmb() enforces
> > only the ordering of earlier and later loads, correct?  Or am I missing
> > some other ordering constraint that makes all that work?
> > 
> > But I can make each of the current patch's smp_load_acquire() call instead
> > be be a READ_ONCE() followed by an smp_rmb(), the test_bit_acquire()
> > call be test_bit() followed by smp_rmb(), and the smp_store_release()
> > call be an smp_wmb() followed by a WRITE_ONCE().  This provides the needed
> > ordering, bullet-proofs the 64-bit code against compilers, but works on
> > 32-bit systems.  For example, on a 32-bit system the 64-bit READ_ONCE()
> > and WRITE_ONCE() might still be compiled as a pair of 32-bit memory
> > accesses, but they will both be guaranteed to be single memory accesses
> > on 64-bit systems.
> > 
> > Would that work for you guys?
> 
> Actually I think we just need to re-work all of this to make it less silly.
> Does this look reasonable to you Paul?  I still have to test it, but I think it
> addresses your concerns and lets us keep 32bit support ;).  Thanks,

Things that avoid the need for memory barriers are often improvements!

I don't claim to understand enough of the BTRFS code to fully judge
this, but I do ask one stupid question below just in case it inspires
a non-stupid idea.  ;-)

							Thanx, Paul

> Josef
> 
> From: Josef Bacik <josef@...icpanda.com>
> Date: Wed, 13 Dec 2023 10:46:57 -0500
> Subject: [PATCH] btrfs: cleanup how we record roots in the transaction
> 
> Previously we didn't have a bit field for tracking if a root was
> recorded in the transaction, it was simply an integer on the root.
> Additionally we have the ->last_trans which we record the last
> transaction this root was modified in.  We had various memory barriers
> to attempt to synchronize these two updates so we could avoid taking the
> ->reloc_mutex all of the time for transaction starts.
> 
> However this code has changed a lot, but the dubious memory ordering has
> remained.  Fix up this code to make it more clear and get rid of the
> memory barriers.  All we're trying to avoid here is constantly taking
> the ->reloc_mutex, so add a new bit to mark that the root has been
> recorded.  If this bit is set then we can skip the expensive step,
> otherwise we have to take the lock and wait.
> 
> Add a big comment indicating why we're doing the ordering and locking
> the way we are, and use ->fs_roots_radix_lock as the arbiter of this new
> bit.
> 
> This makes the code simpler, removes the memory barriers, and documents
> clearly what we're doing so it's less confusing for future readers.
> 
> Signed-off-by: Josef Bacik <josef@...icpanda.com>
> ---
>  fs/btrfs/ctree.h       | 11 ++++---
>  fs/btrfs/disk-io.c     |  1 +
>  fs/btrfs/transaction.c | 69 +++++++++++++++++-------------------------
>  3 files changed, 34 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 70e828d33177..eb44cf191863 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -90,12 +90,13 @@ struct btrfs_path {
>   */
>  enum {
>  	/*
> -	 * btrfs_record_root_in_trans is a multi-step process, and it can race
> -	 * with the balancing code.   But the race is very small, and only the
> -	 * first time the root is added to each transaction.  So IN_TRANS_SETUP
> -	 * is used to tell us when more checks are required
> +	 * We set this after we've recorded the root in the transaction.  This
> +	 * is used to avoid taking the reloc_mutex every time we call
> +	 * btrfs_start_transaction(root).  Once the root has been recorded we
> +	 * don't have to do the setup work again, and this is cleared on
> +	 * transaction commit once we're done with the root.
>  	 */
> -	BTRFS_ROOT_IN_TRANS_SETUP,
> +	BTRFS_ROOT_RECORDED,
>  
>  	/*
>  	 * Set if tree blocks of this root can be shared by other roots.
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c6907d533fe8..0410fac5f78e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4805,6 +4805,7 @@ static void btrfs_free_all_qgroup_pertrans(struct btrfs_fs_info *fs_info)
>  			radix_tree_tag_clear(&fs_info->fs_roots_radix,
>  					(unsigned long)root->root_key.objectid,
>  					BTRFS_ROOT_TRANS_TAG);
> +			clear_bit(BTRFS_ROOT_RECORDED, &root->state);
>  		}
>  	}
>  	spin_unlock(&fs_info->fs_roots_radix_lock);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5b3333ceef04..186089ddac27 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -407,54 +407,43 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>  	int ret = 0;
>  
>  	if ((test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
> -	    root->last_trans < trans->transid) || force) {
> +	     !test_bit(BTRFS_ROOT_RECORDED, &root->state)) || force) {

Would it make sense to test both bits at once?  Something like:

	if ((READ_ONCE(root->state) & (BIT(BTRFS_ROOT_SHAREABLE) | BIT(BTRFS_ROOT_RECORDED))) == BIT(BTRFS_ROOT_SHAREABLE))

That way you get an atomic test of both bits without the memory-ordering
aggravation.  Though you already hold the mutex here, so this should
not matter in this case, hence the question being stupid.  ;-)

>  		WARN_ON(!force && root->commit_root != root->node);
>  
>  		/*
> -		 * see below for IN_TRANS_SETUP usage rules
> -		 * we have the reloc mutex held now, so there
> -		 * is only one writer in this function
> +		 * We are using ->fs_roots_radix_lock as the ultimate protector
> +		 * of BTRFS_ROOT_RECORDED, which means we want to set it once
> +		 * we're completely done, so we call btrfs_init_reloc_root()
> +		 * first.
> +		 *
> +		 * This is ok because we are protected here by ->reloc_mutex, so
> +		 * in reality we won't ever race and call
> +		 * btrfs_init_reloc_root() twice on the same root back to back.
> +		 * However if we do it's ok, because again we're protected by
> +		 * ->reloc_mutex and the second call will simply update the
> +		 * ->last_trans on the reloc root to the current transid, making
> +		 * it essentially a no-op.
> +		 *
> +		 * We're using ->fs_roots_radix_lock here because
> +		 * btrfs_drop_snapshot will clear the tag so we don't update the
> +		 * root, and we will clear the RECORDED flag there without the
> +		 * ->reloc_mutex lock held.  Again this is fine because we won't
> +		 * be messing with a dropped root anymore, but for concurrency
> +		 * it's best to keep the rules nice and simple.
>  		 */
> -		set_bit(BTRFS_ROOT_IN_TRANS_SETUP, &root->state);
> -
> -		/* make sure readers find IN_TRANS_SETUP before
> -		 * they find our root->last_trans update
> -		 */
> -		smp_wmb();
> +		ret = btrfs_init_reloc_root(trans, root);
>  
>  		spin_lock(&fs_info->fs_roots_radix_lock);
> -		if (root->last_trans == trans->transid && !force) {
> +		if (test_bit(BTRFS_ROOT_RECORDED, &root->state) && !force) {
>  			spin_unlock(&fs_info->fs_roots_radix_lock);
>  			return 0;
>  		}
>  		radix_tree_tag_set(&fs_info->fs_roots_radix,
>  				   (unsigned long)root->root_key.objectid,
>  				   BTRFS_ROOT_TRANS_TAG);
> -		spin_unlock(&fs_info->fs_roots_radix_lock);
> +		set_bit(BTRFS_ROOT_RECORDED, &root->state);
>  		root->last_trans = trans->transid;
> -
> -		/* this is pretty tricky.  We don't want to
> -		 * take the relocation lock in btrfs_record_root_in_trans
> -		 * unless we're really doing the first setup for this root in
> -		 * this transaction.
> -		 *
> -		 * Normally we'd use root->last_trans as a flag to decide
> -		 * if we want to take the expensive mutex.
> -		 *
> -		 * But, we have to set root->last_trans before we
> -		 * init the relocation root, otherwise, we trip over warnings
> -		 * in ctree.c.  The solution used here is to flag ourselves
> -		 * with root IN_TRANS_SETUP.  When this is 1, we're still
> -		 * fixing up the reloc trees and everyone must wait.
> -		 *
> -		 * When this is zero, they can trust root->last_trans and fly
> -		 * through btrfs_record_root_in_trans without having to take the
> -		 * lock.  smp_wmb() makes sure that all the writes above are
> -		 * done before we pop in the zero below
> -		 */
> -		ret = btrfs_init_reloc_root(trans, root);
> -		smp_mb__before_atomic();
> -		clear_bit(BTRFS_ROOT_IN_TRANS_SETUP, &root->state);
> +		spin_unlock(&fs_info->fs_roots_radix_lock);
>  	}
>  	return ret;
>  }
> @@ -476,6 +465,7 @@ void btrfs_add_dropped_root(struct btrfs_trans_handle *trans,
>  	radix_tree_tag_clear(&fs_info->fs_roots_radix,
>  			     (unsigned long)root->root_key.objectid,
>  			     BTRFS_ROOT_TRANS_TAG);
> +	clear_bit(BTRFS_ROOT_RECORDED, &root->state);
>  	spin_unlock(&fs_info->fs_roots_radix_lock);
>  }
>  
> @@ -488,13 +478,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>  	if (!test_bit(BTRFS_ROOT_SHAREABLE, &root->state))
>  		return 0;
>  
> -	/*
> -	 * see record_root_in_trans for comments about IN_TRANS_SETUP usage
> -	 * and barriers
> -	 */
> -	smp_rmb();
> -	if (root->last_trans == trans->transid &&
> -	    !test_bit(BTRFS_ROOT_IN_TRANS_SETUP, &root->state))
> +	if (test_bit(BTRFS_ROOT_RECORDED, &root->state))
>  		return 0;
>  
>  	mutex_lock(&fs_info->reloc_mutex);
> @@ -1531,6 +1515,7 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans)
>  			radix_tree_tag_clear(&fs_info->fs_roots_radix,
>  					(unsigned long)root->root_key.objectid,
>  					BTRFS_ROOT_TRANS_TAG);
> +			clear_bit(BTRFS_ROOT_RECORDED, &root->state);
>  			spin_unlock(&fs_info->fs_roots_radix_lock);
>  
>  			btrfs_free_log(trans, root);
> -- 
> 2.43.0
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ