linux-kernel - Re: [PATCH] xfs: avoid redundant AGFL buffer invalidation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZlRLWP3Ty6uvMzjd@dread.disaster.area>
Date: Mon, 27 May 2024 18:59:04 +1000
From: Dave Chinner <david@...morbit.com>
To: Gao Xiang <hsiangkao@...ux.alibaba.com>
Cc: linux-xfs@...r.kernel.org, "Darrick J . Wong" <djwong@...nel.org>,
	Chandan Babu R <chandanbabu@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] xfs: avoid redundant AGFL buffer invalidation

On Mon, May 27, 2024 at 02:10:06PM +0800, Gao Xiang wrote:
> Currently AGFL blocks can be filled from the following three sources:
>  - allocbt free blocks, as in xfs_allocbt_free_block();
>  - rmapbt free blocks, as in xfs_rmapbt_free_block();
>  - refilled from freespace btrees, as in xfs_alloc_fix_freelist().
> 
> Originally, allocbt free blocks would be marked as stale only when they
> put back in the general free space pool as Dave mentioned on IRC, "we
> don't stale AGF metadata btree blocks when they are returned to the
> AGFL .. but once they get put back in the general free space pool, we
> have to make sure the buffers are marked stale as the next user of
> those blocks might be user data...."

So it turns out that xfs_alloc_ag_vextent_small() does this when
allocating from the AGFL:

	if (args->datatype & XFS_ALLOC_USERDATA) {
                struct xfs_buf  *bp;

                error = xfs_trans_get_buf(args->tp, args->mp->m_ddev_targp,
                                XFS_AGB_TO_DADDR(args->mp, args->agno, fbno),
                                args->mp->m_bsize, 0, &bp);
                if (error)
                        goto error;
                xfs_trans_binval(args->tp, bp);
        }

Hence we're already invalidating any buffer over the block allocated
from the AGFL to ensure nothing will overwrite the user data that
will be placed in the block after the allocation is committed.

This means we can trigger the log force from this path - more
about that below....

> However, after commit ca250b1b3d71 ("xfs: invalidate allocbt blocks
> moved to the free list") and commit edfd9dd54921 ("xfs: move buffer
> invalidation to xfs_btree_free_block"), even allocbt / bmapbt free
> blocks will be invalidated immediately since they may fail to pass
> V5 format validation on writeback even writeback to free space would be
> safe.

*nod*

> IOWs, IMHO currently there is actually no difference of free blocks
> between AGFL freespace pool and the general free space pool.  So let's
> avoid extra redundant AGFL buffer invalidation, since otherwise we're
> currently facing unnecessary xfs_log_force() due to xfs_trans_binval()
> again on buffers already marked as stale before as below:
> 
> [  333.507469] Call Trace:
> [  333.507862]  xfs_buf_find+0x371/0x6a0
> [  333.508451]  xfs_buf_get_map+0x3f/0x230
> [  333.509062]  xfs_trans_get_buf_map+0x11a/0x280
> [  333.509751]  xfs_free_agfl_block+0xa1/0xd0
> [  333.510403]  xfs_agfl_free_finish_item+0x16e/0x1d0
> [  333.511157]  xfs_defer_finish_noroll+0x1ef/0x5c0
> [  333.511871]  xfs_defer_finish+0xc/0xa0
> [  333.512471]  xfs_itruncate_extents_flags+0x18a/0x5e0
> [  333.513253]  xfs_inactive_truncate+0xb8/0x130
> [  333.513930]  xfs_inactive+0x223/0x270
> 
> And xfs_log_force() will take tens of milliseconds with AGF buffer
> locked.  It becomes an unnecessary long latency especially on our PMEM
> devices with FSDAX enabled.  Also fstests are passed with this patch.

Well, keep in mind the reason the log force was introduced in
xfs_buf_lock() - commit ed3b4d6cdc81 ("xfs: Improve scalability of
busy extent tracking") says:

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), then buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.

IOWs, this "log force on buffer lock" trigger isn't specific to AGFL
blocks.  The log force is placed there to ensure that access latency
to any block we rapidly reallocate is *only* a few milliseconds,
rather than being "whenever the next log writes trigger" which could
be tens of seconds away....

Hence we need to be aware that removing the double invalidation on
the AGFL blocks does not make this "log force on stale buffer"
latency issue go away, it just changes when and where it happens
(i.e. on reallocation).

> Signed-off-by: Gao Xiang <hsiangkao@...ux.alibaba.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 18 ++----------------
>  1 file changed, 2 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 6cb8b2ddc541..a80d2a31252a 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2432,22 +2432,8 @@ xfs_free_agfl_block(
>  	struct xfs_buf		*agbp,
>  	struct xfs_owner_info	*oinfo)
>  {
> -	int			error;
> -	struct xfs_buf		*bp;
> -
> -	error = xfs_free_ag_extent(tp, agbp, agno, agbno, 1, oinfo,
> -				   XFS_AG_RESV_AGFL);
> -	if (error)
> -		return error;
> -
> -	error = xfs_trans_get_buf(tp, tp->t_mountp->m_ddev_targp,
> -			XFS_AGB_TO_DADDR(tp->t_mountp, agno, agbno),
> -			tp->t_mountp->m_bsize, 0, &bp);
> -	if (error)
> -		return error;
> -	xfs_trans_binval(tp, bp);
> -
> -	return 0;
> +	return xfs_free_ag_extent(tp, agbp, agno, agbno, 1, oinfo,
> +				  XFS_AG_RESV_AGFL);
>  }

I'd just get rid of the xfs_free_agfl_block() wrapper entirely and
call xfs_free_ag_extent() directly from xfs_agfl_free_finish_item().

-Dave.
-- 
Dave Chinner
david@...morbit.com