linux-kernel - Re: [RFC PATCH] xfs: fix per-cpu CIL structure aggregation racing with dying cpus

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230823012011.GE11286@frogsfrogsfrogs>
Date:   Tue, 22 Aug 2023 18:20:11 -0700
From:   "Darrick J. Wong" <djwong@...nel.org>
To:     Dave Chinner <david@...morbit.com>
Cc:     tglx@...utronix.de, peterz@...radead.org,
        xfs <linux-xfs@...r.kernel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] xfs: fix per-cpu CIL structure aggregation racing
 with dying cpus

On Wed, Aug 23, 2023 at 09:28:16AM +1000, Dave Chinner wrote:
> On Tue, Aug 22, 2023 at 11:30:16AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 21, 2023 at 03:17:52PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 04, 2023 at 03:38:54PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@...nel.org>
> > > > 
> > > > In commit 7c8ade2121200 ("xfs: implement percpu cil space used
> > > > calculation"), the XFS committed (log) item list code was converted to
> > > > use per-cpu lists and space tracking to reduce cpu contention when
> > > > multiple threads are modifying different parts of the filesystem and
> > > > hence end up contending on the log structures during transaction commit.
> > > > Each CPU tracks its own commit items and space usage, and these do not
> > > > have to be merged into the main CIL until either someone wants to push
> > > > the CIL items, or we run over a soft threshold and switch to slower (but
> > > > more accurate) accounting with atomics.
> > > > 
> > > > Unfortunately, the for_each_cpu iteration suffers from the same race
> > > > with cpu dying problem that was identified in commit 8b57b11cca88f
> > > > ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
> > > > cpu_online_mask before the CPUHP_XFS_DEAD callback gets called.  As a
> > > > result, both CIL percpu structure aggregation functions fail to collect
> > > > the items and accounted space usage at the correct point in time.
> > > > 
> > > > If we're lucky, the items that are collected from the online cpus exceed
> > > > the space given to those cpus, and the log immediately shuts down in
> > > > xlog_cil_insert_items due to the (apparent) log reservation overrun.
> > > > This happens periodically with generic/650, which exercises cpu hotplug
> > > > vs. the filesystem code.
> > > > 
> > > > Applying the same sort of fix from 8b57b11cca88f to the CIL code seems
> > > > to make the generic/650 problem go away, but I've been told that tglx
> > > > was not happy when he saw:
> > > > 
> > > > "...the only thing we actually need to care about is that
> > > > percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
> > > > there are no CPUs dying, it has no addition overhead except for a
> > > > cpumask_or() operation."
> > > > 
> > > > I have no idea what the /correct/ solution is, but I've been holding on
> > > > to this patch for 3 months.  In that time, 8b57b11cca88 hasn't been
> > > > reverted and cpu_dying_mask hasn't been removed, so I'm sending this and
> > > > we'll see what happens.
> > > > 
> > > > So, how /do/ we perform periodic aggregation of per-cpu data safely?
> > > > Move the xlog_cil_pcp_dead call to the dying section, where at least the
> > > > other cpus will all be stopped?  And have it dump its items into any
> > > > online cpu's data structure?
> > > 
> > > I suspect that we have to stop using for_each_*cpu() and hotplug
> > > notifiers altogether for this code.
> > > 
> > > That is, we simply create our own bitmap for nr_possible_cpus in the
> > > CIL checkpoint context we allocate for each checkpoint (i.e. the
> > > struct xfs_cil_ctx). When we store something on that CPU for that
> > > CIL context, we set the corresponding bit for that CPU. Then when we
> > > are aggregating the checkpoint, we simply walk all the cpus with the
> > > "has items" bit set and grab everything.
> > > 
> > > This gets rid of the need for hotplug notifiers completely - we just
> > > don't care if the CPU is online or offline when we sweep the CIL
> > > context for items at push time - if the bit is set then there's
> > > stuff to sweep...
> > 
> > Oooh, good idea.  Something like this, then?  Lightly tested via
> > generic/650 for five minutes...
> 
> Not quite what I was saying - you put the cpu mask in the struct
> xfs_cil, which means it can potentially be accessed from both commit
> and aggregation at the same time for different checkpoints.
> 
> What I was suggesting is that it gets placed in the struct
> xfs_cil_ctx which is guaranteed to be private to

private to...?

> > 
> > --
> > From: Darrick J. Wong <djwong@...nel.org>
> > Subject: [PATCH] xfs: fix per-cpu CIL structure aggregation racing with dying cpus
> > 
> > In commit 7c8ade2121200 ("xfs: implement percpu cil space used
> > calculation"), the XFS committed (log) item list code was converted to
> > use per-cpu lists and space tracking to reduce cpu contention when
> > multiple threads are modifying different parts of the filesystem and
> > hence end up contending on the log structures during transaction commit.
> > Each CPU tracks its own commit items and space usage, and these do not
> > have to be merged into the main CIL until either someone wants to push
> > the CIL items, or we run over a soft threshold and switch to slower (but
> > more accurate) accounting with atomics.
> > 
> > Unfortunately, the for_each_cpu iteration suffers from the same race
> > with cpu dying problem that was identified in commit 8b57b11cca88f
> > ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
> > cpu_online_mask before the CPUHP_XFS_DEAD callback gets called.  As a
> > result, both CIL percpu structure aggregation functions fail to collect
> > the items and accounted space usage at the correct point in time.
> > 
> > If we're lucky, the items that are collected from the online cpus exceed
> > the space given to those cpus, and the log immediately shuts down in
> > xlog_cil_insert_items due to the (apparent) log reservation overrun.
> > This happens periodically with generic/650, which exercises cpu hotplug
> > vs. the filesystem code:
> > 
> > smpboot: CPU 3 is now offline
> > XFS (sda3): ctx ticket reservation ran out. Need to up reservation
> > XFS (sda3): ticket reservation summary:
> > XFS (sda3):   unit res    = 9268 bytes
> > XFS (sda3):   current res = -40 bytes
> > XFS (sda3):   original count  = 1
> > XFS (sda3):   remaining count = 1
> > XFS (sda3): Filesystem has been shut down due to log error (0x2).
> > 
> > Applying the same sort of fix from 8b57b11cca88f to the CIL code seems
> > to make the generic/650 problem go away, but I've been told that tglx
> > was not happy when he saw:
> > 
> > "...the only thing we actually need to care about is that
> > percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
> > there are no CPUs dying, it has no addition overhead except for a
> > cpumask_or() operation."
> > 
> > The CPU hotplug code is rather complex and difficult to understand and I
> > don't want to try to understand the cpu hotplug locking well enough to
> > use cpu_dying mask.  Furthermore, there's a performance improvement that
> > could be had here.  Attach a private cpu mask to the CIL structure so
> > that we can track exactly which cpus have accessed the percpu data at
> > all.  It doesn't matter if the cpu has since gone offline; log item
> > aggregation will still find the items.  Better yet, we skip cpus that
> > have not recently logged anything.
> > 
> > Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/
> > Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
> > Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
> > Cc: tglx@...utronix.de
> > Cc: peterz@...radead.org
> > Signed-off-by: Darrick J. Wong <djwong@...nel.org>
> > ---
> >  fs/xfs/xfs_log_cil.c  |   58 ++++++++++++++++++++-----------------------------
> >  fs/xfs/xfs_log_priv.h |   15 ++++++-------
> >  fs/xfs/xfs_super.c    |    1 -
> >  3 files changed, 31 insertions(+), 43 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index eccbfb99e894..d91589a959ee 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -124,7 +124,7 @@ xlog_cil_push_pcp_aggregate(
> >  	struct xlog_cil_pcp	*cilpcp;
> >  	int			cpu;
> >  
> > -	for_each_online_cpu(cpu) {
> > +	for_each_cpu(cpu, &cil->xc_pcpmask) {
> >  		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> >  
> >  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
> > @@ -165,7 +165,14 @@ xlog_cil_insert_pcp_aggregate(
> >  	if (!test_and_clear_bit(XLOG_CIL_PCP_SPACE, &cil->xc_flags))
> >  		return;
> >  
> > -	for_each_online_cpu(cpu) {
> > +	/*
> > +	 * We don't hold xc_push_lock here, so we can race with other cpus
> > +	 * setting xc_pcpmask.  However, we've atomically cleared PCP_SPACE
> > +	 * which forces other threads to add to the global space used count.
> > +	 * xc_pcpmask is a superset of cilpcp structures that could have a
> > +	 * nonzero space_used.
> > +	 */
> > +	for_each_cpu(cpu, &cil->xc_pcpmask) {
> >  		int	old, prev;
> 
> I don't think the xc_push_lock means anything here; if we use atomic
> bit ops to set the pcpmask it certainly doesn't. AFAICT, there
> aren't any new races here by adding the private pcp mask. If we do
> race, then the push aggregation will still correctly aggregate any
> count that was missed on the global total...

Agreed.

> >  		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> > @@ -186,6 +193,7 @@ xlog_cil_ctx_switch(
> >  	xlog_cil_set_iclog_hdr_count(cil);
> >  	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
> >  	set_bit(XLOG_CIL_PCP_SPACE, &cil->xc_flags);
> > +	cpumask_clear(&cil->xc_pcpmask);
> >  	ctx->sequence = ++cil->xc_current_sequence;
> >  	ctx->cil = cil;
> >  	cil->xc_ctx = ctx;
> > @@ -554,6 +562,7 @@ xlog_cil_insert_items(
> >  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> >  	int			space_used;
> >  	int			order;
> > +	unsigned int		cpu_nr;
> >  	struct xlog_cil_pcp	*cilpcp;
> >  
> >  	ASSERT(tp);
> > @@ -579,6 +588,18 @@ xlog_cil_insert_items(
> >  	 */
> >  	cilpcp = get_cpu_ptr(cil->xc_pcp);
> >  
> > +	/*
> > +	 * Add this cpu to the mask of cpus that could have touched xc_pcp.
> > +	 * The push aggregation function is called with the push lock held,
> > +	 * so take it here if we have to update the mask.
> > +	 */
> > +	cpu_nr = current_cpu();
> > +	if (!cpumask_test_cpu(cpu_nr, &cil->xc_pcpmask)) {
> > +		spin_lock(&cil->xc_push_lock);
> > +		cpumask_set_cpu(cpu_nr, &cil->xc_pcpmask);
> > +		spin_unlock(&cil->xc_push_lock);
> > +	}
> 
> I don't think this needs a spin lock - an atomic test-and-set would
> be just fine here (i.e. cpumask_test_and_set_cpu()) and so we only
> need a single atomic operation here instead of two. i.e. make it a
> test-test-and-set operation similar to clearing XLOG_CIL_EMPTY in
> this fast path:
> 
> 	if (!cpumask_test_cpu(cpu, &cil->xc_pcpmask))
> 		cpumask_test_and_set_cpu(cpu_nr, &cil->xc_pcpmask);
> 
> We also don't have to care about memory barriers, etc w.r.t the
> push aggregate function because this is all done under the
> cil->xc_ctx_lock held shared.

Oh, hah.  I missed that cpumask_test_and_set_cpu is atomic, so we don't
need the spinlock here at all.  Good catch!

> Also, I think it would be better to put all this insert code
> explicitly under a single defined preempt disabled per-cpu section,
> rather than using current_cpu() inside the preempt disabled section
> without commenting on why it can be used like this safely. i.e.:
> 
>         /*
> 	 * Grab the cpu before we start any accounting.  That
> 	 * ensures that we are running with pre-emption disabled and
> 	 * so we can't be scheduled away between split sample/update
> 	 * operations that are done without outside locking to
> 	 * serialise them.
>          */
> 	cpu = get_cpu();
> 	cilpcp = this_cpu_ptr(cil->xc_pcp);
> 
> 	/* Tell the future push it has work to do on this CPU. */
> 	if (!cpumask_test_cpu(cpu, &cil->xc_pcpmask))
> 		cpumask_test_and_set_cpu(cpu, &cil->xc_pcpmask);
> 	....
> 	<done with cilpcp>
> 	put_cpu(cpu);

Much cleaner, thanks for the suggestion.

> .....
> 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 09638e8fb4ee..ef7775657ce3 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -2313,7 +2313,6 @@ xfs_cpu_dead(
> >  	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
> >  		spin_unlock(&xfs_mount_list_lock);
> >  		xfs_inodegc_cpu_dead(mp, cpu);
> > -		xlog_cil_pcp_dead(mp->m_log, cpu);
> >  		spin_lock(&xfs_mount_list_lock);
> >  	}
> >  	spin_unlock(&xfs_mount_list_lock);
> 
> I wonder if we can do something similar for the inodegc code, and
> once again get rid of the need for hotpulg notifiers in XFS?

I would imagine so, if you don't mind bloating up xfs_mount.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@...morbit.com