netdev - Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220706120201.GA7996@breakpoint.cc>
Date:   Wed, 6 Jul 2022 14:02:01 +0200
From:   Florian Westphal <fw@...len.de>
To:     Kajetan Puchalski <kajetan.puchalski@....com>
Cc:     Will Deacon <will@...nel.org>, Florian Westphal <fw@...len.de>,
        Pablo Neira Ayuso <pablo@...filter.org>,
        Jozsef Kadlecsik <kadlec@...filter.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, Mel Gorman <mgorman@...e.de>,
        lukasz.luba@....com, dietmar.eggemann@....com,
        mark.rutland@....com, broonie@...nel.org,
        netfilter-devel@...r.kernel.org, coreteam@...filter.org,
        netdev@...r.kernel.org, stable@...r.kernel.org,
        regressions@...ts.linux.dev, linux-kernel@...r.kernel.org,
        peterz@...radead.org
Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere
 Altra

Kajetan Puchalski <kajetan.puchalski@....com> wrote:
> On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > > Sorry, but I have absolutely no context here. We have a handy document
> > > > describing the differences between atomic_t and refcount_t:
> > > > 
> > > > 	Documentation/core-api/refcount-vs-atomic.rst
> > > > 
> > > > What else do you need to know?
> > > 
> > > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > > no longer have ordering to subsequent reads in the absence of an address
> > > dependency.
> > 
> > I think the patch above needs auditing with the relaxed behaviour in mind,
> > but for the specific crash reported here possibly something like the diff
> > below?
> > 
> > Will
> > 
> > --->8
> > 
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 082a2fd8d85b..5ad9fcc84269 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> >                  * already fired or someone else deleted it. Just drop ref
> >                  * and move to next entry.
> >                  */
> > +               smp_rmb();      /* XXX: Why? */
> >                 if (net_eq(nf_ct_net(tmp), net) &&
> >                     nf_ct_is_confirmed(tmp) &&
> >                     nf_ct_delete(tmp, 0, 0))
> > 
> 
> Just to follow up, I think you're right, the patch in question should be
> audited further for other missing memory barrier issues.
> While this one smp_rmb() helps a lot, ie lets the test run for at least
> an hour or two, an overnight 6 hour test still resulted in the same
> crash somewhere along the way so it looks like it's not the only one
> that's needed.

Yes, I don't think that refcount_inc_not_zero is useable as-is for conntrack.
Here is a patch, I hope this will get things back to a working order without
a revert to atomic_t api.

Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1						CPU2
						enounters 'ct' during hlist walk
 delete_from_lists
 refcount drops to 0
 kmem_cache_free(ct);
 __nf_conntrack_alloc() // returns same object
						refcount_inc_not_zero(ct); /* might fail */

						/* If set, ct is public/in the hash table */
						test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails.  CPU2 ignores the object and moves to
   the next entry in the list.  This happens for objects that are about
   to be free'd, that have been free'd, or that have been reallocated
   by __nf_conntrack_alloc(), but where the refcount has not been
   increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
   in ct->status.  If set, the object is public/in the table.

   If not, the object must be skipped; CPU2 calls nf_ct_put() to
   un-do the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
   of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
   IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.

This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because other CPU might still 'see' the object, refcount_set(1)
"resurrects" the object, so we need to make sure that other CPUs will
also observe the right contents.  In particular, the CONFIRMED bit test
must only pass once the object is fully initialised and either in the
hash or about to be inserted (with locks held to delay possible unlink from
early_drop or gc worker).

I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.

Reported-by: Kajetan Puchalski <kajetan.puchalski@....com>
Diagnosed-by: Will Deacon <will@...nel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@...len.de>
---
 include/net/netfilter/nf_conntrack.h |  3 +++
 net/netfilter/nf_conntrack_core.c    | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..3dc3646ffba2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 /* use after obtaining a reference count */
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
+	/* ->status and ->timeout loads must happen after refcount increase */
+	smp_rmb();
+
 	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
 	       !nf_ct_is_dying(ct);
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..072cabf1b296 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
 		 */
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+			/* re-check key after refcount */
+			smp_rmb();
+
 			if (likely(nf_ct_key_equal(h, tuple, zone, net)))
 				goto found;
 
@@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
 		 * We steal the timer reference.  If that fails timer has
 		 * already fired or someone else deleted it. Just drop ref
 		 * and move to next entry.
+		 *
+		 * smp_rmb to ensure ->ct_net and ->status are loaded after
+		 * refcount increase.
 		 */
+		smp_rmb();
 		if (net_eq(nf_ct_net(tmp), net) &&
 		    nf_ct_is_confirmed(tmp) &&
 		    nf_ct_delete(tmp, 0, 0))
@@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
 			if (!refcount_inc_not_zero(&tmp->ct_general.use))
 				continue;
 
+			/* load ct->status after refcount */
+			smp_rmb();
 			if (gc_worker_skip_ct(tmp)) {
 				nf_ct_put(tmp);
 				continue;
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
 
+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released.  Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but its not (anymore).
+	 */
+	smp_wmb();
+
 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);
 
-- 
2.35.1