linux-kernel - [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1426261632-8911-1-git-send-email-mark.rutland@arm.com>
Date:	Fri, 13 Mar 2015 15:47:12 +0000
From:	Mark Rutland <mark.rutland@....com>
To:	linux-kernel@...r.kernel.org
Cc:	Mark Rutland <mark.rutland@....com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Catalin Marinas <catalin.marinas@....com>,
	Christoph Lameter <cl@...ux.com>,
	David Rientjes <rientjes@...gle.com>,
	Jesper Dangaard Brouer <brouer@...hat.com>,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Pekka Enberg <penberg@...nel.org>,
	Steve Capper <steve.capper@...aro.org>
Subject: [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels

Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
removing preemption on/off") introduced an occasional hang for kernels
built with CONFIG_PREEMPT && !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

GCC 4.9 has been observed to hoist the load of c and c->tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload). On arm64 the generated assembly
looks like:

ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
ffffffc00016d3cc:       eb04003f        cmp     x1, x4
ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>

If the thread is preempted between the load of c->tid (into x1) and tid
(into x4), and and allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s->cpu_slab->tid wraps, which may be forever in the absence of
allocations on the same CPU.

The loop itself is somewhat pointless as the thread can be preempted at
any point after the loop before the this_cpu_cmpxchg_double, and the
window for preemption within the loop is far smaller. Given that we
assume accessing c->tid is safe for the loop condition, and we retry
when the cmpxchg fails, we can get rid of the loop entirely and just
access c->tid via the raw_cpu_ptr for s->cpu_slab.

This patch ensures that we reload c->tid even when the compiler would
otherwise assume that it could cache the value, and removes the
redundant (and misleading) loop. Comments are updated to take these
changes into account.

Signed-off-by: Mark Rutland <mark.rutland@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@....com>
Cc: Christoph Lameter <cl@...ux.com>
Cc: David Rientjes <rientjes@...gle.com>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@....com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Pekka Enberg <penberg@...nel.org>
Cc: Steve Capper <steve.capper@...aro.org>
---
 mm/slub.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

Hi all,

I wasn't entirely sure if the read of c->tid in the loop condition was safe
given the general rules for this_cpu operations. I assume that a read on a
remote CPU is always safe, and hence we can just load tid via the raw_cpu_ptr
and leave it to the cmpxchg to detect when we get preempted. If the remote read
is not safe then as far as I can tell the original patch is broken in this
regard even in the absence of reordering by the compiler.

I've boot-tested this patch a number of times (with SMP and !SMP) on the
platform I noticed the lockup on, and everything seems stable. Do we have any
SL*B stress tests that could give this a better thrashing?

Otherwise any and all comments/review/testing would be appreciated.

Mark.

diff --git a/mm/slub.c b/mm/slub.c
index 6832c4e..2b9a75e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2437,19 +2437,13 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 		return NULL;
 redo:
 	/*
-	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
-	 * enabled. We may switch back and forth between cpus while
-	 * reading from one cpu area. That does not matter as long
-	 * as we end up on the original cpu again when doing the cmpxchg.
-	 *
-	 * We should guarantee that tid and kmem_cache are retrieved on
-	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
-	 * to check if it is matched or not.
+	 * Preemption is enabled. We may switch back and forth between cpus,
+	 * and other threads may access the current cpu's area. That does not
+	 * matter as long as when we reach the cmpxhg we end up on the original
+	 * cpu and tid hasn't changed. We'll retry until that happens.
 	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);
 
 	/*
 	 * Irqless object alloc/free algorithm used here depends on sequence
@@ -2710,15 +2704,13 @@ static __always_inline void slab_free(struct kmem_cache *s,
 
 redo:
 	/*
-	 * Determine the currently cpus per cpu slab.
-	 * The cpu may change afterward. However that does not matter since
-	 * data is retrieved via this pointer. If we are on the same cpu
-	 * during the cmpxchg then the free will succedd.
+	 * Preemption is enabled. We may switch back and forth between cpus,
+	 * and other threads may access the current cpu's area. That does not
+	 * matter as long as when we reach the cmpxhg we end up on the original
+	 * cpu and tid hasn't changed. We'll retry until that happens.
 	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);
 
 	/* Same with comment on barrier() in slab_alloc_node() */
 	barrier();
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/