Message-ID: <1335467575.2775.61.camel@edumazet-glaptop>
Date: Thu, 26 Apr 2012 21:12:55 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Waiman Long <Waiman.Long@...com>
Cc: cl@...ux-foundation.org, penberg@...nel.org, mpm@...enic.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, don.morris@...com
Subject: Re: [PATCH] slub: prevent validate_slab() error due to race
condition
On Thu, 2012-04-26 at 14:57 -0400, Waiman Long wrote:
> The SLUB memory allocator was changed substantially from 3.0 to 3.1:
> some of the page-locking code that updates a slab's free object list
> was replaced with a double-quadword atomic exchange
> (cmpxchg_double_slab), or with a pseudo version that takes the page
> lock when debugging is turned on. In the normal case, that is enough
> to keep the slab in a consistent state. However, when
> CONFIG_SLUB_DEBUG is enabled and the Redzone debugging flag is set,
> the Redzone bytes are also used to mark whether an object is free or
> allocated. That extra state information in the Redzone bytes is not
> protected by cmpxchg_double_slab(). As a result, validate_slab() may
> report a Redzone error if the validation races with a free to a
> debugged slab.
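
For anyone following along, here is a rough user-space analogy of the
race being described (not SLUB code; the names and patterns are made up
for illustration): one thread flips an object between "free" and
"allocated" with a CAS, but rewrites a marker byte standing in for the
Redzone/poison pattern with a plain store that the CAS does not cover.
A checker thread comparing the marker against the CAS-protected state
will occasionally catch the two disagreeing, which has the same shape
as the spurious "Redzone overwritten" report:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define NOBJ		64
#define MARK_FREE	0x6b	/* stand-in for the "free" poison pattern */
#define MARK_INUSE	0xcc	/* stand-in for the "allocated" pattern */

struct obj {
	_Atomic int	is_free;	/* flipped with CAS, like the freelist update */
	unsigned char	marker[8];	/* plain stores, like the Redzone bytes */
};

static struct obj objs[NOBJ];
static _Atomic int done;

/* Alternately "free" and "allocate" every object. */
static void *churn(void *unused)
{
	while (!atomic_load(&done)) {
		for (int i = 0; i < NOBJ; i++) {
			int old = atomic_load(&objs[i].is_free);

			/*
			 * Write the marker for the state we are about to
			 * publish.  This store is NOT covered by the CAS
			 * below, which is the whole problem.
			 */
			memset(objs[i].marker, old ? MARK_INUSE : MARK_FREE,
			       sizeof(objs[i].marker));
			atomic_compare_exchange_strong(&objs[i].is_free,
						       &old, !old);
		}
	}
	return NULL;
}

/* Plays the role of validate_slab(): marker must match the state. */
static void *validate(void *unused)
{
	long mismatches = 0;

	for (int pass = 0; pass < 200000; pass++) {
		for (int i = 0; i < NOBJ; i++) {
			unsigned char want = atomic_load(&objs[i].is_free) ?
					     MARK_FREE : MARK_INUSE;

			if (objs[i].marker[0] != want)
				mismatches++;	/* "Redzone overwritten" analogue */
		}
	}
	printf("spurious mismatches: %ld\n", mismatches);
	atomic_store(&done, 1);
	return NULL;
}

int main(void)
{
	pthread_t c, v;

	for (int i = 0; i < NOBJ; i++)
		memset(objs[i].marker, MARK_INUSE, sizeof(objs[i].marker));

	pthread_create(&c, NULL, churn, NULL);
	pthread_create(&v, NULL, validate, NULL);
	pthread_join(v, NULL);
	pthread_join(c, NULL);
	return 0;
}

Built with something like "gcc -O2 -pthread", the mismatch count is
typically non-zero on an SMP machine, even though every individual
transition is internally consistent.
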
>
> The problem was reported in
>
> https://bugzilla.kernel.org/show_bug.cgi?id=42312
>
> It is fairly easy to reproduce by booting with the kernel parameter
> "slub_debug=FZPU". After booting, run the following command as root:
>
> while true ; do ./slabinfo -v ; sleep 3 ; done
>
> The slabinfo test code can be found in tools/vm/slabinfo.c.
>
> At the same time, load the system with heavy I/O activity by, for
> example, building the Linux kernel. dmesg messages like the following
> will then be reported:
>
> BUG names_cache: Redzone overwritten
> SLUB: names_cache 3 slabs counted but counter=4
>
> This patch fixes the BUG message by acquiring the node-level lock for
> slabs flagged for debugging, avoiding this possible race condition.
> The node-level lock is used instead of the more granular page lock
> because the code later in the function may speculatively acquire the
> node-level lock; taking the page lock first and then the node lock
> could deadlock.
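
A rough user-space sketch of the ordering rule being relied on here
(placeholder pthread locks, not the kernel API): as long as every path
that needs both locks takes the coarser node-level lock before the
finer page-level lock, the ABBA inversion described above cannot
happen.

#include <pthread.h>

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;	/* coarse */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;	/* fine */

/*
 * Every path that needs both locks takes them in the same order:
 * node_lock first, then page_lock.  A path that took page_lock first
 * and then blocked on node_lock, while another path held node_lock
 * and wanted page_lock, would be the ABBA deadlock the changelog is
 * trying to avoid.
 */
static void free_path(void)
{
	pthread_mutex_lock(&node_lock);	/* like spin_lock_irqsave(&n->list_lock, flags) */
	pthread_mutex_lock(&page_lock);	/* like the per-page slab lock */
	/* ... debug checks and list manipulation ... */
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&node_lock);
}

static void validate_path(void)
{
	pthread_mutex_lock(&node_lock);	/* validate_slab_node() holds this */
	pthread_mutex_lock(&page_lock);
	/* ... walk the objects, check the Redzones ... */
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&node_lock);
}

int main(void)
{
	free_path();
	validate_path();
	return 0;
}
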
>
> As the increment of the slab node count and the insertion of the new
> slab into the partial or full slab list is not an atomic operation,
> there is a small time window where the two may not match. This patch
> temporarily works around the problem by allowing the node count to be
> one larger than the number of slabs present in the lists. This
> workaround may not hold if more than one CPU is actively adding slabs
> to the same node, but it should be good enough to work around the
> problem in most cases.
>
> To really fix the issue, the overall synchronization between debug slub
> operations and slub validation needs a revisit.
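
The window being tolerated looks roughly like this user-space sketch
(simplified, made-up names, not the actual slub paths): the per-node
counter is bumped before the new slab is linked into a list, so a
checker counting list entries under the list lock can see the counter
run one ahead, and with a second concurrent adder it can run more than
one ahead, which is why the +1 tolerance below is only "good enough"
rather than a real fix.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long nr_slabs;		/* like n->nr_slabs */
static long list_len;			/* like the partial/full lists */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Like the slow allocation path: count first, link second. */
static void *add_slabs(void *unused)
{
	for (int i = 0; i < 1000000; i++) {
		atomic_fetch_add(&nr_slabs, 1);	/* like inc_slabs_node() */
		/* <-- window: the counter is ahead of the lists here */
		pthread_mutex_lock(&list_lock);
		list_len++;			/* like add_partial()/add_full() */
		pthread_mutex_unlock(&list_lock);
	}
	return NULL;
}

/* Like validate_slab_node(): count entries under the list lock. */
static void *check(void *unused)
{
	long beyond_tolerance = 0;

	for (int pass = 0; pass < 1000000; pass++) {
		pthread_mutex_lock(&list_lock);
		long counted = list_len;
		long counter = atomic_load(&nr_slabs);
		pthread_mutex_unlock(&list_lock);

		/* Same shape of check as the hunk below. */
		if (counted != counter && counted + 1 != counter)
			beyond_tolerance++;	/* the workaround would still warn */
	}
	printf("mismatches the +1 tolerance cannot hide: %ld\n",
	       beyond_tolerance);
	return NULL;
}

int main(void)
{
	pthread_t a, b, v;

	pthread_create(&a, NULL, add_slabs, NULL);
	pthread_create(&b, NULL, add_slabs, NULL);	/* second adder defeats the +1 */
	pthread_create(&v, NULL, check, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_join(v, NULL);
	return 0;
}
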
>
> This patch also fixes a number of "code indent should use tabs where
> possible" errors reported by checkpatch.pl in the __slab_free()
> function by replacing groups of eight spaces with real tabs.
>
> After applying the patch, the SLUB errors and warnings are all gone
> on the 4-CPU x86-64 test machine.
>
> Signed-off-by: Waiman Long <waiman.long@...com>
> Reviewed-by: Don Morris <don.morris@...com>
> ---
> mm/slub.c | 46 +++++++++++++++++++++++++++++++++-------------
> 1 files changed, 33 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index ffe13fd..4ca3140 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2445,8 +2445,18 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>
> stat(s, FREE_SLOWPATH);
>
> - if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
> - return;
> + if (kmem_cache_debug(s)) {
> + /*
> + * We need to acquire the node lock to prevent spurious error
> + * with validate_slab().
> + */
> + n = get_node(s, page_to_nid(page));
> + spin_lock_irqsave(&n->list_lock, flags);
> + if (!free_debug_processing(s, page, x, addr)) {
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return;
> + }
missing spin_unlock_irqrestore(&n->list_lock, flags); on the success
path?  As written, list_lock is still held when we fall through to the
cmpxchg loop below.
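
Something along these lines, perhaps (untested sketch, assuming the
debug checks are all that needs to be serialized against
validate_slab() here), so the lock is dropped on both paths before the
rest of __slab_free() runs:

	if (kmem_cache_debug(s)) {
		/*
		 * Take the node lock so the debug checks cannot race
		 * with validate_slab(), then drop it again: the code
		 * further down may take list_lock itself.
		 */
		n = get_node(s, page_to_nid(page));
		spin_lock_irqsave(&n->list_lock, flags);
		if (!free_debug_processing(s, page, x, addr)) {
			spin_unlock_irqrestore(&n->list_lock, flags);
			return;
		}
		spin_unlock_irqrestore(&n->list_lock, flags);
	}
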
> + }
>
> do {
> prior = page->freelist;
> @@ -2467,7 +2477,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>
> else { /* Needs to be taken off a list */
>
> - n = get_node(s, page_to_nid(page));
> + n = get_node(s, page_to_nid(page));
> /*
> * Speculatively acquire the list_lock.
> * If the cmpxchg does not succeed then we may
> @@ -2501,10 +2511,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
> * The list lock was not taken therefore no list
> * activity can be necessary.
> */
> - if (was_frozen)
> - stat(s, FREE_FROZEN);
> - return;
> - }
> + if (was_frozen)
> + stat(s, FREE_FROZEN);
> + return;
> + }
>
> /*
> * was_frozen may have been set after we acquired the list_lock in
> @@ -2514,7 +2524,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
> stat(s, FREE_FROZEN);
> else {
> if (unlikely(!inuse && n->nr_partial > s->min_partial))
> - goto slab_empty;
> + goto slab_empty;
>
> /*
> * Objects left in the slab. If it was not on the partial list before
> @@ -4122,7 +4132,7 @@ static void validate_slab_slab(struct kmem_cache *s, struct page *page,
> static int validate_slab_node(struct kmem_cache *s,
> struct kmem_cache_node *n, unsigned long *map)
> {
> - unsigned long count = 0;
> + unsigned long count = 0, n_count;
> struct page *page;
> unsigned long flags;
>
> @@ -4143,10 +4153,20 @@ static int validate_slab_node(struct kmem_cache *s,
> validate_slab_slab(s, page, map);
> count++;
> }
> - if (count != atomic_long_read(&n->nr_slabs))
> - printk(KERN_ERR "SLUB: %s %ld slabs counted but "
> - "counter=%ld\n", s->name, count,
> - atomic_long_read(&n->nr_slabs));
> + n_count = atomic_long_read(&n->nr_slabs);
> + /*
> + * The following workaround is to greatly reduce the chance of counter
> + * mismatch messages due to the fact that inc_slabs_node() and the
> + * subsequent insertion into the partial or full slab list is not
> + * atomic. Consequently, there is a small timing window when the two
> + * are not in the same state. A possible fix is to take the node lock
> + * while doing inc_slabs_node() and slab insertion, but that may
> + * require substantial changes to existing slow path slab allocation
> + * logic.
> + */
> + if ((count != n_count) && (count + 1 != n_count))
> + printk(KERN_ERR "SLUB: %s %ld slabs counted but counter=%ld\n",
> + s->name, count, n_count);
>
> out:
> spin_unlock_irqrestore(&n->list_lock, flags);