Date:	Thu, 28 Mar 2013 22:43:03 -0400
From:	Paul Gortmaker <paul.gortmaker@...driver.com>
To:	Christoph Lameter <cl@...ux.com>
Cc:	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	LKML <linux-kernel@...r.kernel.org>,
	RT <linux-rt-users@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Clark Williams <clark@...hat.com>,
	Pekka Enberg <penberg@...nel.org>
Subject: Re: [RT LATENCY] 249 microsecond latency caused by slub's
 unfreeze_partials() code.

On Thu, Mar 28, 2013 at 1:30 PM, Christoph Lameter <cl@...ux.com> wrote:
> This patch requires the earlier bug fix.
>
>
> Subject: slub: Make cpu partial slab support configurable

Hi Christoph,

Minor nit:

   Applying: slub: Make cpu partial slab support configurable
   /home/paul/git/linux-head/.git/rebase-apply/patch:141: space before tab in indent.
		       		{ /* Needs to be taken off a list */
   warning: 1 line adds whitespace errors.

I've tagged the block below where this happens.
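
(FWIW, a quick way to catch this sort of thing before applying is something
like:

   git apply --check --whitespace=error /path/to/the-patch

where the path is just whatever I saved the mbox as -- purely a habit, not a
requirement.)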

Also I've got a question/comment about the Kconfig addition below...

>
> cpu partial support can introduce a level of indeterminism that is not wanted
> in certain contexts (like a realtime kernel). Make it configurable.
>
> Signed-off-by: Christoph Lameter <cl@...ux.com>
>
> Index: linux/include/linux/slub_def.h
> ===================================================================
> --- linux.orig/include/linux/slub_def.h 2013-03-28 12:14:26.958358688 -0500
> +++ linux/include/linux/slub_def.h      2013-03-28 12:19:46.275866639 -0500
> @@ -47,7 +47,9 @@ struct kmem_cache_cpu {
>         void **freelist;        /* Pointer to next available object */
>         unsigned long tid;      /* Globally unique transaction id */
>         struct page *page;      /* The slab from which we are allocating */
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         struct page *partial;   /* Partially allocated frozen slabs */
> +#endif
>  #ifdef CONFIG_SLUB_STATS
>         unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
> @@ -84,7 +86,9 @@ struct kmem_cache {
>         int size;               /* The size of an object including meta data */
>         int object_size;        /* The size of an object without meta data */
>         int offset;             /* Free pointer offset. */
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         int cpu_partial;        /* Number of per cpu partial objects to keep around */
> +#endif
>         struct kmem_cache_order_objects oo;
>
>         /* Allocation and freeing of slabs */
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c        2013-03-28 12:18:48.446813991 -0500
> +++ linux/mm/slub.c     2013-03-28 12:19:46.275866639 -0500
> @@ -1531,7 +1531,9 @@ static inline void *acquire_slab(struct
>         return freelist;
>  }
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>  static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
> +#endif
>  static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
>
>  /*
> @@ -1570,10 +1572,20 @@ static void *get_partial_node(struct kme
>                         object = t;
>                         available =  page->objects - (unsigned long)page->lru.next;
>                 } else {
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>                         available = put_cpu_partial(s, page, 0);
>                         stat(s, CPU_PARTIAL_NODE);
> +#else
> +                       BUG();
> +#endif
>                 }
> -               if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
> +               if (kmem_cache_debug(s) ||
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
> +                       available > s->cpu_partial / 2
> +#else
> +                       available > 0
> +#endif
> +                       )
>                         break;
>
>         }
> @@ -1874,6 +1886,7 @@ redo:
>         }
>  }
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>  /*
>   * Unfreeze all the cpu partial slabs.
>   *
> @@ -1989,6 +2002,7 @@ static int put_cpu_partial(struct kmem_c
>         } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
>         return pobjects;
>  }
> +#endif
>
>  static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
>  {
> @@ -2013,7 +2027,9 @@ static inline void __flush_cpu_slab(stru
>                 if (c->page)
>                         flush_slab(s, c);
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>                 unfreeze_partials(s, c);
> +#endif
>         }
>  }
>
> @@ -2029,7 +2045,11 @@ static bool has_cpu_slab(int cpu, void *
>         struct kmem_cache *s = info;
>         struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         return c->page || c->partial;
> +#else
> +       return c->page;
> +#endif
>  }
>
>  static void flush_all(struct kmem_cache *s)
> @@ -2225,7 +2245,10 @@ static void *__slab_alloc(struct kmem_ca
>         page = c->page;
>         if (!page)
>                 goto new_slab;
> +
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>  redo:
> +#endif
>
>         if (unlikely(!node_match(page, node))) {
>                 stat(s, ALLOC_NODE_MISMATCH);
> @@ -2278,6 +2301,7 @@ load_freelist:
>
>  new_slab:
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         if (c->partial) {
>                 page = c->page = c->partial;
>                 c->partial = page->next;
> @@ -2285,6 +2309,7 @@ new_slab:
>                 c->freelist = NULL;
>                 goto redo;
>         }
> +#endif
>
>         freelist = new_slab_objects(s, gfpflags, node, &c);
>
> @@ -2491,6 +2516,7 @@ static void __slab_free(struct kmem_cach
>                 new.inuse--;
>                 if ((!new.inuse || !prior) && !was_frozen) {
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>                         if (!kmem_cache_debug(s) && !prior)
>
>                                 /*
> @@ -2499,7 +2525,9 @@ static void __slab_free(struct kmem_cach
>                                  */
>                                 new.frozen = 1;
>
> -                       else { /* Needs to be taken off a list */
> +                       else
> +#endif
> +                               { /* Needs to be taken off a list */

This is the origin of the trivial whitespace nag.
:se list in vi shows:

-^I^I^Ielse { /* Needs to be taken off a list */$
+^I^I^Ielse$
+#endif$
+^I^I       ^I^I{ /* Needs to be taken off a list */$
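
i.e. if that new line used tab-only indentation (three tabs, the same depth
as the "else" above it, or deeper if you prefer), the nag goes away; in the
same notation, something like:

   +^I^I^I{ /* Needs to be taken off a list */$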

>
>                                 n = get_node(s, page_to_nid(page));
>                                 /*
> @@ -2521,6 +2549,7 @@ static void __slab_free(struct kmem_cach
>                 "__slab_free"));
>
>         if (likely(!n)) {
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>
>                 /*
>                  * If we just froze the page then put it onto the
> @@ -2530,6 +2559,7 @@ static void __slab_free(struct kmem_cach
>                         put_cpu_partial(s, page, 1);
>                         stat(s, CPU_PARTIAL_FREE);
>                 }
> +#endif
>                 /*
>                  * The list lock was not taken therefore no list
>                  * activity can be necessary.
> @@ -3036,7 +3066,7 @@ static int kmem_cache_open(struct kmem_c
>          * list to avoid pounding the page allocator excessively.
>          */
>         set_min_partial(s, ilog2(s->size) / 2);
> -
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         /*
>          * cpu_partial determined the maximum number of objects kept in the
>          * per cpu partial lists of a processor.
> @@ -3064,6 +3094,7 @@ static int kmem_cache_open(struct kmem_c
>                 s->cpu_partial = 13;
>         else
>                 s->cpu_partial = 30;
> +#endif
>
>  #ifdef CONFIG_NUMA
>         s->remote_node_defrag_ratio = 1000;
> @@ -4424,13 +4455,14 @@ static ssize_t show_slab_objects(struct
>                         total += x;
>                         nodes[node] += x;
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>                         page = ACCESS_ONCE(c->partial);
>                         if (page) {
>                                 x = page->pobjects;
>                                 total += x;
>                                 nodes[node] += x;
>                         }
> -
> +#endif
>                         per_cpu[node]++;
>                 }
>         }
> @@ -4583,6 +4615,7 @@ static ssize_t min_partial_store(struct
>  }
>  SLAB_ATTR(min_partial);
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>  static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
>  {
>         return sprintf(buf, "%u\n", s->cpu_partial);
> @@ -4605,6 +4638,7 @@ static ssize_t cpu_partial_store(struct
>         return length;
>  }
>  SLAB_ATTR(cpu_partial);
> +#endif
>
>  static ssize_t ctor_show(struct kmem_cache *s, char *buf)
>  {
> @@ -4644,6 +4678,7 @@ static ssize_t objects_partial_show(stru
>  }
>  SLAB_ATTR_RO(objects_partial);
>
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>  static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
>  {
>         int objects = 0;
> @@ -4674,6 +4709,7 @@ static ssize_t slabs_cpu_partial_show(st
>         return len + sprintf(buf + len, "\n");
>  }
>  SLAB_ATTR_RO(slabs_cpu_partial);
> +#endif
>
>  static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
>  {
> @@ -4997,11 +5033,13 @@ STAT_ATTR(DEACTIVATE_BYPASS, deactivate_
>  STAT_ATTR(ORDER_FALLBACK, order_fallback);
>  STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
>  STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>  STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
>  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
>  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
>  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
>  #endif
> +#endif
>
>  static struct attribute *slab_attrs[] = {
>         &slab_size_attr.attr,
> @@ -5009,7 +5047,9 @@ static struct attribute *slab_attrs[] =
>         &objs_per_slab_attr.attr,
>         &order_attr.attr,
>         &min_partial_attr.attr,
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         &cpu_partial_attr.attr,
> +#endif
>         &objects_attr.attr,
>         &objects_partial_attr.attr,
>         &partial_attr.attr,
> @@ -5022,7 +5062,9 @@ static struct attribute *slab_attrs[] =
>         &destroy_by_rcu_attr.attr,
>         &shrink_attr.attr,
>         &reserved_attr.attr,
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         &slabs_cpu_partial_attr.attr,
> +#endif
>  #ifdef CONFIG_SLUB_DEBUG
>         &total_objects_attr.attr,
>         &slabs_attr.attr,
> @@ -5064,11 +5106,13 @@ static struct attribute *slab_attrs[] =
>         &order_fallback_attr.attr,
>         &cmpxchg_double_fail_attr.attr,
>         &cmpxchg_double_cpu_fail_attr.attr,
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>         &cpu_partial_alloc_attr.attr,
>         &cpu_partial_free_attr.attr,
>         &cpu_partial_node_attr.attr,
>         &cpu_partial_drain_attr.attr,
>  #endif
> +#endif
>  #ifdef CONFIG_FAILSLAB
>         &failslab_attr.attr,
>  #endif
> Index: linux/init/Kconfig
> ===================================================================
> --- linux.orig/init/Kconfig     2013-03-28 12:14:26.958358688 -0500
> +++ linux/init/Kconfig  2013-03-28 12:19:46.275866639 -0500
> @@ -1514,6 +1514,14 @@ config SLOB
>
>  endchoice
>
> +config SLUB_CPU_PARTIAL
> +       depends on SLUB
> +       bool "SLUB per cpu partial cache"
> +       help
> +         Per cpu partial caches accellerate freeing of objects at the
> +         price of more indeterminism in the latency of the free.
> +         Typically one would choose no for a realtime system.

Is "batch" a better description than "accelerate" ?  Something like

     Per cpu partial caches allow batch freeing of objects to maximize
     throughput.  However, this can increase the length of time spent
     holding key locks, which can increase latency spikes with respect
     to responsiveness.  Select yes unless you are tuning for a realtime
     oriented system.

Also, I believe this will cause a behaviour change for people who
just run "make oldconfig", since there is no default line.  Meaning
that the feature used to be unconditionally on, but now I think it will
be off by default if people just mindlessly hold down the Enter key.

For RT, we'll want default N if RT_FULL (RT_BASE?) but for mainline,
I expect you'll want default Y in order to be consistent with previous
behaviour?

I've not built/booted it yet, but I'll follow up if I see anything else while
doing that.

Thanks for the patches; as I'd said to Steve, I was totally unaware that
this tuning knob even existed prior to this discussion.

Paul.
--

> +
>  config MMAP_ALLOW_UNINITIALIZED
>         bool "Allow mmapped anonymous memory to be uninitialized"
>         depends on EXPERT && !MMU
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
