linux-kernel - Re: [PATCH] mm/slub: Detach node lock from counting free objects

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200218205312.GA3156@carbon>
Date:   Tue, 18 Feb 2020 12:53:12 -0800
From:   Roman Gushchin <guro@...com>
To:     Wen Yang <wenyang@...ux.alibaba.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Christoph Lameter <cl@...ux.com>,
        Pekka Enberg <penberg@...nel.org>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Xunlei Pang <xlpang@...ux.alibaba.com>, <linux-mm@...ck.org>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects

On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
> 
> 
> On 2020/2/13 6:52 上午, Andrew Morton wrote:
> > On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@...ux.alibaba.com> wrote:
> > 
> > > The lock, protecting the node partial list, is taken when couting the free
> > > objects resident in that list. It introduces locking contention when the
> > > page(s) is moved between CPU and node partial lists in allocation path
> > > on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> > > allocation on another CPU for a while, 200ms in extreme cases. If the
> > > slab object is to carry network packet, targeting the far-end disk array,
> > > it causes block IO jitter issue.
> > > 
> > > This fixes the block IO jitter issue by caching the total inuse objects in
> > > the node in advance. The value is retrieved without taking the node partial
> > > list lock on reading "/proc/slabinfo".
> > > 
> > > ...
> > > 
> > > @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
> > >   static void discard_slab(struct kmem_cache *s, struct page *page)
> > >   {
> > > -	dec_slabs_node(s, page_to_nid(page), page->objects);
> > > +	int inuse = page->objects;
> > > +
> > > +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> > 
> > Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> > 
> > If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> > and save a function argument.
> > 
> > If yes then why?
> > 
> 
> Thanks for your comments.
> We are happy to improve this patch based on your suggestions.
> 
> 
> When the user reads /proc/slabinfo, in order to obtain the active_objs
> information, the kernel traverses all slabs and executes the following code
> snippet:
> static unsigned long count_partial(struct kmem_cache_node *n,
>                                         int (*get_count)(struct page *))
> {
>         unsigned long flags;
>         unsigned long x = 0;
>         struct page *page;
> 
>         spin_lock_irqsave(&n->list_lock, flags);
>         list_for_each_entry(page, &n->partial, slab_list)
>                 x += get_count(page);
>         spin_unlock_irqrestore(&n->list_lock, flags);
>         return x;
> }
> 
> It may cause performance issues.
> 
> Christoph suggested "you could cache the value in the userspace application?
> Why is this value read continually?", But reading the /proc/slabinfo is
> initiated by the user program. As a cloud provider, we cannot control user
> behavior. If a user program inadvertently executes cat /proc/slabinfo, it
> may affect other user programs.
> 
> As Christoph said: "The count is not needed for any operations. Just for the
> slabinfo output. The value has no operational value for the allocator
> itself. So why use extra logic to track it in potentially performance
> critical paths?"
> 
> In this way, could we show the approximate value of active_objs in the
> /proc/slabinfo?
> 
> Based on the following information:
> In the discard_slab() function, page->inuse is equal to page->total_objects;
> In the allocate_slab() function, page->inuse is also equal to
> page->total_objects (with one exception: for kmem_cache_node, page-> inuse
> equals 1);
> page->inuse will only change continuously when the obj is constantly
> allocated or released. (This should be the performance critical path
> emphasized by Christoph)
> 
> When users query the global slabinfo information, we may use total_objects
> to approximate active_objs.

Well, from one point of view, it makes no sense, because the ratio between
these two numbers is very meaningful: it's the slab utilization rate.

On the other side, with enabled per-cpu partial lists active_objs has
nothing to do with the reality anyway, so I agree with you, calling
count_partial() is almost useless.

That said, I wonder if the right thing to do is something like the patch below?

Thanks!

Roman

--

diff --git a/mm/slub.c b/mm/slub.c
index 1d644143f93e..ba0505e75ecc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 static unsigned long count_partial(struct kmem_cache_node *n,
                                        int (*get_count)(struct page *))
 {
-       unsigned long flags;
        unsigned long x = 0;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+       unsigned long flags;
        struct page *page;
 
        spin_lock_irqsave(&n->list_lock, flags);
        list_for_each_entry(page, &n->partial, slab_list)
                x += get_count(page);
        spin_unlock_irqrestore(&n->list_lock, flags);
+#endif
        return x;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */