Message-ID: <20150616175231.427499ae@redhat.com>
Date: Tue, 16 Jun 2015 17:52:31 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Christoph Lameter <cl@...ux.com>
Cc: Joonsoo Kim <js1304@...il.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Linux Memory Management List <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux-Netdev <netdev@...r.kernel.org>,
Alexander Duyck <alexander.duyck@...il.com>, brouer@...hat.com
Subject: Re: [PATCH 7/7] slub: initial bulk free implementation
On Tue, 16 Jun 2015 10:10:25 -0500 (CDT)
Christoph Lameter <cl@...ux.com> wrote:
> On Tue, 16 Jun 2015, Joonsoo Kim wrote:
>
> > So, in your test, most of objects may come from one or two slabs and your
> > algorithm is well optimized for this case. But, is this workload normal case?
>
> It is normal if the objects were bulk allocated because SLUB ensures that
> all objects are first allocated from one page before moving to another.

Yes, exactly. Maybe SLAB is different? If so, we can handle that
in the SLAB-specific bulk implementation.

> > If most of objects comes from many different slabs, bulk free API does
> > enabling/disabling interrupt very much so I guess it work worse than
> > just calling __kmem_cache_free_bulk(). Could you test this case?
>
> In case of SLAB this would be an issue since the queueing mechanism
> destroys spatial locality. This is much less an issue for SLUB.

I think Kim is worried about the cost of the enable/disable calls when
the slowpath gets called. That is not a problem, because the cost of
local_irq_{disable,enable} is very low (7 cycles in total).

It is very important that everybody realizes that the save+restore
variant is very expensive. This is key:

CPU: i7-4790K CPU @ 4.00GHz
 * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
 * local_irq_{save,restore}  : 37 cycles(tsc) - 9.443 ns

Even if EVERY object needs to call the slowpath/__slab_free(), it will
be faster than calling the fallback, because I've demonstrated that the
this_cpu_cmpxchg_double() call costs 9 cycles, which is more than the
7 cycles for local_irq_{disable,enable}.
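
To spell out that arithmetic, here is a rough userspace sketch (not
kernel code). It only multiplies the cycle counts quoted above by the
number of objects. The function names are hypothetical, and the model
assumes that the work inside __slab_free() is the same on both paths,
so that only the synchronization overhead differs. Treat it as an
illustration of the comparison, not as a measurement.

```c
#include <stdint.h>

/* Cycle costs quoted in this mail (i7-4790K CPU @ 4.00GHz). */
enum {
	COST_IRQ_DISABLE_ENABLE = 7,	/* local_irq_{disable,enable} */
	COST_IRQ_SAVE_RESTORE   = 37,	/* local_irq_{save,restore}   */
	COST_CMPXCHG_DOUBLE     = 9,	/* this_cpu_cmpxchg_double()  */
};

/*
 * Hypothetical model: synchronization overhead for freeing nr objects
 * when each object pays per_object_cycles exactly once. Everything
 * else (e.g. the body of __slab_free()) is assumed equal and ignored.
 */
static uint64_t sync_overhead_cycles(uint64_t nr, uint64_t per_object_cycles)
{
	return nr * per_object_cycles;
}

/* Worst case for bulk free: every object toggles IRQs for the slowpath. */
static uint64_t bulk_worst_case_cycles(uint64_t nr)
{
	return sync_overhead_cycles(nr, COST_IRQ_DISABLE_ENABLE);
}

/* Assumed fallback cost: one this_cpu_cmpxchg_double() per object. */
static uint64_t fallback_cycles(uint64_t nr)
{
	return sync_overhead_cycles(nr, COST_CMPXCHG_DOUBLE);
}
```

Under these assumptions, the worst-case bulk path stays below the
fallback for any number of objects, while the same loop using
save+restore (37 cycles per object) would not.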
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
p.s. for comparison[1], a function call costs 5-6 cycles, and a function
pointer call costs 6-10 cycles, depending on the CPU.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c