[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <47B6B5E1.90705@cosmosbay.com>
Date: Sat, 16 Feb 2008 11:07:29 +0100
From: Eric Dumazet <dada1@...mosbay.com>
To: Andrew Morton <akpm@...ux-foundation.org>
CC: Herbert Xu <herbert@...dor.apana.org.au>,
"David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: include/linux/pcounter.h
Andrew Morton a écrit :
> - First up, why was this added at all? We have percpu_counter.h which
> has several years development invested in it. afaict it would suit the
> present applications of pcounters.
>
> If some deficiency in percpu_counters has been identified, is it
> possible to correct that deficiency rather than implementing a whole new
> counter type? That would be much better.
>
> - The comments in pcounter.h appear to indicate that there is a
> performance advantage (and we infer that the advantage is when the
> statically-allocated flavour of pcounters is used). When compared with
> percpu_counters the number of data-reference indirections is the same as
> with percpu_counters, so no advantage there.
>
> And, bizarrely, because of a quite inappropriate abstraction toy, both
> flavours of pcounters add an indirect function call which I believe is
> significantly more expensive than a plain old pointer indirection.
>
> So it's quite possible that DEFINE_PCOUNTER-style counters consume more
> memory and are slower than percpu_counters. They will surely be much
> slower on the read side. More below.
>
> If we really want to put some helper wrappers around
> DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a
> standalone thing and not attempt to graft the same interface onto two
> quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu)
>
> - The comment "2)" in pcounter.h (which overflows 80 cols and probably
> wasn't passed through checkpatch) indicates that some other
> implementation (presumably plain old DEFINE_PER_CPU) will use
> NR_CPUS*(32+sizeof(void *)) bytes of storage. But DEFINE_PCOUNTER will
> use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style
> pcounters and percpu_counters use
> num_possible_cpus()*sizeof(s32)+epsilon.
>
> - The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to
> several well-known compilation risks which I always forget. Should be
> converted to regular static inlines.
>
> - the comment in lib/pcounter.c needlessly exceeds 80 cols.
>
> - pcounter_dyn_add() will spew a
> use-of-smp_processor_id()-in-preemptible-code warning if used in places
> where one could reasonably use it. The interface could do with a bit of
> a rethink. Or at least some justification and documentation.
>
> - pcounter_getval() will be disastrously inefficient if
> num_possible_cpus() is much greater than num_online_cpus(). It should
> use for_each_online_cpu() (as does percpu_counter), and implement a CPU
> hotplug notifier (as does percpu_counter).
>
> It will remain grossly inefficient at high CPU counts, unlike
> percpu_counters, which solved this problem by doing a batched spill into
> a central counter at add/sub time.
>
> The danger here is that someone will actually use this interface in new
> code. Six months later (when it's too late to fix it) the big-NUMA guys
> come up and say "whaa, when our user does <this> it disabled interrupts
> for N milliseconds".
>
> - pcounter_getval() can return incorrect negative numbers. This can
> cause caller malfunctions in very rare situations because callers just
> don't expect the things which they're counting to go negative.
>
> We experienced this during the evolution of percpu_counter. See
> percpu_counter_sum_positive() and friends.
>
> - pcounter_alloc() should return -ENOMEM on allocation error, not "1".
>
> - pcounter_free() perhaps shouldn't test for (self->per_cpu_values !=
> NULL), because callers shouldn't be calling it if pcounter_alloc() failed
> (arguable).
>
>
>
> afaict the whole implementation can and should be removed and replaced with
> percpu_counters. I don't think there's much point in its ability to manage
> DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and
> can return negative values) and quite a bit of new code will need to be put
> in place to address that.
>
> But perhaps there are plans to evolve it into something further in the
> future, I don't know. But I would suggest that the people who have worked
> upon percpu_counters (principally Gautham, Peter Z, clameter and me) be
> involved in that work.
>
Andrew, pcounter is a temporary abstraction.
It is temporaty because it will vanish as soon as Christoph Clameter (or
somebody else) provides real cheap per cpu counter implementation.
At time I introduced it in network tree (locally, not meant to invade kernel
land and makes you unhappy :) ), the goals were :
Some counters (total sockets count) were a single integer, that were doing
ping-pong between cpus (SMP/NUMA). As they are basically lazy values (as we
dont really need to read their value), using plain atomic_t was overkill.
Using a plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *))
instead of num_possible_cpus()*4).
Using 'online' instead of 'possible' stuff is not really needed for a
temporary thing.
- We dont care of read sides. We want really fast write side. Real fast.
Read side is when you do a "cat /proc/net/sockstat".
That is ... once in a while...
Now when we allocate a new socket, code to increment the "socket count" is :
c03a74a8 <tcp_pcounter_add>:
c03a74a8: b8 90 26 5f c0 mov $0xc05f2690,%eax
c03a74ad: 64 8b 0d 10 f1 5e c0 mov %fs:0xc05ef110,%ecx
c03a74b4: 01 14 01 add %edx,(%ecx,%eax,1)
c03a74b7: c3 ret
That is 4 instructions. I could be two in the future, thanks to current work
on fs/gs based percpu variables.
Current percpu_counters implementation is more expensive :
c021467b <__percpu_counter_add>:
c021467b: 55 push %ebp
c021467c: 57 push %edi
c021467d: 89 c7 mov %eax,%edi
c021467f: 56 push %esi
c0214680: 53 push %ebx
c0214681: 83 ec 04 sub $0x4,%esp
c0214684: 8b 40 14 mov 0x14(%eax),%eax
c0214687: 64 8b 1d 08 f0 5e c0 mov %fs:0xc05ef008,%ebx
c021468e: 8b 6c 24 18 mov 0x18(%esp),%ebp
c0214692: f7 d0 not %eax
c0214694: 8b 1c 98 mov (%eax,%ebx,4),%ebx
c0214697: 89 1c 24 mov %ebx,(%esp)
c021469a: 8b 03 mov (%ebx),%eax
c021469c: 89 c3 mov %eax,%ebx
c021469e: 89 c6 mov %eax,%esi
c02146a0: c1 fe 1f sar $0x1f,%esi
c02146a3: 89 e8 mov %ebp,%eax
c02146a5: 01 d3 add %edx,%ebx
c02146a7: 11 ce adc %ecx,%esi
c02146a9: 99 cltd
c02146aa: 39 d6 cmp %edx,%esi
c02146ac: 7f 15 jg c02146c3
<__percpu_counter_add+0x48>
c02146ae: 7c 04 jl c02146b4
<__percpu_counter_add+0x39>
c02146b0: 39 eb cmp %ebp,%ebx
c02146b2: 73 0f jae c02146c3
<__percpu_counter_add+0x48>
c02146b4: f7 dd neg %ebp
c02146b6: 89 e8 mov %ebp,%eax
c02146b8: 99 cltd
c02146b9: 39 d6 cmp %edx,%esi
c02146bb: 7f 20 jg c02146dd
<__percpu_counter_add+0x62>
c02146bd: 7c 04 jl c02146c3
<__percpu_counter_add+0x48>
c02146bf: 39 eb cmp %ebp,%ebx
c02146c1: 77 1a ja c02146dd
<__percpu_counter_add+0x62>
c02146c3: 89 f8 mov %edi,%eax
c02146c5: e8 04 cc 1f 00 call c04112ce <_spin_lock>
c02146ca: 01 5f 04 add %ebx,0x4(%edi)
c02146cd: 11 77 08 adc %esi,0x8(%edi)
c02146d0: 8b 04 24 mov (%esp),%eax
c02146d3: c7 00 00 00 00 00 movl $0x0,(%eax)
c02146d9: fe 07 incb (%edi)
c02146db: eb 05 jmp c02146e2
<__percpu_counter_add+0x67>
c02146dd: 8b 04 24 mov (%esp),%eax
c02146e0: 89 18 mov %ebx,(%eax)
c02146e2: 58 pop %eax
c02146e3: 5b pop %ebx
c02146e4: 5e pop %esi
c02146e5: 5f pop %edi
c02146e6: 5d pop %ebp
c02146e7: c3 ret
Once it is better, just make pcounter vanish.
It is even clearly stated at the top of include/linux/pcounter.h
/*
* Using a dynamic percpu 'int' variable has a cost :
* 1) Extra dereference
* Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
* 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4
*
* This pcounter implementation is an abstraction to be able to use
* either a static or a dynamic per cpu variable.
* One dynamic per cpu variable gets a fast & cheap implementation, we can
* change pcounter implementation too.
*/
We all agree.
Thank you
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists