linux-kernel - Re: [patch] percpu_counter: scalability works

Message-ID: <1305301151.3866.39.camel@edumazet-laptop>
Date:	Fri, 13 May 2011 17:39:11 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Shaohua Li <shaohua.li@...el.com>
Cc:	Tejun Heo <tj@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"cl@...ux.com" <cl@...ux.com>,
	"npiggin@...nel.dk" <npiggin@...nel.dk>
Subject: Re: [patch] percpu_counter: scalability works

On Friday, 13 May 2011 at 16:51 +0200, Eric Dumazet wrote:

> Here is the patch I cooked up (on top of linux-2.6)
> 
> This solves the problem quite well for me.
> 
> The idea is:
> 
> Consider _sum() to be the slow path. It is still serialized by a
> spinlock.
> 
> Add fbc->sequence, so that _add() can detect that a _sum() is in flight
> and add directly to a new atomic64_t field I named "fbc->slowcount"
> (without touching its percpu s32 variable, so that _sum() can get an
> accurate percpu_counter value).
> 
> The low-order bit of 'sequence' signals that a _sum() is in flight,
> while _add() threads that overflow their percpu s32 variable do a
> sequence += 2, so that _sum() can detect that at least one cpu modified
> fbc->count and reset its s32 variable. _sum() can then restart its loop,
> but since sequence still has its low-order bit set, we are guaranteed
> that the _sum() loop won't be restarted ad infinitum.
> 
> Notes: I disabled IRQs in _add() to reduce the window, making _add() as
> fast as possible to avoid extra _sum() loops, but it's not strictly
> necessary. We can discuss this point, since _sum() is the slow path :)
> 
> _sum() is accurate and no longer blocks _add(). It does slow _add()
> down a bit, of course, since all _add() calls will touch fbc->slowcount.
> 
> _sum() is about the same speed as before in my tests.
> 
> On my 8-cpu machine (Intel(R) Xeon(R) CPU E5450 @ 3.00GHz) with a 32-bit
> kernel, the following bench:
> 	loop (10000000 times) {
> 		p = mmap(128M, ANONYMOUS);
> 		munmap(p, 128M);
> 	}
> run on all 8 cpus:
> 
> Before patch :
> real	3m22.759s
> user	0m6.353s
> sys	26m28.919s
> 
> After patch :
> real	0m23.420s
> user	0m6.332s
> sys	2m44.561s
> 
> Quite good results considering atomic64_add() uses two "lock cmpxchg8b"
> on x86_32 :
> 
>     33.03%        mmap_test  [kernel.kallsyms]       [k] unmap_vmas
>     12.99%        mmap_test  [kernel.kallsyms]       [k] atomic64_add_return_cx8
>      5.62%        mmap_test  [kernel.kallsyms]       [k] free_pgd_range
>      3.07%        mmap_test  [kernel.kallsyms]       [k] sysenter_past_esp
>      2.48%        mmap_test  [kernel.kallsyms]       [k] memcpy
>      2.24%        mmap_test  [kernel.kallsyms]       [k] perf_event_mmap
>      2.21%        mmap_test  [kernel.kallsyms]       [k] _raw_spin_lock
>      2.02%        mmap_test  [vdso]                  [.] 0xffffe424
>      2.01%        mmap_test  [kernel.kallsyms]       [k] perf_event_mmap_output
>      1.38%        mmap_test  [kernel.kallsyms]       [k] vma_adjust
>      1.24%        mmap_test  [kernel.kallsyms]       [k] sched_clock_local
>      1.23%             perf  [kernel.kallsyms]       [k] __copy_from_user_ll_nozero
>      1.07%        mmap_test  [kernel.kallsyms]       [k] down_write
> 
> 
> If only one cpu runs the program :
> 
> real	0m16.685s
> user	0m0.771s
> sys	0m15.815s
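
To make the V1 scheme quoted above concrete, here is a minimal userspace
sketch in C11. It is not the patch itself: the struct layout, the names
pcp_counter_*, the fixed NR_CPUS, the explicit "cpu" argument and the
pthread mutex (standing in for the kernel spinlock) are all assumptions
made for illustration, and the sketch does not try to close every race
window a real kernel implementation has to consider.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 8                 /* hypothetical fixed cpu count */

struct pcp_counter {
	pthread_mutex_t lock;     /* serializes _sum(), the slow path */
	atomic_uint sequence;     /* bit 0: a sum is in flight; += 2 per fold */
	atomic_llong count;       /* folded global value */
	atomic_llong slowcount;   /* receives adds while a sum is in flight */
	atomic_int pcp[NR_CPUS];  /* stand-in for the percpu s32 variables */
};

static void pcp_counter_init(struct pcp_counter *c, long long v)
{
	pthread_mutex_init(&c->lock, NULL);
	atomic_init(&c->sequence, 0);
	atomic_init(&c->count, v);
	atomic_init(&c->slowcount, 0);
	for (int i = 0; i < NR_CPUS; i++)
		atomic_init(&c->pcp[i], 0);
}

static void pcp_counter_add(struct pcp_counter *c, int cpu,
			    int32_t amount, int32_t batch)
{
	/* A sum is in flight: leave the percpu slot alone. */
	if (atomic_load(&c->sequence) & 1) {
		atomic_fetch_add(&c->slowcount, amount);
		return;
	}
	int64_t v = (int64_t)atomic_load_explicit(&c->pcp[cpu],
						  memory_order_relaxed) + amount;
	if (v >= batch || v <= -batch) {
		/* Fold into the global count and tell any summer about it. */
		atomic_fetch_add(&c->count, v);
		atomic_store_explicit(&c->pcp[cpu], 0, memory_order_relaxed);
		atomic_fetch_add(&c->sequence, 2);
	} else {
		atomic_store_explicit(&c->pcp[cpu], (int)v, memory_order_relaxed);
	}
}

static long long pcp_counter_sum(struct pcp_counter *c)
{
	long long ret;
	unsigned int seq;

	pthread_mutex_lock(&c->lock);
	atomic_fetch_add(&c->sequence, 1);        /* even -> odd */
	do {
		seq = atomic_load(&c->sequence);
		ret = atomic_load(&c->count);
		for (int i = 0; i < NR_CPUS; i++)
			ret += atomic_load_explicit(&c->pcp[i],
						    memory_order_relaxed);
		/* Restart if some cpu folded its slot meanwhile. */
	} while (atomic_load(&c->sequence) != seq);
	ret += atomic_load(&c->slowcount);
	atomic_fetch_add(&c->sequence, 1);        /* odd -> even */
	pthread_mutex_unlock(&c->lock);
	return ret;
}
```

In this sketch slowcount is simply kept as a permanent third component
of the total (count + percpu slots + slowcount); whether and when it
gets folded back into count is left open here.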

Thinking a bit more, we could allow several _sum() in flight: we would
need an atomic_t counting the _sum() calls in flight instead of a single
bit, and we could remove the spinlock.

This would allow using a separate integer for
add_did_change_fbc_count and remove one atomic operation in _add() { the
atomic_add(2, &fbc->sequence); of my previous patch }


Another idea would be to move fbc->count / fbc->slowcount out of line,
to keep "struct percpu_counter" read-mostly.
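
As a rough illustration of the out-of-line idea, the following
userspace sketch puts the frequently written fields in their own
cache-line-aligned allocation and leaves only pointers in the main
struct. The names (pcp_hot, pcp_counter_ro, CACHE_LINE) and the 64-byte
line size are assumptions, not something taken from a posted patch.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size */

/* Write-hot fields, kept on a cache line of their own. */
struct pcp_hot {
	alignas(CACHE_LINE) atomic_llong count;
	atomic_llong slowcount;
};

/* Read-mostly part: only pointers, written once at init time. */
struct pcp_counter_ro {
	struct pcp_hot *hot;
	int32_t *counters;      /* stand-in for the percpu s32 array */
};

static int pcp_counter_ro_init(struct pcp_counter_ro *c, size_t nr_cpus)
{
	c->hot = aligned_alloc(CACHE_LINE, sizeof(*c->hot));
	if (!c->hot)
		return -1;
	c->counters = calloc(nr_cpus, sizeof(*c->counters));
	if (!c->counters) {
		free(c->hot);
		c->hot = NULL;
		return -1;
	}
	atomic_init(&c->hot->count, 0);
	atomic_init(&c->hot->slowcount, 0);
	return 0;
}

static void pcp_counter_ro_destroy(struct pcp_counter_ro *c)
{
	free(c->counters);
	free(c->hot);
	c->counters = NULL;
	c->hot = NULL;
}
```

With this layout, updates to count or slowcount dirty only the separate
pcp_hot line, so readers of the pointers in the main struct do not see
their cache line bounce.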

I'll send a V2 with this updated scheme.


By the way, I ran the bench on a more recent 2x4x2 machine with a 64-bit
kernel (HP G6 : Intel(R) Xeon(R) CPU E5540  @ 2.53GHz)

1) One process started (no contention) :

Before :
real	0m21.372s
user	0m0.680s
sys	0m20.670s

After V1 patch :
real	0m19.941s
user	0m0.750s
sys	0m19.170s


2) 16 processes started

Before patch:
real	2m14.509s
user	0m13.780s
sys	35m24.170s

After V1 patch :
real	0m48.617s
user	0m16.980s
sys	12m9.400s
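
For anyone wanting to reproduce numbers like these, the quoted
mmap/munmap loop could be written roughly as below. The original only
says "mmap(128M, ANONYMOUS)", so the protection bits, MAP_PRIVATE and
the helper name mmap_bench are assumptions; the real test program may
have differed.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map and unmap `len` bytes of anonymous memory `iters` times.
 * Returns 0 on success, -1 on the first mmap() or munmap() failure.
 */
static int mmap_bench(unsigned long iters, size_t len)
{
	for (unsigned long i = 0; i < iters; i++) {
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return -1;
		if (munmap(p, len) != 0)
			return -1;
	}
	return 0;
}
```

Calling mmap_bench(10000000UL, (size_t)128 << 20) from main() matches
the loop count and mapping size quoted above; starting one such process
per cpu and timing them with time(1) gives the contended case.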


