linux-kernel - Re: [patch V3] percpu_counter: scalability works

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Tue, 17 May 2011 13:22:48 +0800
From:	Shaohua Li <shaohua.li@...el.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	Tejun Heo <tj@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"cl@...ux.com" <cl@...ux.com>,
	"npiggin@...nel.dk" <npiggin@...nel.dk>
Subject: Re: [patch V3] percpu_counter: scalability works

On Tue, 2011-05-17 at 12:56 +0800, Eric Dumazet wrote:
> Le mardi 17 mai 2011 à 08:55 +0800, Shaohua Li a écrit :
> 
> > I'm still interesting in improving percpu_counter itself. If we can
> > improve it, why we don't? My patch doesn't slow down anything for all
> > tests. Why didn't you ever look at it?
> > 
> 
> I did and said there were obvious problems in it.
> 
> 1) 4096 cpus : size of one percpu_counter is 16Kbytes.
> 
> After your last patch -> 20 kbytes for no good reason.
I don't know why you said there is no good reason. I posted a lot of
data which shows improvement, while you just ignore.

The size issue is completely pointless. If you have 4096 CPUs, how could
you worry about 16k bytes memory. Especially the extra memory makes the
API much faster.

> 2) Two separate alloc_percpu() -> two separate cache lines instead of
> one.
Might be in one cache line actually, but can be easily fixed if not
anyway. On the other hand, even touch two cache lines, it's still faster
than the original spinlock implementation, which I already posted data.

> But then, if one alloc_percpu() -> 32 kbytes per object.
the size issue is completely pointless

> 3) Focus on percpu_counter() implementation instead of making an
> analysis of callers.
> 
> I did a lot of rwlocks removal in network stack because they are not the
> right synchronization primitive in many cases. I did not optimize
> rwlocks. If rwlocks were even slower, I suspect other people would have
> help me to convert things faster.
My original issue is mmap, but I already declaimed several times we can
make percpu_counter better, why won't we?

> 4) There is a possible way to solve your deviation case : add at _add()
> beginning a short cut for crazy 'amount' values. Its a bit expensive on
> 32bit arches, so might be added in a new helper to let _add() be fast
> for normal and gentle users.

+		if (unlikely(cmpxchg(ptr, old, 0) != old))
> +			goto retry;
this doesn't change anything, you still have the deviation issue here

> +		atomic64_add(count, &fbc->count);

> if (unlikely(amount >= batch || amount <= -batch)) {
> 	atomic64(amount, &fbc->count);
> 	return;
> }
why we just handle this special case, my patch can make the whole part
faster without deviation

so you didn't point out any obvious problem with my patch actually. This
is good.

Thanks,
Shaohua

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/