linux-kernel - [patch v3 0/3] percpu_counter: bug fix and enhancement

Date:	Tue, 17 May 2011 16:41:17 +0800
From:	Shaohua Li <shaohua.li@...el.com>
To:	linux-kernel@...r.kernel.org
Cc:	akpm@...ux-foundation.org, tj@...nel.org, eric.dumazet@...il.com,
	cl@...ux.com
Subject: [patch v3 0/3] percpu_counter: bug fix and enhancement

This patch set does two things:
1. Fix a bug on 32-bit systems. percpu_counter uses an s64 counter. Reading an
s64 without any locking isn't safe on a 32-bit system and can cause bad side
effects.
2. Improve the scalability of __percpu_counter_add. In some cases, _add can
cause heavy lock contention (see patch 2 for detailed information and data).
The patches remove the contention and speed _add up a bit. The last post
(http://marc.info/?l=linux-kernel&m=130259547913607&w=2) simply used
atomic64 for percpu_counter, but Tejun pointed out that this could cause
deviation in __percpu_counter_sum.
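
To illustrate item 1: on a 32-bit CPU a plain 64-bit load is typically done
as two 32-bit loads, so a write that lands between them yields a value that
was never stored. The sketch below simulates that interleaving
deterministically in a single thread. struct split64 and the helper names are
made up for illustration and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Models an s64 kept as two 32-bit words, as a 32-bit CPU would access it. */
struct split64 {
	uint32_t lo;
	uint32_t hi;
};

/* Store both halves of a 64-bit value. */
static void split_store(struct split64 *s, int64_t v)
{
	assert(s);
	s->lo = (uint32_t)(uint64_t)v;
	s->hi = (uint32_t)((uint64_t)v >> 32);
}

/* Recombine two halves into a 64-bit value. */
static int64_t split_join(uint32_t lo, uint32_t hi)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/*
 * Simulated unlocked reader: it loads the low word, a writer then updates
 * the whole value, and only afterwards does the reader load the high word.
 */
static int64_t torn_read(struct split64 *s, int64_t new_value)
{
	uint32_t lo, hi;

	assert(s);
	lo = s->lo;                  /* first half, old value */
	split_store(s, new_value);   /* writer runs in between */
	hi = s->hi;                  /* second half, new value */
	return split_join(lo, hi);
}
```

Starting from 0xFFFFFFFF and storing 0x100000000 in between, torn_read()
returns 0x1FFFFFFFF, which is neither the old nor the new value.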

In this implementation, we track _sum and _add state. When _sum starts, _add
waits for _sum to finish. This sounds scary, since _add is the fast path. But
since _sum is rarely called, most of the time _add doesn't need to wait.
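
The waiting scheme described above could look roughly like the following
userspace sketch. It is not the code from patches 2 and 3: struct
counter_sketch, sketch_add, sketch_sum, NR_SLOTS and BATCH are all
hypothetical names, C11 atomics stand in for the kernel primitives, each slot
is assumed to have a single writer, and only one _sum is assumed to run at a
time.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */
#define BATCH    32	/* hypothetical batch threshold */

struct counter_sketch {
	atomic_llong count;		/* global part of the counter */
	atomic_int   sum_active;	/* nonzero while a _sum is running */
	atomic_int   add_active;	/* number of _add callers folding */
	atomic_llong local[NR_SLOTS];	/* stand-in for per-CPU deltas */
};

static void sketch_init(struct counter_sketch *c)
{
	assert(c);
	atomic_init(&c->count, 0);
	atomic_init(&c->sum_active, 0);
	atomic_init(&c->add_active, 0);
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_init(&c->local[i], 0);
}

static void sketch_add(struct counter_sketch *c, int slot, long long amount)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	long long v = atomic_load(&c->local[slot]) + amount;

	if (v < BATCH && v > -BATCH) {
		/* Fast path: stay in the local slot, no waiting at all. */
		atomic_store(&c->local[slot], v);
		return;
	}

	/* Slow path: announce the fold, back off while a _sum is running. */
	for (;;) {
		atomic_fetch_add(&c->add_active, 1);
		if (!atomic_load(&c->sum_active))
			break;
		atomic_fetch_sub(&c->add_active, 1);
		while (atomic_load(&c->sum_active))
			;	/* wait for _sum to finish */
	}
	/* Move the local delta into the global count. */
	atomic_fetch_add(&c->count, v);
	atomic_store(&c->local[slot], 0);
	atomic_fetch_sub(&c->add_active, 1);
}

static long long sketch_sum(struct counter_sketch *c)
{
	assert(c);
	atomic_store(&c->sum_active, 1);
	while (atomic_load(&c->add_active))
		;	/* let in-flight folds drain first */

	long long total = atomic_load(&c->count);
	for (int i = 0; i < NR_SLOTS; i++)
		total += atomic_load(&c->local[i]);

	atomic_store(&c->sum_active, 0);
	return total;
}
```

In this sketch _sum never observes a fold half done (delta already added to
count but not yet cleared from the slot), which is one way the deviation
mentioned above could otherwise arise. Fast-path adds are still not
serialized against _sum.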

patch 1 fixes the s64 read bug on 32-bit systems for UP.
patches 2 and 3 fix the s64 read bug on 32-bit systems for MP, and also
improve the scalability of __percpu_counter_add.

I did some benchmarks with the patches applied:
Test1:
24 CPUs each do:
while (1) {
	mmap(32M);
	munmap(32M);
}
Each CPU runs the loop about 7x faster.
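
For anyone who wants to try something similar, the Test1 loop might be
written in userspace roughly as below. This is a sketch, not the program used
for the numbers above: the mapping flags (private, anonymous, read/write),
the bounded iteration count and the name mmap_munmap_loop are assumptions.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define MAP_LEN (32UL << 20)	/* 32M, as in Test1 */

/*
 * Run up to `iterations` mmap/munmap pairs of MAP_LEN bytes and return how
 * many pairs completed. The memory is never touched.
 */
static size_t mmap_munmap_loop(size_t iterations)
{
	size_t done = 0;

	assert(iterations > 0);
	for (size_t i = 0; i < iterations; i++) {
		void *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;
		if (munmap(p, MAP_LEN) != 0)
			break;
		done++;
	}
	return done;
}
```

To approximate the test, one instance would be run per CPU (for example one
process pinned to each CPU) and the time per iteration compared with and
without the patches.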

Test2:
One CPU does:
for (i = 0; i < 10000000; i++) {
	__percpu_counter_add(+count);
	__percpu_counter_add(-count);
}
The loop runs 10000000 times.
In the _add fast path (no lock held):
before my patch:
real    0m0.133s
user    0m0.000s
sys     0m0.124s
after:
real    0m0.129s
user    0m0.000s
sys     0m0.120s
The difference is within run-to-run variation.

In the _add slow path (lock held):
before my patch:
real    0m0.374s
user    0m0.000s
sys     0m0.372s
after:
real    0m0.245s
user    0m0.000s
sys     0m0.020s

Test3:
One CPU runs percpu_counter_sum while 23 CPUs run percpu_counter_add. In the
_add fast path (lock not held), _sum runs a little slower (about 20% slower).
In the _add slow path (lock held), _sum runs much faster (about 9x faster).

So overall my patches make the percpu_counter API faster. The only exception
is that _sum is a little slower in one case, but since _sum is rarely called,
the 20% slowdown doesn't matter.

Thanks,
Shaohua