Message-Id: <20110517084117.572970726@sli10-conroe.sh.intel.com>
Date: Tue, 17 May 2011 16:41:17 +0800
From: Shaohua Li <shaohua.li@...el.com>
To: linux-kernel@...r.kernel.org
Cc: akpm@...ux-foundation.org, tj@...nel.org, eric.dumazet@...il.com,
cl@...ux.com
Subject: [patch v3 0/3] percpu_counter: bug fix and enhancement
This patch set does two things.
1. Fix a bug on 32-bit systems. percpu_counter uses an s64 counter. Reading an
s64 without any locking isn't safe on a 32-bit system: the read is not atomic
there, so a reader can see a half-updated value.
2. Improve the scalability of __percpu_counter_add. In some cases, _add can
cause heavy lock contention (see patch 2 for detailed information and data).
The patches remove the contention and speed it up a bit. The last post
(http://marc.info/?l=linux-kernel&m=130259547913607&w=2) simply used
atomic64 for percpu_counter, but Tejun pointed out that this could cause
deviation in __percpu_counter_sum.
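To make the 32-bit problem in item 1 concrete, here is a small deterministic model of a torn read. It is only a sketch: `struct split64`, `split64_join` and `torn_read` are hypothetical names invented for illustration, and the interleaving that a second CPU would cause is simulated inline rather than with real threads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: an s64 stored as two 32-bit words, as a 32-bit CPU sees it. */
struct split64 {
    uint32_t lo;
    uint32_t hi;
};

/* Combine two 32-bit halves into one signed 64-bit value. */
static int64_t split64_join(uint32_t lo, uint32_t hi)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/*
 * Simulate a reader that loads the low word, is overtaken by a writer
 * storing new_val, and only then loads the high word.
 */
static int64_t torn_read(struct split64 *c, int64_t new_val)
{
    uint32_t lo = c->lo;                          /* reader: old low half */

    c->lo = (uint32_t)new_val;                    /* writer: new low half */
    c->hi = (uint32_t)((uint64_t)new_val >> 32);  /* writer: new high half */

    uint32_t hi = c->hi;                          /* reader: new high half */
    return split64_join(lo, hi);
}
```

With the counter at 0xFFFFFFFF and a writer storing 0x100000000, the simulated reader gets 0x1FFFFFFFF, a value the counter never held.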
In this implementation, we track the _sum and _add state. When _sum starts, _add
waits for _sum to finish. This sounds scary, since _add is the fast path. But
since _sum is called rarely, most of the time _add doesn't need to wait.
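The idea of _add waiting for a running _sum can be sketched in user space with C11 atomics. This is not the code from the patches: the struct, the field names, the fixed `NSLOTS` array standing in for per-CPU data, and the busy-wait loops are all assumptions made for illustration, and the sketch assumes only one _sum runs at a time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NSLOTS 4   /* hypothetical stand-in for per-CPU slots */

struct sketch_counter {
    atomic_int sum_active;       /* nonzero while a _sum walks the slots */
    atomic_int adds_in_flight;   /* number of _add calls touching a slot */
    _Atomic int64_t slot[NSLOTS];
};

static void sketch_add(struct sketch_counter *c, int slot, int64_t amount)
{
    for (;;) {
        atomic_fetch_add(&c->adds_in_flight, 1);
        if (!atomic_load(&c->sum_active))
            break;                       /* common case: no _sum running */
        /* A _sum is running: back off and wait for it to finish. */
        atomic_fetch_sub(&c->adds_in_flight, 1);
        while (atomic_load(&c->sum_active))
            ;                            /* spin */
    }
    atomic_fetch_add(&c->slot[slot], amount);
    atomic_fetch_sub(&c->adds_in_flight, 1);
}

/* Assumes callers never run two sums concurrently. */
static int64_t sketch_sum(struct sketch_counter *c)
{
    int64_t total = 0;

    atomic_store(&c->sum_active, 1);     /* new _add calls now wait */
    while (atomic_load(&c->adds_in_flight))
        ;                                /* let in-progress _add calls drain */
    for (int i = 0; i < NSLOTS; i++)
        total += atomic_load(&c->slot[i]);
    atomic_store(&c->sum_active, 0);     /* release waiting _add calls */
    return total;
}
```

When no _sum is running, the extra cost for _add in this sketch is one increment, one load and one decrement, which is why a rarely called _sum keeps the wait off the common path.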
patch 1 fixes the s64 read bug on 32-bit systems for UP.
patches 2 and 3 fix the s64 read bug on 32-bit systems for MP, and also improve
the scalability of __percpu_counter_add.
I did some benchmarks with the patches applied:
Test1:
24 CPUs do:
while {
	mmap(32M);
	munmap(32M);
}
Each CPU runs the loop about 7x faster.
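For anyone who wants to reproduce something like Test1, the loop body could look like the function below. It is a sketch under assumptions: the protection and mapping flags, the anonymous mapping, and the helper name `mmap_munmap_loop` are not stated in the original test, and pinning one copy to each of the 24 CPUs is left to the caller.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define REGION_SIZE (32UL * 1024 * 1024)   /* 32M, as in Test1 */

/*
 * Map and unmap a 32M anonymous region `iterations` times.
 * Returns the number of iterations that completed successfully.
 */
static size_t mmap_munmap_loop(size_t iterations)
{
    size_t done = 0;

    for (size_t i = 0; i < iterations; i++) {
        /* Flags are an assumption; the original test does not list them. */
        void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            break;
        if (munmap(p, REGION_SIZE) != 0)
            break;
        done++;
    }
    return done;
}
```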
Test2:
One CPU does:
while {
	__percpu_counter_add(+count)
	__percpu_counter_add(-count)
}
The loop runs 10000000 times.
In the _add fast path (no lock held):
before my patch:
real 0m0.133s
user 0m0.000s
sys 0m0.124s
after:
real 0m0.129s
user 0m0.000s
sys 0m0.120s
The difference is within run-to-run variation.
In the _add slow path (lock held):
before my patch:
real 0m0.374s
user 0m0.000s
sys 0m0.372s
after:
real 0m0.245s
user 0m0.000s
sys 0m0.020s
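The fast and slow paths timed in Test2 can be pictured with a single-CPU model of the batching done by __percpu_counter_add, as I understand the code before these patches: a small delta stays in the per-CPU counter, and a delta that reaches the batch size is folded into the shared count under the lock. The struct and function names below are hypothetical, and the lock is only marked by a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-CPU model; names are illustrative, not the kernel's. */
struct model_counter {
    int64_t global;       /* shared count, lock-protected in the kernel */
    int32_t local;        /* this CPU's private delta */
    unsigned slow_hits;   /* how many times the slow path ran */
};

/* Returns 1 if the slow (locking) path was taken, 0 for the fast path. */
static int model_add(struct model_counter *c, int64_t amount, int32_t batch)
{
    int64_t count = c->local + amount;

    if (count >= batch || count <= -batch) {
        /* Slow path: the kernel would hold the counter's spinlock here. */
        c->global += count;
        c->local = 0;
        c->slow_hits++;
        return 1;
    }
    c->local = (int32_t)count;   /* fast path: local update, no lock */
    return 0;
}
```

With a batch of 32, adding 1 stays on the fast path, while adding 100 crosses the batch and takes the slow path, so the choice of `count` relative to the batch decides which path a Test2-style loop exercises.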
Test3:
One CPU runs percpu_counter_sum, 23 CPUs run percpu_counter_add. In the _add
fast path (lock not held), _sum runs a little slower (about 20% slower).
In the _add slow path (lock held), _sum runs much faster (about 9x faster).
So overall my patches make the percpu_counter API faster. The only exception
is that _sum is a little slower in one case, but _sum is called rarely, so the
20% slowdown doesn't matter.
Thanks,
Shaohua