Date:	Fri, 14 Jun 2013 15:31:25 -0700
From:	Tejun Heo <tj@...nel.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	lizefan@...wei.com, containers@...ts.linux-foundation.org,
	cgroups@...r.kernel.org, koverstreet@...gle.com,
	linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
	Mike Snitzer <snitzer@...hat.com>,
	Vivek Goyal <vgoyal@...hat.com>,
	"Alasdair G. Kergon" <agk@...hat.com>,
	Jens Axboe <axboe@...nel.dk>,
	Mikulas Patocka <mpatocka@...hat.com>,
	Glauber Costa <glommer@...il.com>
Subject: Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states

Hello, Michal.

On Fri, Jun 14, 2013 at 03:20:26PM +0200, Michal Hocko wrote:
> I have no objections to change css reference counting scheme if the
> guarantees we used to have are still valid. I am just missing some
> comparisons. Do you have any numbers that would show benefits clearly?

Mikulas' high scalability dm test case on top of ramdisk was affected
severely when css refcnting was added to track the original issuer's
cgroup context.  That probably is one of the more severe cases.

> You are mentioning that especially controllers that are strongly per-cpu
> oriented will see the biggest improvements. What about others?
> A single atomic_add resp. atomic_dec_return is much less heavy than the

Even with preemption enabled, the percpu ref get/put will be under ten
instructions which touch two memory areas - the preemption counter
which is usually very hot anyway and the percpu refcnt itself.  It
shouldn't be much slower than the atomic ops.  If the kernel has
preemption disabled, percpu_ref is actually likely to be cheaper even
on single CPU.
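
To illustrate why the fast path can be that short, here is a minimal
userspace sketch of the per-cpu counting idea.  It is not the kernel's
percpu_ref code: the names (sketch_ref, NR_SLOTS and so on) are
hypothetical, the "slot" argument stands in for the current CPU, the
64-byte padding is an assumed cacheline size, and the preemption
handling mentioned above is left out entirely.

```c
#include <assert.h>

#define NR_SLOTS 4   /* hypothetical stand-in for the number of CPUs */

/* One counter per slot, padded so that slots do not share a cacheline
 * (64 bytes is an assumption, not a value taken from the patch). */
struct slot_count {
    long count;
    char pad[64 - sizeof(long)];
};

struct sketch_ref {
    struct slot_count slots[NR_SLOTS];
};

/* Fast path: touch only the caller's own slot, no atomic instruction. */
static void sketch_ref_get(struct sketch_ref *ref, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    ref->slots[slot].count++;
}

/* A put may land on a different slot than the matching get, so a
 * single slot can go negative. */
static void sketch_ref_put(struct sketch_ref *ref, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    ref->slots[slot].count--;
}

/* Slow path: only the sum over all slots is meaningful. */
static long sketch_ref_total(const struct sketch_ref *ref)
{
    long sum = 0;

    for (int i = 0; i < NR_SLOTS; i++)
        sum += ref->slots[i].count;
    return sum;
}
```

Two gets on slots 0 and 1 followed by a put on slot 3 leave slot 3 at
-1 while sketch_ref_total() reports 1, which is why only the summed
value can be used to decide whether references remain.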

So, here are some numbers from the attached test program.  The test is
very simple - inc ref, copy N bytes into per-cpu buf, dec ref - and
see how many times it can do that in a given amount of time - 15s.  Both
single CPU and all CPUs scenarios are tested.  The test is run inside
qemu on my laptop - mobile i7 2 core / 4 threads.  Yeah, I know.  I'll
run it on a proper test machine later today.
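
The shape of the measurement loop can be sketched in userspace roughly
as follows.  This is not the attached test-pcpuref.c; it is a
single-threaded approximation that assumes a C11 atomic_long in place
of atomic_t, a plain static buffer in place of the per-cpu buffer, and
clock() (CPU time) in place of whatever timing the real test uses.  All
names here (run_for, one_iteration, BATCH) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 512   /* largest copy size used in the tables */
#define BATCH    1024  /* iterations between clock checks (arbitrary) */

static atomic_long refcnt;      /* stand-in for the atomic_t refcount */
static char src_buf[BUF_SIZE];
static char dst_buf[BUF_SIZE];  /* stand-in for the per-cpu buffer */

/* One iteration: inc ref, copy copy_size bytes, dec ref. */
static void one_iteration(size_t copy_size)
{
    atomic_fetch_add(&refcnt, 1);
    memcpy(dst_buf, src_buf, copy_size);
    atomic_fetch_sub(&refcnt, 1);
}

/* Count how many iterations complete in roughly the given CPU time. */
static unsigned long run_for(double seconds, size_t copy_size)
{
    assert(copy_size <= BUF_SIZE);

    unsigned long iterations = 0;
    clock_t deadline = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);

    do {
        for (int i = 0; i < BATCH; i++)
            one_iteration(copy_size);
        iterations += BATCH;
    } while (clock() < deadline);

    return iterations;
}
```

Calling run_for(15.0, 64) would correspond to one cell of the atomic_t
column.  Since nothing reads dst_buf, an optimizing compiler may drop
the copy, so absolute numbers from this sketch are not comparable to
the ones reported here.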

Single CPU case.  Preemption enabled.  This is the best scenario for
atomic_t.  No cacheline bouncing at all.

    copy size      atomic_t    percpu_ref       diff

            0    1198217077    1747505555    +45.84%
           32     505504457     465714742     -7.87%
           64     511521639     470741887     -7.97%
          128     485319952     434920137    -10.38%
          256     421809359     384871731     -8.76%
          512     330527059     307587622     -6.94%

For some reason, percpu_ref wins if copy_size is zero.  I don't know
why that is.  The body isn't optimized out so it's still doing all the
refcnting.  Maybe the CPU doesn't have enough work to mask pipeline
bubbles from atomic ops?  In other cases, it's slower by around or
under 10% which isn't exactly noise but this is the worst possible
scenario.  Unless this is the only thing a pinned CPU is doing, it's
unlikely to be noticeable.
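
For reference, the diff column is consistent with the relative change
of the percpu_ref count against the atomic_t count.  A small sketch of
that arithmetic, with a hypothetical helper name, assuming that is how
the column was derived:

```c
#include <assert.h>

/* Relative difference of percpu_ref against atomic_t, in percent.
 * Positive means percpu_ref completed more iterations. */
static double diff_percent(double atomic_count, double percpu_count)
{
    assert(atomic_count > 0);
    return 100.0 * (percpu_count - atomic_count) / atomic_count;
}
```

For example, diff_percent(505504457.0, 465714742.0) reproduces the
32-byte row of the single CPU table to two decimal places.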

Now doing the same thing on multiple CPUs.  Note that while this is
the best scenario for percpu_ref, the hardware the test is run on is
very favorable to atomic_t - it's just two cores on the same package
sharing the L3 cache, so cacheline ping-ponging is relatively cheap.

    copy size      atomic_t    percpu_ref        diff

            0     342659959    3794775739   +1007.45%
           32     401933493    1337286466    +232.71%
           64     385459752    1353978982    +251.26%
          128     401379622    1234780968    +207.63%
          256     401170676    1052511682    +162.36%
          512     387796881     794101245    +104.77%

Even on this machine, the difference is huge.  If the refcnt is used
from different CPUs with any frequency, percpu_ref will destroy
atomic_t.  Also note that percpu_ref will scale perfectly as the
number of CPUs increases while atomic_t will get worse.

I'll play with it a bit more on an actual machine and post more
results.  Test program attached.

Thanks.

-- 
tejun

View attachment "test-pcpuref.c" of type "text/plain" (2952 bytes)
