linux-kernel - Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys

Date:	Fri, 14 Jun 2013 15:31:25 -0700
From:	Tejun Heo <tj@...nel.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	lizefan@...wei.com, containers@...ts.linux-foundation.org,
	cgroups@...r.kernel.org, koverstreet@...gle.com,
	linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
	Mike Snitzer <snitzer@...hat.com>,
	Vivek Goyal <vgoyal@...hat.com>,
	"Alasdair G. Kergon" <agk@...hat.com>,
	Jens Axboe <axboe@...nel.dk>,
	Mikulas Patocka <mpatocka@...hat.com>,
	Glauber Costa <glommer@...il.com>
Subject: Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states

Hello, Michal.

On Fri, Jun 14, 2013 at 03:20:26PM +0200, Michal Hocko wrote:
> I have no objections to change css reference counting scheme if the
> guarantees we used to have are still valid. I am just missing some
> comparisons. Do you have any numbers that would show benefits clearly?

Mikulas' high scalability dm test case on top of ramdisk was affected
severely when css refcnting was added to track the original issuer's
cgroup context.  That probably is one of the more severe cases.

> You are mentioning that especially controllers that are strongly per-cpu
> oriented will see the biggest improvements. What about others?
> A single atomic_add resp. atomic_dec_return is much less heavy than the

Even with preemption enabled, the percpu ref get/put will be under ten
instructions which touch two memory areas - the preemption counter
which is usually very hot anyway and the percpu refcnt itself.  It
shouldn't be much slower than the atomic ops.  If the kernel has
preemption disabled, percpu_ref is actually likely to be cheaper even
on a single CPU.
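To make the "two memory areas" point concrete, here is a minimal userspace sketch of the per-cpu counting idea.  It is not the kernel's percpu_ref code: the names (sketch_ref, NR_SLOTS, CACHELINE) are made up for illustration, the "cpu" index is passed in by hand, and it assumes each slot is only ever touched by its owning thread.

```c
#include <assert.h>

#define NR_SLOTS  4       /* hypothetical CPU count for the sketch */
#define CACHELINE 64      /* assumed cacheline size */

/* One counter per CPU, padded so that slots do not share a cacheline. */
struct sketch_slot {
	long count;
	char pad[CACHELINE - sizeof(long)];
};

struct sketch_ref {
	struct sketch_slot slot[NR_SLOTS];
};

/* get/put only touch the caller's own slot: no shared cacheline, no atomics. */
static void sketch_ref_get(struct sketch_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	ref->slot[cpu].count++;
}

static void sketch_ref_put(struct sketch_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	ref->slot[cpu].count--;
}

/*
 * A single slot may go negative (get on one CPU, put on another);
 * only the sum over all slots is the reference count.
 */
static long sketch_ref_sum(const struct sketch_ref *ref)
{
	long sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += ref->slot[i].count;
	return sum;
}
```

The fast path never reads another CPU's slot, which is why there is no cacheline bouncing; the price is that finding the actual count means walking all the slots.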

So, here are some numbers from the attached test program.  The test is
very simple - inc ref, copy N bytes into a per-cpu buf, dec ref - and
counts how many times it can do that in a given amount of time (15s).
Both single CPU and all CPUs scenarios are tested.  The test is run
inside qemu on my laptop - mobile i7, 2 cores / 4 threads.  Yeah, I
know.  I'll run it on a proper test machine later today.
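For readers without the attachment, the shape of the loop is roughly the following.  This is a userspace sketch, not the attached test program: it covers only the atomic side on a single thread, uses C11 atomics as a stand-in for atomic_t, and times itself with clock(); bench_atomic, sketch_refcnt and BUF_SIZE are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 512      /* largest copy size used in the tables */

static atomic_long sketch_refcnt = 1;
static char src_buf[BUF_SIZE], dst_buf[BUF_SIZE];

/* Count how many inc/copy/dec rounds fit into roughly `seconds` of CPU time. */
static unsigned long bench_atomic(size_t copy_size, double seconds)
{
	unsigned long iters = 0;
	clock_t deadline;

	assert(copy_size <= BUF_SIZE && seconds > 0);
	deadline = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);

	do {
		/* Check the clock once per 1024 rounds to keep it off the hot path. */
		for (int i = 0; i < 1024; i++) {
			atomic_fetch_add(&sketch_refcnt, 1);    /* inc ref */
			memcpy(dst_buf, src_buf, copy_size);    /* copy N bytes */
			atomic_fetch_sub(&sketch_refcnt, 1);    /* dec ref */
			iters++;
		}
	} while (clock() < deadline);

	return iters;
}
```

Calling bench_atomic(32, 15.0) would correspond to the 32-byte row, though absolute numbers from a userspace loop like this are not comparable to the ones below.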

Single CPU case.  Preemption enabled.  This is the best scenario for
atomic_t.  No cacheline bouncing at all.

    copy size (bytes)      atomic_t    percpu_ref       diff

                    0    1198217077    1747505555    +45.84%
                   32     505504457     465714742     -7.87%
                   64     511521639     470741887     -7.97%
                  128     485319952     434920137    -10.38%
                  256     421809359     384871731     -8.76%
                  512     330527059     307587622     -6.94%
For some reason, percpu_ref wins if copy_size is zero.  I don't know
why that is.  The body isn't optimized out so it's still doing all the
refcnting.  Maybe the CPU doesn't have enough work to mask pipeline
bubbles from atomic ops?  In other cases, it's slower by around or
under 10% which isn't exactly noise but this is the worst possible
scenario.  Unless this is the only thing a pinned CPU is doing, it's
unlikely to be noticeable.

Now doing the same thing on multiple CPUs.  Note that while this is
the best scenario for percpu_ref, the hardware the test is run on is
very favorable to atomic_t - it's just two cores on the same package
sharing the L3 cache, so cacheline ping-ponging is relatively cheap.

    copy size (bytes)      atomic_t    percpu_ref       diff

                    0     342659959    3794775739  +1007.45%
                   32     401933493    1337286466   +232.71%
                   64     385459752    1353978982   +251.26%
                  128     401379622    1234780968   +207.63%
                  256     401170676    1052511682   +162.36%
                  512     387796881     794101245   +104.77%

Even on this machine, the difference is huge.  If the refcnt is used
from different CPUs with any frequency, percpu_ref will destroy
atomic_t.  Also note that percpu_ref will scale perfectly as the
number of CPUs increases while atomic_t will get worse.
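
For reference, the diff column in both tables appears to be the percpu_ref count relative to the atomic_t count, as a percentage.  A small helper that reproduces it (diff_percent is a name made up here, not something from the test program):

```c
#include <assert.h>

/* Relative change of percpu_ref over atomic_t, in percent. */
static double diff_percent(double atomic_ops, double percpu_ops)
{
	assert(atomic_ops > 0);
	return (percpu_ops / atomic_ops - 1.0) * 100.0;
}
```

For the zero-byte multi-CPU row, diff_percent(342659959, 3794775739) comes out at about +1007.45, i.e. percpu_ref gets through roughly 11 times as many iterations.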

I'll play with it a bit more on an actual machine and post more
results.  Test program attached.

Thanks.

-- 
tejun

View attachment "test-pcpuref.c" of type "text/plain" (2952 bytes)