linux-kernel - Re: [patch 1/3] mm: memcontrol: lockless page counters

Message-ID: <20141017074718.GB5641@esperanza>
Date:	Fri, 17 Oct 2014 09:47:18 +0200
From:	Vladimir Davydov <vdavydov@...allels.com>
To:	Johannes Weiner <hannes@...xchg.org>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	Michal Hocko <mhocko@...e.cz>, <cgroups@...r.kernel.org>,
	<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [patch 1/3] mm: memcontrol: lockless page counters

On Mon, Oct 13, 2014 at 09:46:01PM -0400, Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
> 
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it.  The translation from and to bytes then
> only happens when interfacing with userspace.
> 
> The removed locking overhead is noticeable when scaling beyond the
> per-cpu charge caches - on a 4-socket machine with 144 threads, the
> following test shows the performance differences of 288 memcgs
> concurrently running a page fault benchmark:
> 
> vanilla:
> 
>    18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
>          1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
>             24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
>      1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
> 50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>  8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
>  1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
>      1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
> 
>      132.474343877 seconds time elapsed                                          ( +-  0.21% )
> 
> lockless:
> 
>    12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
>            832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
>             15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
>      1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
> 32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>  9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
>  2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
>      1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
> 
>       91.369330729 seconds time elapsed                                          ( +-  0.45% )
> 
> On top of improved scalability, this also gets rid of the icky long
> long types in the very heart of memcg, which is great for 32-bit and
> also makes the code a lot more readable.
> 
> Notable differences between the old and new API:
> 
> - res_counter_charge() and res_counter_charge_nofail() become
>   page_counter_try_charge() and page_counter_charge() resp. to match
>   the more common kernel naming scheme of try_do()/do()
> 
> - res_counter_uncharge_until() is only ever used to cancel a local
>   counter and never to uncharge bigger segments of a hierarchy, so
>   it's replaced by the simpler page_counter_cancel()
> 
> - res_counter_set_limit() is replaced by page_counter_limit(), which
>   expects its callers to serialize against themselves
> 
> - res_counter_memparse_write_strategy() is replaced by
>   page_counter_limit(), which rounds down to the nearest page size -
>   rather than up.  This is more reasonable for explicitly requested
>   hard upper limits.
> 
> - to keep charging light-weight, page_counter_try_charge() charges
>   speculatively, only to roll back if the result exceeds the limit.
>   Because of this, a failing bigger charge can temporarily lock out
>   smaller charges that would otherwise succeed.  The error is bounded
>   to the difference between the smallest and the biggest possible
>   charge size, so for memcg, this means that a failing THP charge can
>   send base page charges into reclaim up to 2MB (4MB) before the limit
>   would have been reached.  This should be acceptable.
> 
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>

Definitely better than it was.
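
For readers following the rounding change for limits quoted above, here
is a minimal userspace sketch of converting a byte limit to whole pages,
rounding down (the new behaviour) versus up (the old one).  The names
SKETCH_PAGE_SIZE, bytes_to_pages_down() and bytes_to_pages_up() are
hypothetical and not part of the patch, and a 4 KiB page is assumed:

```c
#include <stdio.h>

/* Assumption: 4 KiB base pages. The kernel uses PAGE_SIZE instead. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical helper: byte limit to whole pages, rounding down. */
static unsigned long bytes_to_pages_down(unsigned long long bytes)
{
	return (unsigned long)(bytes / SKETCH_PAGE_SIZE);
}

/* Hypothetical helper for comparison: byte limit to pages, rounding up. */
static unsigned long bytes_to_pages_up(unsigned long long bytes)
{
	return (unsigned long)((bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}
```

With a requested limit of 8191 bytes, rounding down yields a 1-page
limit, so the group can never hold more than was asked for; rounding up
would yield 2 pages and let it exceed the requested hard limit by one
byte short of a full page.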

Acked-by: Vladimir Davydov <vdavydov@...allels.com>
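
The speculative charging described in the last bullet of the quoted
changelog can be illustrated with a small userspace sketch.  This is
not the code from the patch: it uses C11 atomics instead of the kernel's
atomic_long_t helpers, handles a single counter only (no parent
hierarchy), and struct counter_sketch, sketch_try_charge() and
sketch_uncharge() are made-up names:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-level counter; all values are in pages. */
struct counter_sketch {
	atomic_long count;	/* pages currently charged */
	long limit;		/* hard limit */
};

/*
 * Charge first, check afterwards: add nr_pages unconditionally and
 * roll back if the resulting count exceeds the limit. No lock is taken.
 */
static bool sketch_try_charge(struct counter_sketch *c, long nr_pages)
{
	long new_count = atomic_fetch_add(&c->count, nr_pages) + nr_pages;

	if (new_count > c->limit) {
		/* Over the limit: undo the speculative charge. */
		atomic_fetch_sub(&c->count, nr_pages);
		return false;
	}
	return true;
}

/* Release previously charged pages. */
static void sketch_uncharge(struct counter_sketch *c, long nr_pages)
{
	atomic_fetch_sub(&c->count, nr_pages);
}
```

Between the add and the rollback of a failing large charge, the count
is transiently inflated, so a concurrent small charge may see it above
the limit and fail even though it would have fit.  With 4 KiB pages, a
failing 512-page (2MB) charge can therefore reject single-page charges
up to 2MB early, which is the bounded error the changelog describes.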