linux-kernel - Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20201130084820.GB17338@dhcp22.suse.cz>
Date:   Mon, 30 Nov 2020 09:48:20 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Feng Tang <feng.tang@...el.com>
Cc:     Xing Zhengjun <zhengjun.xing@...ux.intel.com>,
        Waiman Long <longman@...hat.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Chris Down <chris@...isdown.name>,
        Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <guro@...com>, Tejun Heo <tj@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Yafang Shao <laoar.shao@...il.com>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        lkp@...el.com, zhengjun.xing@...el.com, ying.huang@...el.com,
        andi.kleen@...el.com
Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops
 -22.7% regression

On Wed 25-11-20 14:24:45, Feng Tang wrote:
[...]
> I think we finally found the trick :), further debugging shows it
> is not related to the alignment inside one cacheline, but the
> adjacency of 2 adjacent cacheliens (2N and 2N+1, one pair of 128 bytes).
> 
> For structure mem_cgroup, member 'vmstats_local', 'vmstats_percpu'
> sit in one cacheline, while 'vmstats[]' sits in the next cacheline,
> and when 'adjacent cacheline prefetch" is enabled, if these 2 lines
> sit in one pair (128 btyes), say 2N and 2N+1, then there seems to
> be some kind of false sharing, and if they sit in 2 pairs, say
> 2N-1 and 2N then it's fine.
> 
> And with the following patch to relayout these members, the regression
> is restored and event better. while reducing 64 bytes of sizeof
> 'struct mem_cgroup'
> 
> 	parent_commit	Waiman's_commit		+relayout patch
> 
> result	187K		145K			200K
> 
> Also, if we disable the hw prefetch feature, the Waiman's commit
> and its parent commit will have no performance difference.
> 
> Thanks,
> Feng
> 
> >From 2e63af34fa4853b2dd9669867c37a3cf07f7a505 Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@...el.com>
> Date: Wed, 25 Nov 2020 13:22:21 +0800
> Subject: [PATCH] mm: memcg: relayout structure mem_cgroup to avoid cache
>  interfereing
> 
> 0day reported one -22.7% regression for will-it-scale page_fault2
> case [1] on a 4 sockets 144 CPU platform, and bisected to it to be
> caused by Waiman's optimization (commit bd0b230fe1) of saving one
> 'struct page_counter' space for 'struct mem_cgroup'.
> 
> Initially we thought it was due to the cache alignment change introduced
> by the patch, but further debug shows that it is due to some hot data
> members ('vmstats_local', 'vmstats_percpu', 'vmstats') sit in 2 adjacent
> cacheline (2N and 2N+1 cacheline), and when adjacent cache line prefetch
> is enabled, it triggers an "extended level" of cache false sharing for
> 2 adjacent cache lines.
> 
> So exchange the 2 member blocks, while keeping mostly the original
> cache alignment, which can restore and even enhance the performance,
> and save 64 bytes of space for 'struct mem_cgroup' (from 2880 to 2816,
> with 0day's default RHEL-8.3 kernel config)
> 
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> 
> Fixes: bd0b230fe145 ("mm/memcg: unify swap and memsw page counters")
> Reported-by: kernel test robot <rong.a.chen@...el.com>
> Signed-off-by: Feng Tang <feng.tang@...el.com>

Sorry for a late reply. This is indeed surprising! I was really
expecting page counter to be the culprit. Anyway this rearrangement 
looks ok as well. moving_account related stuff is still after padding
which is good because this rare operation shouldn't really interfere
with the rest of the structure. Btw. now you made me look into the
history and I have noticed e81bf9793b18 ("mem_cgroup: make sure
moving_account, move_lock_task and stat_cpu in the same cacheline") so
this is not the first time we are dealing with a regression here.

Linus has already merged the patch but for the record
Acked-by: Michal Hocko <mhocko@...e.com>

Thanks a lot for pursuing this!

-- 
Michal Hocko
SUSE Labs