Message-ID: <20210817024500.GC72770@shbuild999.sh.intel.com>
Date: Tue, 17 Aug 2021 10:45:00 +0800
From: Feng Tang <feng.tang@...el.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
kernel test robot <oliver.sang@...el.com>,
Roman Gushchin <guro@...com>, Michal Hocko <mhocko@...e.com>,
Shakeel Butt <shakeelb@...gle.com>,
Michal Koutný <mkoutny@...e.com>,
Balbir Singh <bsingharora@...il.com>,
Tejun Heo <tj@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
kernel test robot <lkp@...el.com>,
"Huang, Ying" <ying.huang@...el.com>,
Zhengjun Xing <zhengjun.xing@...ux.intel.com>,
andi.kleen@...el.com
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Mon, Aug 16, 2021 at 05:41:57PM -0400, Johannes Weiner wrote:
> On Mon, Aug 16, 2021 at 11:28:55AM +0800, Feng Tang wrote:
> > On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> > > On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> > [SNIP]
> >
> > > And it seems there is some cache false sharing when accessing the mem_cgroup
> > > member 'struct cgroup_subsys_state'. Judging from the offsets (0x0 and 0x10
> > > here) and the calling sites, the cache false sharing could happen between:
> > >
> > >   cgroup_rstat_updated            (read memcg->css.cgroup, offset 0x0)
> > > and
> > >   get_mem_cgroup_from_mm
> > >     css_tryget(&memcg->css)       (read/write memcg->css.refcnt, offset 0x10)
> > >
> > > (This could be wrong as many of the functions are inlined, and the
> > > exact calling site isn't shown)
>
> Thanks for digging more into this.
>
> The offset 0x0 access is new in the page instantiation path with the
> bisected patch, so that part makes sense. The new sequence is this:
>
> shmem_add_to_page_cache()
>   mem_cgroup_charge()
>     get_mem_cgroup_from_mm()
>       css_tryget()                    # touches memcg->css.refcnt
>   xas_store()
>   __mod_lruvec_page_state()
>     __mod_lruvec_state()
>       __mod_memcg_lruvec_state()
>         __mod_memcg_state()
>           __this_cpu_add()
>           cgroup_rstat_updated()      # touches memcg->css.cgroup
>
> whereas before, __mod_memcg_state() would just do stuff inside memcg.

Yes, the perf record/report data also shows these two as hotspots: one takes
about 6% of CPU cycles, the other about 10%.

> However, css.refcnt is a percpu-refcount. We should see a read-only
> lookup of the base pointer inside this cacheline, with the write
> occurring in percpu memory elsewhere. Even if it were in atomic/shared
> mode, which it shouldn't be for the root cgroup, the shared atomic_t
> is also located in an auxiliary allocation and shouldn't overlap with
> the cgroup pointer in any way.
>
> The css itself is embedded inside struct mem_cgroup, which does see
> modifications. But the closest of those is 3 cachelines down (struct
> page_counter memory), so that doesn't make sense, either.
>
> Does this theory require writes? Because I don't actually see any (hot
> ones, anyway) inside struct cgroup_subsys_state for this workload.

You are right, the access to 'css.refcnt' is a read, and false sharing is
interference between a read and a write. I had presumed it was a global
reference count, so that css_tryget() would be a write operation.
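
To double-check my understanding of that, here is a toy userspace model of
the percpu-mode fast path (a sketch only, not the kernel's percpu_ref
implementation): the "tryget" only reads the base pointer stored inside the
ref, and the increment lands in separate per-CPU storage, so the ref's own
cache line is not written.

/*
 * Toy model only, not the kernel's percpu_ref code: in percpu mode the
 * hot "tryget" path reads the base pointer held inside the ref and bumps
 * a counter that lives in separate per-CPU storage, so the cache line
 * holding the ref itself is only read.
 */
#include <stdio.h>

#define NR_CPUS	4

struct toy_percpu_ref {
	unsigned long *percpu_count;		/* only read on the fast path */
};

static unsigned long toy_counts[NR_CPUS];	/* stand-in for percpu memory */

static void toy_tryget(struct toy_percpu_ref *ref, int cpu)
{
	ref->percpu_count[cpu]++;	/* write goes to toy_counts[], not *ref */
}

int main(void)
{
	struct toy_percpu_ref ref = { .percpu_count = toy_counts };

	toy_tryget(&ref, 0);
	toy_tryget(&ref, 1);
	printf("cpu0=%lu cpu1=%lu\n", toy_counts[0], toy_counts[1]);
	return 0;
}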

Initially, from the perf-c2c data, the in-cacheline hotspots are only
0x0 and 0x10; if we extend the view to 2 cachelines, there is one more
offset, 0x54 (css.flags), but I still can't figure out which member
inside the 128-byte range is written frequently.

/* pahole info for cgroup_subsys_state */
struct cgroup_subsys_state {
	struct cgroup *            cgroup;               /*     0     8 */
	struct cgroup_subsys *     ss;                   /*     8     8 */
	struct percpu_ref          refcnt;               /*    16    16 */
	struct list_head           sibling;              /*    32    16 */
	struct list_head           children;             /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct list_head           rstat_css_node;       /*    64    16 */
	int                        id;                   /*    80     4 */
	unsigned int               flags;                /*    84     4 */
	u64                        serial_nr;            /*    88     8 */
	atomic_t                   online_cnt;           /*    96     4 */

	/* XXX 4 bytes hole, try to pack */

	struct work_struct         destroy_work;         /*   104    32 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
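
To make the line/pair math easier to follow, here is a small userspace mock
(mock types only, mirroring the offsets above, not the real kernel structs)
that prints which 64-byte cache line and which 128-byte adjacent-line pair
each hot offset falls into:

/*
 * Mock layout mirroring the pahole output above (not the real kernel
 * types): print which 64-byte cache line and which 128-byte
 * adjacent-line pair each hot offset falls into.
 */
#include <stdio.h>
#include <stddef.h>

struct mock_css {
	void		*cgroup;		/* 0x00, read on the rstat update path */
	void		*ss;			/* 0x08 */
	char		refcnt[16];		/* 0x10, read by css_tryget() */
	char		sibling[16];		/* 0x20 */
	char		children[16];		/* 0x30 */
	char		rstat_css_node[16];	/* 0x40 */
	int		id;			/* 0x50 */
	unsigned int	flags;			/* 0x54, the extra 2-cacheline hotspot */
};

static void where(const char *name, size_t off)
{
	printf("%-8s offset 0x%02zx  64B line %zu  128B pair %zu\n",
	       name, off, off / 64, off / 128);
}

int main(void)
{
	where("cgroup", offsetof(struct mock_css, cgroup));
	where("refcnt", offsetof(struct mock_css, refcnt));
	where("flags",  offsetof(struct mock_css, flags));
	return 0;
}

All three hot offsets (0x0, 0x10 and 0x54) land in the same 128-byte pair.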

Since the test runs imply this is cacheline-related, and I'm not very
familiar with the mem_cgroup code, the original perf-c2c log is attached,
which may give more hints.

Thanks,
Feng
> > > And to verify this, we did a test by adding padding between
> > > memcg->css.cgroup and memcg->css.refcnt to push them into 2
> > > different cache lines, and the performance is partly restored:
> > >
> > > dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
> > > ---------------- --------------------------- ---------------------------
> > > 65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput
> > >
> > > We are still checking more, and will update if there is new data.
> >
> > It seems this is the second case where we hit 'adjacent cacheline prefetch';
> > the first time we saw it, it was also related to mem_cgroup:
> > https://lore.kernel.org/lkml/20201125062445.GA51005@shbuild999.sh.intel.com/
> >
> > In the previous debug patch, 'css.cgroup' and 'css.refcnt' were separated
> > into 2 cache lines, which are still adjacent (2N and 2N+1) cachelines.
> > With more padding (adding 128 bytes of padding in between), the
> > performance is restored, and is even better (test run 3 times):
> >
> > dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 2e34d6daf5fbab0fb286dcdb3bc
> > ---------------- --------------------------- ---------------------------
> > 65523232 ± 4% -40.8% 38817332 ± 5% +23.4% 80862243 ± 3% vm-scalability.throughput
> >
> > The debug patch is:
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -142,6 +142,8 @@ struct cgroup_subsys_state {
> >  	/* PI: the cgroup subsystem that this css is attached to */
> >  	struct cgroup_subsys *ss;
> >
> > +	unsigned long pad[16];
> > +
> >  	/* reference count - access via css_[try]get() and css_put() */
> >  	struct percpu_ref refcnt;
>
> We aren't particularly space-constrained in this structure, so padding
> should generally be acceptable here.
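
As a side note on the 2N/2N+1 observation above, here is a toy comparison
(mock layouts only, not the real struct): a 64-byte pad just moves refcnt to
the adjacent line of the same 128-byte pair, while the 128-byte pad used in
the debug patch pushes it into a different pair, which lines up with the
throughput numbers above.

/*
 * Mock layouts only (not the real kernel struct): compare where refcnt
 * lands relative to cgroup with a 64-byte pad vs. the 128-byte pad from
 * the debug patch. Two offsets in the same off/128 "pair" are still
 * pulled in together by the adjacent cacheline prefetcher.
 */
#include <stdio.h>
#include <stddef.h>

struct css_pad64 {
	void *cgroup;			/* 0x00 */
	void *ss;
	char  pad[48];			/* pad refcnt to the next 64-byte line */
	char  refcnt[16];
};

struct css_pad128 {
	void *cgroup;			/* 0x00 */
	void *ss;
	unsigned long pad[16];		/* 128 bytes, as in the debug patch */
	char  refcnt[16];
};

static void report(const char *name, size_t off)
{
	printf("%-8s refcnt at 0x%02zx  64B line %zu  128B pair %zu\n",
	       name, off, off / 64, off / 128);
}

int main(void)
{
	report("pad64",  offsetof(struct css_pad64, refcnt));
	report("pad128", offsetof(struct css_pad128, refcnt));
	return 0;
}

With 'cgroup' in pair 0, the 64-byte pad keeps refcnt in pair 0 (lines 0
and 1), while the 128-byte pad moves it to pair 1.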