[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <iwlpzi4qnpqri6wegibnsvth4yfdszksfvfyiei3qb3a4serbv@zrw3zsp55zoh>
Date: Fri, 17 May 2024 20:08:41 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: "zhangpeng (AS)" <zhangpeng362@...wei.com>,
Rongwei Wang <rongwei.wrw@...il.com>, linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>, dennisszhou@...il.com, shakeelb@...gle.com, jack@...e.cz,
Suren Baghdasaryan <surenb@...gle.com>, kent.overstreet@...ux.dev, mhocko@...e.cz, vbabka@...e.cz,
Yu Zhao <yuzhao@...gle.com>, yu.ma@...el.com, wangkefeng.wang@...wei.com,
sunnanyong@...wei.com
Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
On Fri, May 17, 2024 at 11:29:57AM +0800, Kairui Song wrote:
> Mateusz Guzik <mjguzik@...il.com> 于 2024年5月16日周四 23:14写道:
> > A part of The Real Solution(tm) would make counter allocations scale
> > (including mcid, not just rss) or dodge them (while maintaining the
> > per-cpu distribution, see below for one idea), but that boils down to
> > balancing scalability versus total memory usage. It is trivial to just
> > slap together a per-cpu cache of these allocations and have the problem
> > go away for benchmarking purposes, while being probably being too memory
> > hungry for actual usage.
> >
> > I was pondering an allocator with caches per some number of cores (say 4
> > or 8). Microbenchmarks aside I suspect real workloads would not suffer
> > from contention at this kind of granularity. This would trivially reduce
> > memory usage compared to per-cpu caching. I suspect things like
> > mm_struct, task_struct, task stacks and similar would be fine with it.
> >
> > Suppose mm_struct is allocated from a more coarse grained allocator than
> > per-cpu. Total number of cached objects would be lower than it is now.
> > That would also mean these allocated but not currently used mms could
> > hold on to other stuff, for example per-cpu rss and mcid counters. Then
> > should someone fork or exit, alloc/free_percpu would be avoided for most
> > cases. This would scale better and be faster single-threaded than the
> > current state.
>
> And what is the issue with using only one CPU cache, and flush on mm
> switch? No more alloc after boot, and the total (and fixed) memory
> usage is just about a few unsigned long per CPU, which should be even
> lower that the old RSS cache solution (4 unsigned long per task). And
> it scaled very well with many kinds of microbench or workload I've
> tested.
>
> Unless the workload keeps doing something like "alloc one page then
> switch to another mm", I think the performance will be horrible
> already due to cache invalidations and many switch_*()s, RSS isn't
> really a concern there.
>
I only skimmed through your patchset. I do think it has a legitimate
approach, but personally I would not do it like that due to the extra
work on context switches. However, I have 0 say about this, so you will
need to prod the mm overlords to get this moving forward.
Maybe I was not clear enough in my opening e-mail, so I'm going to
reiterate some bits: there are scalability problems in execve even with
your patchset or the one which uses atomics. One of them concerns
another bit which allocates per-cpu memory (the mcid thing).
I note that sorting it out would possibly also take care of the rss
problem, outlining an example approach above.
Powered by blists - more mailing lists