Message-ID: <qmmncacgzhby5ewkzbu2gc7dovawyzevgyhwmdp6te6ez3svp2@ns3bcq6dvq6p>
Date: Sat, 29 Nov 2025 08:50:41 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Gabriel Krisman Bertazi <krisman@...e.de>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Shakeel Butt <shakeel.butt@...ux.dev>, Michal Hocko <mhocko@...nel.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...two.org>,
Andrew Morton <akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for
single-threaded tasks
On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
>
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
> 	for_each_possible_cpu(i) {
> 		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
>
> 		pcpu_cid->cid = MM_CID_UNSET;
> 		pcpu_cid->recent_cid = MM_CID_UNSET;
> 		pcpu_cid->time = 0;
> 	}
>
> There is no way this is not visible already on 256 threads.
>
> Preferably some magic would be done to init this on first use on a given
> CPU. There is already a bitmap tracking CPU presence; maybe this can be
> tackled on top of that. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing the memory.
>
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
>
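As a side note on the "walk once" idea quoted above, in plain C it just
means allocating without an implicit memset and doing the field init in a
single pass. Here's a quick userspace sketch; alloc_percpu_noinit() is a
made-up stand-in, the kernel percpu allocator has no such public knob
today, which is exactly the point:

```c
/* Userspace sketch of the "walk once" idea: allocate the per-cpu area
 * without zeroing it, then do all field init in a single pass instead
 * of a zeroing loop followed by an init loop. alloc_percpu_noinit() is
 * a made-up stand-in, not a real kernel API. */
#include <stdlib.h>

#define NR_CPUS 8
#define MM_CID_UNSET (-1)

struct mm_cid_sim {
	int cid;
	int recent_cid;
	unsigned long time;
};

/* hypothetical non-zeroing allocation: plain malloc, no memset */
static struct mm_cid_sim *alloc_percpu_noinit(void)
{
	return malloc(sizeof(struct mm_cid_sim) * NR_CPUS);
}

/* single walk: every field is stored exactly once */
static struct mm_cid_sim *mm_alloc_init_cid(void)
{
	struct mm_cid_sim *p = alloc_percpu_noinit();

	if (!p)
		return NULL;
	for (int i = 0; i < NR_CPUS; i++) {
		p[i].cid = MM_CID_UNSET;
		p[i].recent_cid = MM_CID_UNSET;
		p[i].time = 0;
	}
	return p;
}
```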
So I had another look and I think bitmapping it is perfectly feasible,
albeit requiring a little bit of refactoring to avoid adding overhead in
the common case.
There is a bitmap used for TLB tracking, updated like so on context
switch in switch_mm_irqs_off():

	if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
		cpumask_set_cpu(cpu, mm_cpumask(next));

... and of course cleared at times.
The easiest way out would be to add an additional bitmap whose bits are
*never* cleared. But that's another cache miss, preferably avoided.
Instead the entire thing could be reimplemented to have 2 bits per CPU
in the bitmap -- one for TLB presence and another recording that the mm
ever ran on that CPU. On spotting that the mm is running on a given CPU
for the first time, the rss area gets zeroed out and *both* bits get
set, et voila. The common case gets away with the same single load as
always. The less common case does the extra work of zeroing the rss
counters and initializing the cid state.
In return both cid and rss handling can avoid mandatory linear walks by
cpu count, instead merely having to visit the cpus known to have used a
given mm.
I don't think this is particularly ugly or complicated, it just needs
some care & time to sit down and refactor all the direct accesses into
helpers.
So if I was tasked with working on the overall problem, I would
definitely try to get this done. Fortunately for me this is not the
case. :-)