linux-kernel - Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu

Message-ID: <CAGudoHHQ4y0Z_A0yzpfim_wGFVUuF3NaLgNtWUiquiCby6Ppkg@mail.gmail.com>
Date: Tue, 8 Apr 2025 09:46:15 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@...miny.me>, Andrew Morton <akpm@...ux-foundation.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Dennis Zhou <dennis@...nel.org>, 
	Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>, Martin Liu <liumartin@...gle.com>, 
	David Rientjes <rientjes@...gle.com>, Christian König <christian.koenig@....com>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Johannes Weiner <hannes@...xchg.org>, 
	Sweet Tea Dorminy <sweettea@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Liam R . Howlett" <Liam.Howlett@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>, 
	Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>, 
	Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>, 
	Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters

On Fri, Apr 4, 2025 at 6:51 PM Kairui Song <ryncsn@...il.com> wrote:
>
> On Thu, Apr 3, 2025 at 10:31 PM Mateusz Guzik <mjguzik@...il.com> wrote:
> > Note there are 2 unrelated components in that patchset:
> > - one per-cpu instance of rss counters which is rolled up on context
> > switches, avoiding the costly counter alloc/free on mm
> > creation/teardown
> > - cpu iteration in get_mm_counter
> >
> > The allocation problem is fixable without abandoning the counters, see
> > my other e-mail (tl;dr let mm's hanging out in slab caches *keep* the
> > counters). This aspect has to be solved anyway due to mm_alloc_cid().
> > Providing a way to sort it out covers *both* the rss counters and the
> > cid thing.
>
> It's not just about the fork performance, on some servers there could
> be ~100K processes and ~200 CPUs, that will be hundreds of MBs of
> memory just for the counters.
>
> And nowadays it's not something uncommon for a desktop to have ~64
> CPUs and ~10K processes.
>
> If we use a single shared "per-cpu" counter (as in the patch), the
> total consumption will always be only about just dozens of bytes.
>

I agree there is a tradeoff here and your approach saves memory in
exchange for more work during a context switch.

I have no opinion on which way to go here.
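
For scale, the memory figures quoted above can be reproduced with a
back-of-the-envelope calculation. The sketch below is only an estimate
under stated assumptions: 4 rss counters per mm and one 4-byte per-CPU
slot per counter. The helper name and the constant are hypothetical,
and the fixed part of each counter plus any per-cpu allocator overhead
are ignored.

```c
#include <stdint.h>

/* Assumption: number of rss counters tracked per mm. */
#define EST_NR_MM_COUNTERS 4

/*
 * Hypothetical helper: estimated bytes of per-CPU counter storage for
 * nr_mm address spaces on a machine with nr_cpus possible CPUs,
 * assuming one 32-bit slot per counter per CPU.
 */
uint64_t est_rss_percpu_bytes(uint64_t nr_mm, uint64_t nr_cpus)
{
	return nr_mm * nr_cpus * EST_NR_MM_COUNTERS * sizeof(int32_t);
}
```

With these assumptions, 100K processes on 200 CPUs come to 320,000,000
bytes (roughly 305 MiB), and 10K processes on 64 CPUs come to
10,240,000 bytes (just under 10 MiB).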

> >
> > In your patchset the accuracy increase comes at the expense of walking
> > all CPUs every time, while a big part of the point of using percpu
> > counters is to have a good enough approximation somewhere that this is
> > not necessary.
>
> It usually doesn't walk all CPUs, only the CPUs that actually used
> that mm_struct, by checking mm_struct's cpu_bitmap. I didn't check if
> all arch uses that bitmap though.
>
> It's true that one CPU having its bit set on one mm_struct's
> cpu_bitmap doesn't mean it updated the RSS counter so there will be
> false positives, the false positive rate is low as schedulers don't
> shuffle processes between processors randomly, and not every process
> will be run in a given period.
>
> Also per my observation the reader side is much colder compared to
> updater for /proc.
>

Per my earlier comment, the read side is hit a lot by mmap and munmap,
so it cannot be taken lightly. You can check for yourself with bpftrace.

While I can agree the vast majority of processes are not very
thread-heavy and the vast majority of machines out there don't have
hundreds of cores, this does have to behave sanely for the cases which
*do* exhibit these conditions. For example, a box with > 200 cores and
200+ threads to boot, all running on the entirety of the box.

In your patch as posted, fetching the value will force the walk *a lot*
and is consequently a no-go. This aspect needs to be dealt with for
the patchset to be ok. Otherwise a few months down the road someone
else will show up and complain about a new slowdown stemming from it.
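
To illustrate the "good enough approximation" point, here is a
single-threaded userspace sketch of a batched counter. It is not the
kernel's percpu_counter code and it has no locking or atomics. The
names (sim_counter, SIM_NR_CPUS, SIM_BATCH) are made up for the
example. Each simulated CPU accumulates a local delta and folds it into
the shared total only once the delta reaches the batch size. That
makes the fast read a single load, while the exact read has to walk
every CPU.

```c
#include <stdint.h>

#define SIM_NR_CPUS 8	/* hypothetical CPU count for the simulation */
#define SIM_BATCH   32	/* hypothetical fold threshold */

struct sim_counter {
	int64_t count;			/* shared, approximate total */
	int32_t delta[SIM_NR_CPUS];	/* per-CPU deltas not yet folded */
};

/* Update path: touch only this CPU's delta unless the batch is reached. */
void sim_counter_add(struct sim_counter *c, int cpu, int32_t amount)
{
	int32_t d = c->delta[cpu] + amount;

	if (d >= SIM_BATCH || d <= -SIM_BATCH) {
		c->count += d;		/* fold into the shared total */
		c->delta[cpu] = 0;
	} else {
		c->delta[cpu] = d;
	}
}

/* Fast read: O(1), may lag by less than SIM_NR_CPUS * SIM_BATCH. */
int64_t sim_counter_read(const struct sim_counter *c)
{
	return c->count;
}

/* Exact read: O(number of CPUs), the walk that gets expensive. */
int64_t sim_counter_sum(const struct sim_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sum += c->delta[cpu];
	return sum;
}
```

In this model a caller that can tolerate the bounded error uses
sim_counter_read() and only falls back to sim_counter_sum() when the
approximate value is too close to a threshold to decide.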

-- 
Mateusz Guzik <mjguzik gmail.com>