linux-kernel - Re: [PATCH 0/2] execve scalability issues, part 1

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGudoHHe5nzRTuj4G1fphD+JJ02TE5BnHEDwFm=-W6DoEj2qVQ@mail.gmail.com>
Date:   Tue, 22 Aug 2023 16:24:56 +0200
From:   Mateusz Guzik <mjguzik@...il.com>
To:     Jan Kara <jack@...e.cz>
Cc:     Dennis Zhou <dennis@...nel.org>, linux-kernel@...r.kernel.org,
        tj@...nel.org, cl@...ux.com, akpm@...ux-foundation.org,
        shakeelb@...gle.com, linux-mm@...ck.org
Subject: Re: [PATCH 0/2] execve scalability issues, part 1

On 8/22/23, Jan Kara <jack@...e.cz> wrote:
> On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> On 8/21/23, Mateusz Guzik <mjguzik@...il.com> wrote:
>> > True Fix(tm) is a longer story.
>> >
>> > Maybe let's sort out this patchset first, whichever way. :)
>> >
>>
>> So I found the discussion around the original patch with a perf
>> regression report.
>>
>> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>>
>> The reporter suggests dodging the problem by only allocating per-cpu
>> counters when the process is going multithreaded. Given that there is
>> still plenty of forever single-threaded procs out there I think that's
>> does sound like a great plan regardless of what happens with this
>> patchset.
>>
>> Almost all access is already done using dedicated routines, so this
>> should be an afternoon churn to sort out, unless I missed a
>> showstopper. (maybe there is no good place to stuff a flag/whatever
>> other indicator about the state of counters?)
>>
>> That said I'll look into it some time this or next week.
>
> Good, just let me know how it went, I also wanted to start looking into
> this to come up with some concrete patches :). What I had in mind was that
> we could use 'counters == NULL' as an indication that the counter is still
> in 'single counter mode'.
>

In the current state there are only pointers to counters in mm_struct
and there is no storage for them in task_struct. So I don't think
merely null-checking the per-cpu stuff is going to cut it -- where
should the single-threaded counters land?

Bonus problem, non-current can modify these counters and this needs to
be safe against current playing with them at the same time. (and it
would be a shame to require current to use atomic on them)

That said, my initial proposal adds a union:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..ea70f0c08286 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -737,7 +737,11 @@ struct mm_struct {

                unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
/proc/PID/auxv */

-               struct percpu_counter rss_stat[NR_MM_COUNTERS];
+               union {
+                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
+                       u64 *rss_stat_single;
+               };
+               bool    magic_flag_stuffed_elsewhere;

                struct linux_binfmt *binfmt;

Then for single-threaded case an area is allocated for NR_MM_COUNTERS
countes * 2 -- first set updated without any synchro by current
thread. Second set only to be modified by others and protected with
mm->arg_lock. The lock protects remote access to the union to begin
with.

Transition to per-CPU operation sets the magic flag (there is plenty
of spare space in mm_struct, I'll find a good home for it without
growing the struct). It would be a one-way street -- a process which
gets a bunch of threads and goes back to one stays with per-CPU.

Then you get the true value of something by adding both counters.

arg_lock is sparingly used, so remote ops are not expected to contend
with anything. In fact their cost is going to go down since percpu
summation takes a spinlock which also disables interrupts.

Local ops should be about the same in cost as they are right now.

I might have missed some detail in the above description, but I think
the approach is decent.

-- 
Mateusz Guzik <mjguzik gmail.com>