linux-kernel - Re: [PATCH v8 1/2] sched: Move task_mm_cid_work to mm work

Message-ID: <5fc6daa0-30d6-4fcd-b58b-a570aeed5691@efficios.com>
Date: Thu, 20 Feb 2025 10:30:41 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, linux-mm@...ck.org
Cc: Ingo Molnar <mingo@...nel.org>, Shuah Khan <shuah@...nel.org>
Subject: Re: [PATCH v8 1/2] sched: Move task_mm_cid_work to mm work_struct

On 2025-02-20 09:42, Mathieu Desnoyers wrote:
> On 2025-02-20 05:26, Gabriele Monaco wrote:
>> Currently, the task_mm_cid_work function is called in a task work
>> triggered by a scheduler tick to frequently compact the mm_cids of each
>> process. This can delay the execution of the corresponding thread for
>> the entire duration of the function, negatively affecting the response
>> in case of real time tasks. In practice, we observe task_mm_cid_work
>> increasing the latency of 30-35us on a 128 cores system, this order of
>> magnitude is meaningful under PREEMPT_RT.
>>
>> Run the task_mm_cid_work in a new work_struct connected to the
>> mm_struct rather than in the task context before returning to
>> userspace.
>>
>> This work_struct is initialised with the mm and disabled before freeing
>> it. The queuing of the work happens while returning to userspace in
>> __rseq_handle_notify_resume, maintaining the checks to avoid running
>> more frequently than MM_CID_SCAN_DELAY.
>> To make sure this happens predictably also on long running tasks, we
>> trigger a call to __rseq_handle_notify_resume also from the scheduler
>> tick (which in turn will also schedule the work item).
>>
>> The main advantage of this change is that the function can be offloaded
>> to a different CPU and even preempted by RT tasks.
>>
>> Moreover, this new behaviour is more predictable with periodic tasks
>> with short runtime, which may rarely run during a scheduler tick.
>> Now, the work is always scheduled when the task returns to userspace.
>>
>> The work is disabled during mmdrop, since the function cannot sleep in
>> all kernel configurations, we cannot wait for possibly running work
>> items to terminate. We make sure the mm is valid in case the task is
>> terminating by reserving it with mmgrab/mmdrop, returning prematurely if
>> we are really the last user while the work gets to run.
>> This situation is unlikely since we don't schedule the work for exiting
>> tasks, but we cannot rule it out.
>>
>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
>> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
>> ---
> [...]
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 9aecd914ac691..363e51dd25175 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5663,7 +5663,7 @@ void sched_tick(void)
>>           resched_latency = cpu_resched_latency(rq);
>>       calc_global_load_tick(rq);
>>       sched_core_tick(rq);
>> -    task_tick_mm_cid(rq, donor);
>> +    rseq_preempt(donor);
>>       scx_tick(rq);
>>       rq_unlock(rq, &rf);
> 
> There is one tiny important detail worth discussing here: I wonder if
> executing a __rseq_handle_notify_resume() on return to userspace on
> every scheduler tick will cause noticeable performance degradation ?
> 
> I think we can mitigate the impact if we can quickly compute the amount
> of contiguous unpreempted runtime since last preemption, then we could
> use this as a way to only issue rseq_preempt() when there has been a
> minimum amount of contiguous unpreempted execution. Otherwise the
> rseq_preempt() already issued by preemption is enough.
> 
> I'm not entirely sure how to compute this "unpreempted contiguous
> runtime" value within sched_tick() though, any ideas ?

I just discussed this with Peter over IRC, here is a possible way
forward for this:

The fair class has the information we are looking for as:

   se->sum_exec_runtime - se->prev_sum_exec_runtime

For the rt and dl classes, we'll need to keep track of prev_sum_exec_runtime
in their respective set_next_entity() in the same way as fair does.
AFAIU it is currently tracked in neither rt nor dl.
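
To make the bookkeeping concrete, here is a minimal userspace model of the
snapshot that fair takes when an entity is picked to run, and which rt and dl
would have to replicate. Only the two field names mirror struct sched_entity;
the struct and helpers (model_se, model_pick, model_account,
model_unpreempted_ns) are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Userspace model only: mirrors two fields of struct sched_entity. */
struct model_se {
	uint64_t sum_exec_runtime;      /* total runtime accumulated so far, in ns */
	uint64_t prev_sum_exec_runtime; /* snapshot taken when last picked to run */
};

/* Hypothetical helper: take the snapshot when the entity is picked to run. */
static inline void model_pick(struct model_se *se)
{
	se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

/* Hypothetical helper: account delta_ns of runtime to the running entity. */
static inline void model_account(struct model_se *se, uint64_t delta_ns)
{
	se->sum_exec_runtime += delta_ns;
}

/* Contiguous runtime since the entity was last picked, in ns. */
static inline uint64_t model_unpreempted_ns(const struct model_se *se)
{
	return se->sum_exec_runtime - se->prev_sum_exec_runtime;
}
```

The difference resets to zero each time the entity is picked again, so it
only grows while the task keeps running without being switched out.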

Then we can decide on a threshold of consecutive runtime that makes
sense for triggering an rseq_preempt() from sched_tick(), and use that to
lessen its impact.
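
A rough sketch of what that check could look like, written as standalone C
so it can be compiled outside the kernel. The helper name
model_tick_wants_rseq_preempt() and the MODEL_RSEQ_TICK_THRESHOLD_NS constant
are assumptions for illustration; the threshold value is a placeholder, not a
measured or proposed number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: placeholder value, not tuned or proposed. */
#define MODEL_RSEQ_TICK_THRESHOLD_NS (100ULL * 1000 * 1000)

/*
 * Hypothetical helper: decide whether the tick should issue rseq_preempt().
 * Takes the two runtime counters of the current entity, in ns.
 * Returns true only when the task has run without being switched out for at
 * least the threshold; below that, the rseq_preempt() already issued by the
 * last preemption is assumed to be recent enough.
 */
static inline bool model_tick_wants_rseq_preempt(uint64_t sum_exec_runtime,
						 uint64_t prev_sum_exec_runtime)
{
	uint64_t unpreempted_ns = sum_exec_runtime - prev_sum_exec_runtime;

	return unpreempted_ns >= MODEL_RSEQ_TICK_THRESHOLD_NS;
}
```

In sched_tick(), a check of this kind would guard the rseq_preempt(donor)
call, so that frequently preempted tasks skip the extra notify-resume work.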

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com