Message-ID: <f76f6c3a-a1c1-4bdc-bc0b-f419446cac9b@redhat.com>
Date: Thu, 20 Feb 2025 17:31:00 +0000 (UTC)
From: Gabriele Monaco <gmonaco@...hat.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, linux-mm@...ck.org,
Ingo Molnar <mingo@...nel.org>, Shuah Khan <shuah@...nel.org>
Subject: Re: [PATCH v8 1/2] sched: Move task_mm_cid_work to mm work_struct
2025-02-20T15:47:26Z Mathieu Desnoyers <mathieu.desnoyers@...icios.com>:
> On 2025-02-20 10:30, Gabriele Monaco wrote:
>>
>> On Thu, 2025-02-20 at 09:42 -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-20 05:26, Gabriele Monaco wrote:
>>>> Currently, the task_mm_cid_work function is called in a task work
>>>> triggered by a scheduler tick to frequently compact the mm_cids of
>>>> each process. This can delay the execution of the corresponding
>>>> thread for the entire duration of the function, negatively affecting
>>>> responsiveness for real-time tasks. In practice, we observe
>>>> task_mm_cid_work increasing the latency by 30-35us on a 128-core
>>>> system; this order of magnitude is significant under PREEMPT_RT.
>>>>
>>>> Run task_mm_cid_work in a new work_struct connected to the
>>>> mm_struct rather than in the task context before returning to
>>>> userspace.
>>>>
>>>> This work_struct is initialised with the mm and disabled before
>>>> freeing it. The work is queued while returning to userspace in
>>>> __rseq_handle_notify_resume, maintaining the checks to avoid running
>>>> more frequently than MM_CID_SCAN_DELAY.
>>>> To make sure this also happens predictably for long-running tasks,
>>>> we trigger a call to __rseq_handle_notify_resume from the scheduler
>>>> tick as well (which in turn also schedules the work item).
>>>>
>>>> The main advantage of this change is that the function can be
>>>> offloaded to a different CPU and even preempted by RT tasks.
>>>>
>>>> Moreover, this new behaviour is more predictable with periodic tasks
>>>> with short runtime, which may rarely run during a scheduler tick.
>>>> Now, the work is always scheduled when the task returns to
>>>> userspace.
>>>>
>>>> The work is disabled during mmdrop: since the function cannot sleep
>>>> in all kernel configurations, we cannot wait for possibly running
>>>> work items to terminate. We make sure the mm is valid in case the
>>>> task is terminating by reserving it with mmgrab/mmdrop, returning
>>>> prematurely if we are really the last user while the work gets to
>>>> run.
>>>> This situation is unlikely since we don't schedule the work for
>>>> exiting tasks, but we cannot rule it out.
>>>>
>>>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
>>>> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
>>>> ---
>>> [...]
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 9aecd914ac691..363e51dd25175 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5663,7 +5663,7 @@ void sched_tick(void)
>>>>  		resched_latency = cpu_resched_latency(rq);
>>>>  	calc_global_load_tick(rq);
>>>>  	sched_core_tick(rq);
>>>> -	task_tick_mm_cid(rq, donor);
>>>> +	rseq_preempt(donor);
>>>>  	scx_tick(rq);
>>>>
>>>>  	rq_unlock(rq, &rf);
>>>
>>> There is one tiny but important detail worth discussing here: I
>>> wonder if executing __rseq_handle_notify_resume() on return to
>>> userspace on every scheduler tick will cause noticeable performance
>>> degradation?
>>>
>>> I think we can mitigate the impact if we can quickly compute the
>>> amount of contiguous unpreempted runtime since the last preemption;
>>> then we could use this as a way to only issue rseq_preempt() when
>>> there has been a minimum amount of contiguous unpreempted execution.
>>> Otherwise the rseq_preempt() already issued by preemption is enough.
>>>
>>> I'm not entirely sure how to compute this "unpreempted contiguous
>>> runtime" value within sched_tick() though. Any ideas?
>> I was a bit concerned but, at least from the latency perspective, I
>> didn't see any noticeable difference. This may also depend on the
>> system under test, though.
>
> I see this as an issue for performance-related workloads, not
> specifically for latency: we'd be adding additional rseq notifiers
> triggered by the tick in workloads that are CPU-heavy and would
> otherwise not run it after a tick. And we'd be adding this overhead
> even in scenarios where there are relatively frequent preemptions
> happening, because every tick would end up issuing rseq_preempt().
>
>> We may not need to do that: what we are doing here is improperly
>> calling rseq_preempt. What if we call an rseq_tick which sets a
>> different bit in rseq_event_mask and take that into consideration
>> while running __rseq_handle_notify_resume?
>
> I'm not sure how much it would help. It may reduce the amount of
> work to do, but we'd still be doing additional work at every tick.
>
> See my other email about using
>
> se->sum_exec_runtime - se->prev_sum_exec_runtime
>
> to only do rseq_preempt() when the last preemption happened at least
> a certain amount of consecutive runtime ago. This is a better
> alternative, I think.
>
>> We could follow the periodicity of the mm_cid compaction and, if the
>> rseq event is a tick, only continue if it is time to compact (and we
>> can return this value from task_queue_mm_cid to avoid checking twice).
>
> Note that the mm_cid compaction delay is per-mm, and the fact that we
> want to run __rseq_handle_notify_resume periodically to update the
> mm_cid fields applies to all threads. Therefore, I don't think we can
> use the mm_cid compaction delay (per-mm) for this.
>
Alright, I didn't think of that; I can explore your suggestion. It looks like most of it is already implemented.
What would be a good value for considering that the task has gone long enough without a notify? 100ms, or even less?
I don't think this would deserve a config option.
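For the tick path, I'm thinking of something like this in place of the
unconditional rseq_preempt(donor) in sched_tick() (minimal untested
sketch; the 100ms value and the macro name are just placeholders):

/* Placeholder threshold: how long a task may run unpreempted before the
 * tick forces an rseq notify on return to userspace.
 */
#define RSEQ_UNPREEMPTED_THRESHOLD	(100ULL * NSEC_PER_MSEC)

	/*
	 * Runtime since the donor was last scheduled in: frequently
	 * preempted tasks already get rseq_preempt() at preemption time,
	 * so only notify from the tick after a long unpreempted stretch.
	 */
	if (donor->se.sum_exec_runtime - donor->se.prev_sum_exec_runtime >=
	    RSEQ_UNPREEMPTED_THRESHOLD)
		rseq_preempt(donor);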
>> We would be off by one period (committing the rseq happens before we
>> schedule the next compaction), but it should be acceptable:
>> __rseq_handle_notify_resume()
>> {
>> 	should_queue = task_queue_mm_cid();
>
>> Another doubt about this case: here we are worrying about this
>> hypothetical long-running task, and I'm assuming this can happen only for:
>> 1. isolated cpus with nohz_full and 1 task (the approach wouldn't work)
>
> The prev_sum_exec_runtime approach would work for this case.
>
I mean that in that case nohz_full and isolation would ensure nothing else runs on the core, not even the tick (or perhaps that's also nohz=on). I don't think there's much we can do in such a case, is there? (On that core/context, at least.)
>> or
>> 2. tasks with RT priority mostly starving the cpu
>
> Likewise.
>
>> In 1. I'm not sure the user would really need rseq in the first place,
>
> Not sure, but I'd prefer to keep this option available unless we have a
> strong reason for not being able to support this.
>
>> in 2., assuming nothing like stalld/sched rt throttling is in place, we
>> will probably also never run the kworker doing mm_cid compaction (I'm
>> using the system_wq); for this reason it's probably wiser to use the
>> system_unbound_wq, which, as far as I understand, is the only one that
>> would allow the work to run on any other CPU.
>> I might be missing something trivial here; what do you think, though?
>
> Good point. I suspect using the system_unbound_wq would be preferable
> here, especially given that we're iterating over possible CPUs anyway,
> so I don't expect much gain from running in a system_wq over
> system_unbound_wq. Or am I missing something?
I don't think so; I just picked it as it was easier, but it's probably best to switch.
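Concretely, I'd expect the queueing side to become something like this
(the cid_work field name is only for illustration, it's whatever the
work_struct added to mm_struct ends up being called):

	/*
	 * Queue on the unbound workqueue so the compaction can run on any
	 * housekeeping CPU rather than the one returning to userspace.
	 */
	queue_work(system_unbound_wq, &mm->cid_work);

instead of schedule_work(&mm->cid_work), which targets the per-CPU
system_wq.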
Thanks,
Gabriele