Message-ID: <33c3b258-89e7-4238-965a-15248e997074@efficios.com>
Date: Thu, 20 Feb 2025 16:10:42 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, linux-mm@...ck.org,
Ingo Molnar <mingo@...nel.org>, Shuah Khan <shuah@...nel.org>
Subject: Re: [PATCH v8 1/2] sched: Move task_mm_cid_work to mm work_struct
On 2025-02-20 12:31, Gabriele Monaco wrote:
> 2025-02-20T15:47:26Z Mathieu Desnoyers <mathieu.desnoyers@...icios.com>:
>
>> On 2025-02-20 10:30, Gabriele Monaco wrote:
>>>
>>> On Thu, 2025-02-20 at 09:42 -0500, Mathieu Desnoyers wrote:
>>>> On 2025-02-20 05:26, Gabriele Monaco wrote:
>>>>> Currently, the task_mm_cid_work function is called in a task work
>>>>> triggered by a scheduler tick to frequently compact the mm_cids of
>>>>> each process. This can delay the execution of the corresponding
>>>>> thread for the entire duration of the function, negatively affecting
>>>>> the response time of real-time tasks. In practice, we observe
>>>>> task_mm_cid_work increasing the latency by 30-35us on a 128-core
>>>>> system; this order of magnitude is meaningful under PREEMPT_RT.
>>>>>
>>>>> Run task_mm_cid_work in a new work_struct connected to the
>>>>> mm_struct rather than in the task context before returning to
>>>>> userspace.
>>>>>
>>>>> This work_struct is initialised with the mm and disabled before
>>>>> freeing it. The work is queued while returning to userspace in
>>>>> __rseq_handle_notify_resume, keeping the checks that avoid running
>>>>> it more frequently than MM_CID_SCAN_DELAY.
>>>>> To make sure this also happens predictably for long-running tasks,
>>>>> we additionally trigger a call to __rseq_handle_notify_resume from
>>>>> the scheduler tick (which in turn will also schedule the work item).
>>>>>
>>>>> The main advantage of this change is that the function can be
>>>>> offloaded to a different CPU and even preempted by RT tasks.
>>>>>
>>>>> Moreover, this new behaviour is more predictable with periodic
>>>>> tasks with short runtime, which may rarely run during a scheduler
>>>>> tick. Now, the work is always scheduled when the task returns to
>>>>> userspace.
>>>>>
>>>>> The work is disabled during mmdrop: since the function cannot sleep
>>>>> in all kernel configurations, we cannot wait for possibly running
>>>>> work items to terminate. We make sure the mm stays valid in case
>>>>> the task is terminating by reserving it with mmgrab/mmdrop,
>>>>> returning early if we are really the last user while the work gets
>>>>> to run.
>>>>> This situation is unlikely since we don't schedule the work for
>>>>> exiting tasks, but we cannot rule it out.
>>>>>
>>>>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
>>>>> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
>>>>> ---
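The mmgrab/mmdrop guard described above would look roughly like this
(sketch only; the cid_work field name and the exact "last user" check
are placeholders for illustration, not necessarily what the patch does):

static void task_mm_cid_work(struct work_struct *work)
{
	struct mm_struct *mm = container_of(work, struct mm_struct, cid_work);

	/*
	 * The queueing side pinned the mm with mmgrab(), so the structure
	 * itself cannot go away under us. If no users remain, the threads
	 * are gone and there is nothing left worth compacting.
	 */
	if (!atomic_read(&mm->mm_users))
		goto out_drop;

	/* ... scan and compact the per-CPU mm_cid values here ... */

out_drop:
	mmdrop(mm);
}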
>>>> [...]
>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>> index 9aecd914ac691..363e51dd25175 100644
>>>>> --- a/kernel/sched/core.c
>>>>> +++ b/kernel/sched/core.c
>>>>> @@ -5663,7 +5663,7 @@ void sched_tick(void)
>>>>> resched_latency = cpu_resched_latency(rq);
>>>>> calc_global_load_tick(rq);
>>>>> sched_core_tick(rq);
>>>>> - task_tick_mm_cid(rq, donor);
>>>>> + rseq_preempt(donor);
>>>>> scx_tick(rq);
>>>>> rq_unlock(rq, &rf);
>>>>
>>>> There is one small but important detail worth discussing here: I
>>>> wonder if executing __rseq_handle_notify_resume() on return to
>>>> userspace on every scheduler tick will cause noticeable performance
>>>> degradation?
>>>>
>>>> I think we can mitigate the impact if we can quickly compute the
>>>> amount of contiguous unpreempted runtime since the last preemption;
>>>> then we could use this to only issue rseq_preempt() when there has
>>>> been a minimum amount of contiguous unpreempted execution. Otherwise
>>>> the rseq_preempt() already issued by preemption is enough.
>>>>
>>>> I'm not entirely sure how to compute this "unpreempted contiguous
>>>> runtime" value within sched_tick() though, any ideas?
>>> I was a bit concerned but, at least from the latency perspective, I
>>> didn't see any noticeable difference. This may also depend on the
>>> system under test, though.
>>
>> I see this as an issue for performance-related workloads, not
>> specifically for latency: we'd be adding additional rseq notifiers
>> triggered by the tick in workloads that are CPU-heavy and would
>> otherwise not run it after a tick. And we'd be adding this overhead
>> even in scenarios with relatively frequent preemptions, because every
>> tick would end up issuing rseq_preempt().
>>
>>> We may not need to do that; what we are doing here is improperly
>>> calling rseq_preempt. What if we call an rseq_tick which sets a
>>> different bit in rseq_event_mask and take that into consideration while
>>> running __rseq_handle_notify_resume?
>>
>> I'm not sure how much it would help. It may reduce the amount of
>> work to do, but we'd still be doing additional work at every tick.
>>
>> See my other email about using
>>
>> se->sum_exec_runtime - se->prev_sum_exec_runtime
>>
>> to only do rseq_preempt() when the last preemption happened at least a
>> certain amount of consecutive runtime ago. I think this is a better
>> alternative.
>>
>>> We could follow the periodicity of the mm_cid compaction and, if the
>>> rseq event is a tick, only continue if it is time to compact (and we
>>> can return this value from task_queue_mm_cid to avoid checking twice).
>>
>> Note that the mm_cid compaction delay is per-mm, and the fact that we
>> want to run __rseq_handle_notify_resume periodically to update the
>> mm_cid fields applies to all threads. Therefore, I don't think we can
>> use the mm_cid compaction delay (per-mm) for this.
>>
>
> Alright, I didn't think of that; I can explore your suggestion. Looks like most of it is already implemented.
> What would be a good value after which to consider that the notify has waited long enough? 100ms or even less?
> I don't think this would deserve a config option.
I'd go with 100ms initially, and adjust if need be.
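Something along these lines is what I have in mind (completely untested
sketch; the threshold name is made up and the hook placement may need
tweaking):

#define RSEQ_UNPREEMPTED_THRESHOLD	(100ULL * NSEC_PER_MSEC)

/*
 * Called from sched_tick(): only force a notify-resume (which will also
 * queue the mm_cid compaction work) when the task has run unpreempted
 * for at least RSEQ_UNPREEMPTED_THRESHOLD. Otherwise the rseq_preempt()
 * issued by the last preemption is recent enough.
 */
static inline void rseq_tick(struct task_struct *t)
{
	struct sched_entity *se = &t->se;

	if (se->sum_exec_runtime - se->prev_sum_exec_runtime <
	    RSEQ_UNPREEMPTED_THRESHOLD)
		return;
	rseq_preempt(t);
}

so sched_tick() would end up calling rseq_tick(donor) rather than an
unconditional rseq_preempt(donor).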
>
>>> We would be off by one period (the rseq commit happens before we
>>> schedule the next compaction), but it should be acceptable:
>>> __rseq_handle_notify_resume()
>>> {
>>> should_queue = task_queue_mm_cid();
>>
>>> Another doubt about this case: here we are worrying about this
>>> hypothetical long-running task, and I'm assuming this can happen only for:
>>> 1. isolated CPUs with nohz_full and 1 task (the approach wouldn't work)
>>
>> The prev_sum_exec_runtime approach would work for this case.
>>
>
> I mean, in that case nohz_full and isolation would ensure nothing else runs on the core, not even the tick (or perhaps that's also nohz=on). I don't think there's much we can do in such a case, is there? (On that core/context at least.)
In the case of nohz_full without tick, the goal is to have pretty much no
kernel involvement. So if userspace depends on the kernel to update its
rseq mm_cid fields and that does not happen, well, too bad: userspace gets
what it asked for. Having less-compact mm_cid values than there should be
in that corner case is not an issue I think we need to deal with.
E.g. it's similar to missing scheduler stats bookkeeping with the tick
disabled. I don't think userspace should expect precise stats in that
nohz_full-without-tick scenario.
>
>>> or
>>> 2. tasks with RT priority mostly starving the cpu
>>
>> Likewise.
>>
>>> In 1. I'm not sure the user would really need rseq in the first place,
>>
>> Not sure, but I'd prefer to keep this option available unless we have a
>> strong reason for not being able to support this.
>>
>>> in 2., assuming nothing like stalld/sched RT throttling is in place, we
>>> will probably also never run the kworker doing mm_cid compaction (I'm
>>> using system_wq). For this reason it's probably wiser to use
>>> system_unbound_wq, which as far as I could understand is the only one
>>> that would allow the work to run on any other CPU.
>>> I might be missing something trivial here; what do you think?
>>
>> Good point. I suspect using system_unbound_wq would be preferable
>> here, especially given that we're iterating over possible CPUs anyway,
>> so I don't expect much gain from running on system_wq over
>> system_unbound_wq. Or am I missing something?
>
> I don't think so; I just picked it as it was easier, but it's probably best to switch.
OK,
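For reference, the change would be roughly the following wherever the
work gets queued (sketch; &mm->cid_work is a placeholder for whatever
the work item ends up being called in the patch):

	/*
	 * Queue on the unbound workqueue so the compaction can run on any
	 * CPU rather than in the per-CPU pool of the queueing CPU.
	 */
	queue_work(system_unbound_wq, &mm->cid_work);

instead of schedule_work(&mm->cid_work).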
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com