linux-kernel - Re: [PATCH v7 1/2] sched: Move task_mm_cid_work to mm work

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0493b3c4-c37f-4ddd-93ee-6d7946e42846@efficios.com>
Date: Wed, 19 Feb 2025 12:08:35 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, linux-mm@...ck.org
Cc: Ingo Molnar <mingo@...nel.org>, Shuah Khan <shuah@...nel.org>
Subject: Re: [PATCH v7 1/2] sched: Move task_mm_cid_work to mm work_struct

On 2025-02-19 11:32, Gabriele Monaco wrote:
> 
> 
> On Wed, 2025-02-19 at 10:13 -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-19 06:31, Gabriele Monaco wrote:
>>>>> Currently, the task_mm_cid_work function is called in a task work
>>>>> triggered by a scheduler tick to frequently compact the mm_cids of
>>>>> each
>>>>> process. This can delay the execution of the corresponding thread
>>>>> for
>>>>> the entire duration of the function, negatively affecting the
>>>>> response
>>>>> in case of real time tasks. In practice, we observe
>>>>> task_mm_cid_work
>>>>> increasing the latency of 30-35us on a 128 cores system, this order
>>>>> of
>>>>> magnitude is meaningful under PREEMPT_RT.
>>>>>
>>>>> Run the task_mm_cid_work in a new work_struct connected to the
>>>>> mm_struct rather than in the task context before returning to
>>>>> userspace.
>>>>>
>>>>> This work_struct is initialised with the mm and disabled before
>>>>> freeing
>>>>> it. Its execution is no longer triggered by scheduler ticks: the
>>>>> queuing
>>>>> of the work happens while returning to userspace in
>>>>> __rseq_handle_notify_resume, maintaining the checks to avoid
>>>>> running
>>>>> more frequently than MM_CID_SCAN_DELAY.
>>>>>
>>>>> The main advantage of this change is that the function can be
>>>>> offloaded
>>>>> to a different CPU and even preempted by RT tasks.
>>>>>
>>>>> Moreover, this new behaviour is more predictable with periodic
>>>>> tasks
>>>>> with short runtime, which may rarely run during a scheduler tick.
>>>>> Now, the work is always scheduled when the task returns to
>>>>> userspace.
>>>>>
>>>>> The work is disabled during mmdrop, since the function cannot sleep
>>>>> in
>>>>> all kernel configurations, we cannot wait for possibly running work
>>>>> items to terminate. We make sure the mm is valid in case the task
>>>>> is
>>>>> terminating by reserving it with mmgrab/mmdrop, returning
>>>>> prematurely if
>>>>> we are really the last user before mmgrab.
>>>>> This situation is unlikely since we don't schedule the work for
>>>>> exiting
>>>>> tasks, but we cannot rule it out.
>>>>>
>>>>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced
>>>>> by mm_cid")
>>>>> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
>>>>> ---
>>>>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>>>>> index 442aba29bc4cf..f8394ebbb6f4d 100644
>>>>> --- a/kernel/rseq.c
>>>>> +++ b/kernel/rseq.c
>>>>> @@ -419,6 +419,7 @@ void __rseq_handle_notify_resume(struct ksignal
>>>>> *ksig, struct pt_regs *regs)
>>>>>     }
>>>>>     if (unlikely(rseq_update_cpu_node_id(t)))
>>>>>     goto error;
>>>>> + task_queue_mm_cid(t);
>>>>>     return;
>>>>>   
>>>>>    error:
>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>> index 9aecd914ac691..ee35f9962444b 100644
>>>>> --- a/kernel/sched/core.c
>>>>> +++ b/kernel/sched/core.c
>>>>> @@ -5663,7 +5663,6 @@ void sched_tick(void)
>>>>>     resched_latency = cpu_resched_latency(rq);
>>>>>     calc_global_load_tick(rq);
>>>>>     sched_core_tick(rq);
>>>>> - task_tick_mm_cid(rq, donor);
>>>
>>> I agree that this approach is promising, however I am concerned about
>>> the fact that a task running alone on its runqueue (thus without
>>> preemption) for a long time will never recompact mm_cid, and also
>>> will never update its mm_cid field.
>>>
>>> So I am tempted to insert this in the sched_tick to cover that
>>> scenario:
>>>
>>>        rseq_preempt(current);
>>>
>>> This would ensure that the task runs __rseq_handle_notify_resume() at
>>> least each tick.
>>>
> 
> Right, I thought about this scenario but forgot to add it in the final patch.
> We could have a test doing that: instead of sleeping, the task busy waits.
> 
> Does __rseq_handle_notify_resume need to run in this case, besides for the cid compaction, I mean? Otherwise we could again just enqueu
> the work from there.

Yes we need to do both:

- compact cid,
- run __rseq_handle_notify_resume to update the mm_cid.

We we don't care much if compacting the cid is done at some point
and __rseq_handle_notify_resume only updates the mm_cid field on
the following tick.

So enqueuing the work is not sufficient there, I would really
issue rseq_preempt(current) which makes sure a busy thread both
triggers cid compaction *and* gets its mm_cid updated.

> 
> I'll give a shot for both.
> 
> 
>>>>>     scx_tick(rq);
>>>>>   
>>>>>     rq_unlock(rq, &rf);
>>>>> @@ -10530,22 +10529,16 @@ static void
>>>>> sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
>>>>>     sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
>>>>>    }
>>>>>   
>>>>> -static void task_mm_cid_work(struct callback_head *work)
>>>>> +void task_mm_cid_work(struct work_struct *work)
>>>>>    {
>>>>>     unsigned long now = jiffies, old_scan, next_scan;
>>>>> - struct task_struct *t = current;
>>>>>     struct cpumask *cidmask;
>>>>> - struct mm_struct *mm;
>>>>> + struct mm_struct *mm = container_of(work, struct mm_struct,
>>>>> cid_work);
>>>>>     int weight, cpu;
>>>>>   
>>>>> - SCHED_WARN_ON(t != container_of(work, struct task_struct,
>>>>> cid_work));
>>>>> -
>>>>> - work->next = work; /* Prevent double-add */
>>>>> - if (t->flags & PF_EXITING)
>>>>> - return;
>>>>> - mm = t->mm;
>>>>> - if (!mm)
>>>>> + if (!atomic_read(&mm->mm_count))
>>>>>     return;
>>>>> + mmgrab(mm);
>>>
>>> AFAIU this is racy with respect to re-use of mm struct.
>>>
>>> I recommend that you move mmgrab() to task_queue_mm_cid() just before
>>> invoking schedule_work. That way you ensure that the mm count never
>>> reaches 0 while there is work in flight (and therefore guarantee that
>>> the mm is not re-used).
>>>
> 
> 
> Mmh good point, in that case I think we can still keep on testing the mm_count and return prematurely if it's 1 (we are the only user and the task exited before the work got scheduled).
> That would be a safe assumption if we don't get to 0, wouldn't it?

Yes, although don't forget the mmdrop in that case ;)

Thanks,

Mathieu


> 
> Thanks,
> Gabriele
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com