linux-kernel - Re: [PATCH v13 2/3] sched: Move task_mm_cid_work to mm work


Message-ID: <5ebfd0be3a475583e53eebe2fe8d0a729cbb0343.camel@redhat.com>
Date: Thu, 08 May 2025 11:11:50 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Ingo Molnar
	 <mingo@...hat.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v13 2/3] sched: Move task_mm_cid_work to mm work_struct



On Mon, 2025-04-14 at 11:28 -0400, Mathieu Desnoyers wrote:
> On 2025-04-14 08:36, Gabriele Monaco wrote:
> > Currently, the task_mm_cid_work function is called in a task work
> > triggered by a scheduler tick to frequently compact the mm_cids of
> > each process. This can delay the execution of the corresponding
> > thread for the entire duration of the function, negatively affecting
> > the response time of real-time tasks. In practice, we observe
> > task_mm_cid_work increasing latency by 30-35us on a 128-core system;
> > this order of magnitude is meaningful under PREEMPT_RT.
> > 
> > Run the task_mm_cid_work in a new work_struct connected to the
> > mm_struct rather than in the task context before returning to
> > userspace.
> > 
> > This work_struct is initialised with the mm and disabled before
> > freeing it. The queuing of the work happens while returning to
> > userspace in __rseq_handle_notify_resume, maintaining the checks to
> > avoid running more frequently than MM_CID_SCAN_DELAY.
> > To make sure this happens predictably also on long running tasks, we
> > trigger a call to __rseq_handle_notify_resume also from the
> > scheduler tick if the runtime exceeded a 100ms threshold.
> > 
> > The main advantage of this change is that the function can be
> > offloaded to a different CPU and even preempted by RT tasks.
> > 
> > Moreover, this new behaviour is more predictable with periodic tasks
> > with short runtime, which may rarely run during a scheduler tick.
> > Now, the work is always scheduled when the task returns to
> > userspace.
> > 
> > The work is disabled during mmdrop. Since that function cannot sleep
> > in all kernel configurations, we cannot wait for possibly running
> > work items to terminate. We make sure the mm is valid in case the
> > task is terminating by reserving it with mmgrab/mmdrop, returning
> > prematurely if we are really the last user when the work gets to
> > run. This situation is unlikely since we don't schedule the work for
> > exiting tasks, but we cannot rule it out.
> 
> The implementation looks good to me. Peter, how does it look from
> your end?
> 
> Thanks,
> 
> Mathieu
> 
> 

Gentle ping.

Peter, have you had time to look at this patch?

Thanks,
Gabriele
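
The queuing rule described in the quoted commit message (queue the work
on return to userspace, but not more often than MM_CID_SCAN_DELAY, and
have the tick force a notify-resume once the runtime exceeds 100ms) can
be pictured with a minimal userspace sketch. This is not the patch
itself: struct mm_model, try_queue_scan(), tick_should_notify() and the
two constants are hypothetical names chosen for illustration, the 100ms
scan delay is an assumed value, and a counter stands in for actually
queuing a work item.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; the patch uses MM_CID_SCAN_DELAY
 * and a 100ms runtime threshold. */
#define SCAN_DELAY_MS        100
#define RUNTIME_THRESHOLD_MS 100

/* Hypothetical stand-in for the per-mm state the work is attached to. */
struct mm_model {
	_Atomic uint64_t next_scan_ms; /* earliest time the next scan may be queued */
	unsigned int queued;           /* counts how often the work was queued */
};

/*
 * Models the check done when returning to userspace: queue the scan only
 * if the delay has elapsed, and let exactly one caller win the update of
 * the deadline so concurrent threads of the same mm do not queue twice.
 */
static bool try_queue_scan(struct mm_model *mm, uint64_t now_ms)
{
	uint64_t next = atomic_load(&mm->next_scan_ms);

	if (now_ms < next)
		return false; /* too early, skip */

	if (!atomic_compare_exchange_strong(&mm->next_scan_ms, &next,
					    now_ms + SCAN_DELAY_MS))
		return false; /* another thread advanced the deadline first */

	mm->queued++; /* stands in for queuing the work item */
	return true;
}

/*
 * Models the tick-side check: a task that has been running longer than
 * the threshold without returning to userspace gets a notify-resume.
 */
static bool tick_should_notify(uint64_t runtime_ms)
{
	return runtime_ms > RUNTIME_THRESHOLD_MS;
}
```

In this model, calls at 0ms, 50ms and 100ms on a zero-initialised
struct mm_model queue the work at 0ms and 100ms and skip the call at
50ms, which is the rate limiting the commit message describes.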