Message-ID: <ed5ff8e242c7abb760f408b4fea9701d9b39d08d.camel@redhat.com>
Date: Fri, 14 Feb 2025 07:44:13 +0100
From: Gabriele Monaco <gmonaco@...hat.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
linux-kernel@...r.kernel.org
Cc: linux-mm@...ck.org, aubrey.li@...ux.intel.com, yu.c.chen@...el.com,
Andrew Morton <akpm@...ux-foundation.org>, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>, "Paul E. McKenney"
<paulmck@...nel.org>, Shuah Khan <shuah@...nel.org>
Subject: Re: [PATCH v6 2/3] sched: Move task_mm_cid_work to mm delayed work
On Thu, 2025-02-13 at 12:31 -0500, Mathieu Desnoyers wrote:
> On 2025-02-13 08:25, Gabriele Monaco wrote:
> > On Thu, 2025-02-13 at 14:52 +0800, kernel test robot wrote:
> > > kernel test robot noticed
> > > "WARNING:at_kernel/workqueue.c:#__queue_delayed_work" on:
> > >
> > > [ 2.640924][ T0] ------------[ cut here ]------------
> > > [ 2.641646][ T0] WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2495 __queue_delayed_work (kernel/workqueue.c:2495 (discriminator 9))
> > > [ 2.642874][ T0] Modules linked in:
> > > [ 2.643381][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc2-00002-g287adf9e9c1f #1
> > > [ 2.644582][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > [ 2.645943][ T0] RIP: 0010:__queue_delayed_work (kernel/workqueue.c:2495 (discriminator 9))
> >
> > There seem to be major problems with this configuration. I'm trying
> > to understand what's wrong but, for the time being, this patchset
> > is not ready for inclusion.
>
> I'm staring at this now, and I'm thinking we could do a simpler
> change that would solve your RT issues without having to introduce
> a dependency on workqueue.c.
>
> So if the culprit is that task_mm_cid_work() runs for too long on
> large many-CPU systems, why not break it up into smaller iterations?
>
> So rather than iterating on "for_each_possible_cpu", we could simply
> break this down into iteration on at most N cpus, so:
>
> tick #1: iteration on CPUs 0 .. N - 1
> tick #2: iteration on CPUs N .. 2*N - 1
> ...
> circling back to 0 when it reaches the number of possible cpus.
>
> This N value could be configurable, e.g. CONFIG_RSEQ_CID_SCAN_BATCH,
> with a sane default. An RT system could decide to make that value
> lower.
>
> Then all we need to do is remember the last observed CPU number in
> the mm struct, so the next tick picks up from there.
>
> The main downside of this approach compared to scheduling delayed
> work in a workqueue is that it depends on having the mm be current
> when the scheduler tick happens. But perhaps this is something we
> could fix in a different way that does not add a dependency on
> workqueue. I'm not sure how though.
>
> Thoughts ?
Mmh, that's indeed neat. What is not so good about this type of task
work is that it's pure latency: it runs before the task gets back to
userspace and can't be interrupted.
The only acceptable latency is a bounded one, and your idea goes in
that direction.
As you mentioned, this will make the compaction of mm_cid even rarer
and will likely make the test in 3/3 fail even more often. I'm not
sure that's necessarily a bad thing, though: since mm_cid compaction
is mainly aesthetic, we could just increase the duration of the test,
or even add a busy loop inside it to make the task more likely to run
the compaction.
I gave this whole thing some thought (don't take it too seriously),
but what I see as essentially flawed in this approach is:
1. task_works are set on the tick
2. task_works are run when returning to userspace
1. is the issue with frequency, and 2. can be mitigated by your idea,
but not fundamentally solved. What if we (also) did:
1+. set this task_work while switching in
2+. run this task_work while switching out to sleep (i.e. not on
preemption)
1+. would make sure all threads have this task_work scheduled at some
point (perhaps a bit too often, but we have a periodicity check in
place). 2+. can effectively run the work at a moment when it is not
problematic for the real-time response: on a preemptible kernel, as
soon as a task with higher priority is ready to run, it preempts the
currently running one, so the fact that current is going to sleep
willingly implies there is no higher-priority task ready, and thus
likely no task that really cares about RT response will run right
after.
Not all tasks are ever going to sleep, though, so we must keep the
original TWA_RESUME notification in the task_work, especially for
long-running or low-priority tasks, both unlikely to be RT tasks.
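
For the record, a rough sketch of what 1+ could look like (the exact
hook point is left open on purpose; t->cid_work, mm->mm_cid_next_scan
and task_work_add() are the pieces the tick path already uses):

/*
 * Sketch: arm the compaction task_work when @t is switched in,
 * keeping the periodicity check so we don't re-arm too often.
 */
static void mm_cid_arm_work(struct task_struct *t)
{
	struct callback_head *work = &t->cid_work;

	if (!t->mm || (t->flags & (PF_EXITING | PF_KTHREAD)))
		return;
	/* Periodicity check: wait for the next scan window. */
	if (time_before(jiffies, READ_ONCE(t->mm->mm_cid_next_scan)))
		return;
	/* work->next == work means the work is not queued yet. */
	if (work->next == work)
		task_work_add(t, work, TWA_RESUME);
}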
I'm going to try a patch with this CONFIG_RSEQ_CID_SCAN_BATCH and
tune the test so that it passes. In the future we can see whether
those ideas make sense and perhaps bring them in.
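
The batched scan could look roughly like this (untested sketch;
mm_cid_scan_cpu would be a new per-mm cursor field, while
sched_mm_cid_remote_clear_old() is the existing helper):

/*
 * Sketch: scan at most CONFIG_RSEQ_CID_SCAN_BATCH CPUs per
 * invocation, resuming from a per-mm cursor and circling back to
 * CPU 0, instead of walking all possible CPUs in one go.
 */
static void task_mm_cid_scan_batch(struct mm_struct *mm)
{
	int i, cpu = READ_ONCE(mm->mm_cid_scan_cpu); /* assumed new field */

	for (i = 0; i < CONFIG_RSEQ_CID_SCAN_BATCH; i++) {
		if (cpu >= nr_cpu_ids)
			cpu = 0;	/* wrap around */
		sched_mm_cid_remote_clear_old(mm, cpu);
		cpu++;
	}
	/* Remember where the next invocation picks up. */
	WRITE_ONCE(mm->mm_cid_scan_cpu, cpu);
}

This would keep each tick's work bounded by the batch size rather
than by the number of possible CPUs, which is what matters for the RT
latency.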
Thanks,
Gabriele