Message-Id: <20251228164956.3669600-1-jackzxcui1989@163.com>
Date: Mon, 29 Dec 2025 00:49:56 +0800
From: Xin Zhao <jackzxcui1989@....com>
To: jackzxcui1989@....com,
tj@...nel.org
Cc: hch@...radead.org,
jiangshanlai@...il.com,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] workqueue: Add configure to reduce work latency
Dear Tejun,
On Sat, 6 Dec 2025 12:33:57 +0800 Xin Zhao <jackzxcui1989@....com> wrote:
> On Fri, 5 Dec 2025 07:47:40 -1000 Tejun Heo <tj@...nel.org> wrote:
> > On Fri, Dec 05, 2025 at 08:54:42PM +0800, Xin Zhao wrote:
> > > In a system with high real-time requirements, we have noticed that many
> > > high-priority tasks, such as kernel threads responsible for dispatching
> > > GPU tasks and receiving data from data sources, often experience latency
> > > spikes because work items are not executed in a sufficiently timely manner.
> > > The existing sysfs interface can adjust the nice value of unbound
> > > workqueues. Add a new 'policy' node to support three common policies:
> > > SCHED_NORMAL, SCHED_FIFO and SCHED_RR. The original 'nice' node is
> > > retained for compatibility; add a new 'rtprio' node to adjust the
> > > real-time priority when 'policy' is SCHED_FIFO or SCHED_RR. The value of
> > > 'rtprio' has the same numerical meaning as in the user-space tool chrt.
> > > Introduce a new attribute 'nr_idle_extra', which allows user space to
> > > configure unbound workqueues through sysfs according to their real-time
> > > requirements. By default, workqueues created by the system set
> > > 'nr_idle_extra' to 0. When the policy of a workqueue is set to SCHED_FIFO
> > > or SCHED_RR via sysfs, 'nr_idle_extra' defaults to WORKER_NR_RT_DEF (2).
> > > Supporting the 'private' configuration aims to deterministically ensure
> > > that work items in one workqueue are not affected by work items from
> > > other workqueues with the same attributes. If users have high real-time
> > > requirements, they can increase nr_idle_extra (added in the previous
> > > patch) while also marking the workqueue 'private', allowing it to use
> > > kworker threads independently and avoiding scheduling-related work delays.
> >
> > I don't think I'm applying this:
> >
> > - The rationale is too vague. What are you exactly running and observing?
> > How does this improve the situation?
> >
> > - If wq supports private pools, then I don't think it makes sense to add wq
> > interface to change their attributes. Once turned private, the worker
> > threads are fixed and userspace can set whatever attributes they want to
> > set, no?
>
>
> Our system runs intelligent driving workloads, which have explicit and
> stringent real-time requirements. This is why I developed this patch set:
> 1. Data acquisition quality inspection relies on deterministic processing of
> UART IMU data, which must stay within a specified latency range; otherwise,
> some topic data is observed with higher latency. As you know, I have already
> proposed a patch for the TTY flip buffer to improve this situation. Recent
> tests show that even with the workqueue's nice value set to -20, after
> long-term operation, 2% of the quality-inspection entries still show
> anomalies.
> 2. GPU model processing must sustain at least 20 frames per second, so the
> time budget for dispatching and running a GPU task is within 10ms. Excluding
> the execution time of the GPU itself, the remaining budget from queueing the
> dispatch work to its actual execution is only about 1ms. Although there are
> not many tasks with high real-time requirements on the system, there are
> still some. With ordinary CFS kworkers, captured perfetto traces show that,
> due to untimely scheduling of kworker/u37, the kernel submit cost from
> dispatching the work to its actual execution often exceeds 20ms.
> 3. The workqueue API is the most commonly used programming interface for
> task processing in kernel drivers; the GPU driver and TTY driver where we
> encounter these issues use it as well. Switching to kthread_work would mean
> changing the drivers' logic and retesting them, while adding functionality
> to the existing workqueue API only requires testing the workqueue itself.
> Additionally, the workqueue API has mature and sophisticated logic, and its
> pool management saves system resources. I believe that providing real-time
> capabilities on top of the current workqueue API is the better choice.
>
>
> Assuming we need to provide real-time capabilities based on the workqueue API,
> let's discuss how to implement this:
> 1. Regarding your point that the wq interface is no longer needed once wq
> supports private pools: kworker threads in a worker pool are created and
> released dynamically, and concurrently enqueuing multiple work items in a
> workqueue may trigger the creation of new threads. After a user sets the
> scheduling attributes of the kworker threads that currently belong to a
> private wq, newly created kworker threads will not automatically inherit
> those attributes, because their parent process is kthreadd.
> 2. In the commit log of the nr_idle_extra patch, I described two common
> types of latency:
> Type 1: need_more_worker() checks whether pool->nr_running is zero; if it
> is not zero, no idle kworker thread is woken up to execute the work
> immediately, resulting in work execution latency.
> Type 2: need_more_worker() has found pool->nr_running to be zero, but there
> is currently no idle kworker thread, so one must first be created, leading
> to work execution latency.
> The nr_idle_extra feature is intended to let users optionally reduce
> execution latency according to their real-time requirements.
>
>
> As for the test results of this patch set, I enabled the patches this week
> and am running performance and stability tests. I will share the results
> once testing is complete.
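
To make the two latency types quoted above more concrete, here is a minimal
user-space model of the wake-up decision. This is only an illustrative sketch
under my own simplifications, not the actual kernel code; the fields merely
mimic the pending work list, pool->nr_running and the idle worker list.

/* Illustrative user-space model of the two latency types; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct pool_state {
	int pending_works;	/* work items waiting on the pool's worklist */
	int nr_running;		/* concurrency-managed workers currently running */
	int nr_idle;		/* idle kworkers available to be woken */
};

static bool work_runs_immediately(const struct pool_state *p)
{
	if (p->pending_works && p->nr_running)	/* Type 1: no wake-up while nr_running != 0 */
		return false;
	if (!p->nr_idle)			/* Type 2: no idle kworker, one must be created */
		return false;
	return true;
}

int main(void)
{
	struct pool_state type1 = { .pending_works = 1, .nr_running = 1, .nr_idle = 2 };
	struct pool_state type2 = { .pending_works = 1, .nr_running = 0, .nr_idle = 0 };

	printf("Type 1 scenario runs immediately: %d\n", work_runs_immediately(&type1));
	printf("Type 2 scenario runs immediately: %d\n", work_runs_immediately(&type2));
	return 0;
}
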
My RT workqueue patches have been running stably on our autonomous driving
system for over two weeks. Previously, there was a 2% data loss in the data
collection scenario due to kworker scheduling delays, which showed up as
duplicated IMU timestamps. After applying the patches, setting the scheduling
policy of the tty flip buffer workqueue to SCHED_FIFO priority 20, and binding
it to the relevant CPUs (CPU0 and CPU1), we no longer see any duplicated IMU
timestamps. In the GPU scenario, applying the patches reduced the kernel
submit cost from spikes of up to 40ms to a stable 7ms, which is a significant
benefit.
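
For reference, the sketch below shows how such a configuration could be driven
from user space through the proposed sysfs nodes. The 'policy' and 'rtprio'
node names come from the cover letter and 'cpumask' is the existing
per-workqueue node, but the workqueue name, the accepted value formats and the
exact paths are assumptions for illustration only.

/* Hypothetical sketch: configure an unbound workqueue through the proposed
 * sysfs nodes. The workqueue name and value formats are assumptions. */
#include <stdio.h>

static int wq_sysfs_write(const char *wq, const char *node, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/virtual/workqueue/%s/%s", wq, node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* SCHED_FIFO, priority 20, bound to CPU0-CPU1 (mask 0x3), roughly
	 * matching the test setup described above. */
	wq_sysfs_write("tty_flip", "policy", "fifo");	/* hypothetical wq name and value format */
	wq_sysfs_write("tty_flip", "rtprio", "20");
	wq_sysfs_write("tty_flip", "cpumask", "3");
	return 0;
}
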
I have enabled this patch in our project, and I believe this is a fairly
common problem: other scenarios may likewise need similar user-space
configuration of kworkers. The kernel I am using is the 6.1 RT Linux kernel.
Additionally, I saw your earlier commit 636b927eba5bc633753f8eb80f35e1d5be806e51,
"workqueue: Make unbound workqueues to use per-cpu pool_workqueues", and found
another potentially common issue: once this commit takes effect, controlling
max_active for the workqueue as a whole requires using the
alloc_ordered_workqueue macro. However, once alloc_ordered_workqueue is used,
sysfs settings can no longer be applied.
On our 6.1 kernel I could still work around this on our non-NUMA system by
replacing alloc_ordered_workqueue with alloc_workqueue using WQ_UNBOUND and
max_active set to 1. However, in newer kernel versions, if an unbound
workqueue is not created with alloc_ordered_workqueue, max_active is applied
per CPU by default, which differs from the traditional per-node understanding
of max_active. In other words, on the latest kernels I cannot set the
CPU-binding properties of a system-wide ordered workqueue through sysfs. We
feel it would be worthwhile to allow ordered workqueues to register sysfs
nodes and have their related attributes adjusted.
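
To illustrate the trade-off, below is a rough sketch of the two creation paths
as I understand them. The workqueue names are made up for the example, and
whether the WQ_UNBOUND/max_active=1 form still behaves as ordered on recent
kernels is exactly the per-CPU vs. per-node question above.

/* Illustrative sketch of the two creation paths; names are made up. */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *ordered_wq;
static struct workqueue_struct *tunable_wq;

static int __init wq_example_init(void)
{
	/* Strictly ordered (one work item at a time for the whole wq),
	 * but not adjustable through sysfs. */
	ordered_wq = alloc_ordered_workqueue("example_ordered", 0);

	/* Unbound with max_active = 1 and WQ_SYSFS, so userspace can tune
	 * nice/cpumask (and, with the proposed patches, policy/rtprio). */
	tunable_wq = alloc_workqueue("example_tunable",
				     WQ_UNBOUND | WQ_SYSFS, 1);

	if (!ordered_wq || !tunable_wq) {
		if (ordered_wq)
			destroy_workqueue(ordered_wq);
		if (tunable_wq)
			destroy_workqueue(tunable_wq);
		return -ENOMEM;
	}
	return 0;
}

static void __exit wq_example_exit(void)
{
	destroy_workqueue(tunable_wq);
	destroy_workqueue(ordered_wq);
}

module_init(wq_example_init);
module_exit(wq_example_exit);
MODULE_LICENSE("GPL");

With a change along the lines proposed here, the first form could also expose
sysfs nodes, so ordering and sysfs tunability would no longer be mutually
exclusive.
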
Furthermore, I noticed that many comments in the current code still describe
an unbound workqueue's max_active as applying system-wide. I believe these
comments also need to be updated, for example the @max_active description for
alloc_workqueue() in workqueue.h.
--
Xin Zhao