Message-Id: <20231227145143.2399-7-jiangshanlai@gmail.com>
Date: Wed, 27 Dec 2023 22:51:42 +0800
From: Lai Jiangshan <jiangshanlai@...il.com>
To: linux-kernel@...r.kernel.org
Cc: Tejun Heo <tj@...nel.org>, Naohiro.Aota@....com,
	Lai Jiangshan <jiangshan.ljs@...group.com>,
	Lai Jiangshan <jiangshanlai@...il.com>,
	Dennis Dalessandro <dennis.dalessandro@...nelisnetworks.com>
Subject: [PATCH 6/7] workqueue: Implement system-wide max_active enforcement for unbound workqueues

From: Tejun Heo <tj@...nel.org>

A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
per-cpu workqueues and one per NUMA node for unbound workqueues, which was a
natural result of per-cpu workqueues being served by per-cpu pools and
unbound ones by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great
in that NUMA machines would get a max_active multiplied by the number of
nodes, but it didn't cause huge problems because NUMA machines are
relatively rare and the node count is usually pretty low.

However, cache layouts are more complex now and sharing a pwq across a whole
node didn't really work well for unbound workqueues. Thus, a series of
commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, it
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.

636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast
majority; however, there are enough use cases which actually depend on
max_active to prevent the level of concurrency from going bonkers, including
several IO handling workqueues that can issue a work item for each in-flight
IO. With targeted benchmarks, the misbehavior can easily be exposed, as
reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A single CPU may issue most of the in-flight IOs, so we
don't want to set max_active too low, but as soon as we increase max_active
a bit, we can end up with an unreasonable number of in-flight work items
when many CPUs issue IOs at the same time. i.e. the acceptable lowest
max_active is higher than the acceptable highest max_active.

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout.
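As a rough, hypothetical illustration of the difference: an unbound
workqueue created with max_active = 16 on a 2-node, 128-CPU machine used to
be capped at 16 * 2 = 32 in-flight work items in total, while with per-cpu
pwqs the same setting allows up to 16 * 128 = 2048.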
The reasons workqueue hasn't implemented that yet are:

- Once max_active enforcement is decoupled from pool boundaries, chaining
  execution after a work item finishes requires inter-pool operations which
  would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty
  expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
  per-node pools.

Instead of enforcing max_active system-wide across all PWQs, this patch
distributes max_active among pods, building on a previous patch that changes
per-cpu PWQs to per-pod PWQs. With per-pod PWQs, max_active is distributed to
each PWQ based on the proportion of online CPUs in the PWQ to the total
number of online CPUs in the system.

- Using per-pod PWQ max_active enforcement avoids sharing a single counter
  across multiple worker_pools and avoids complicating the locking mechanism.

- Workqueue used to be able to process a chain of interdependent work items
  as long as max_active. We can't do this anymore as max_active is
  distributed across the pods. Instead, a new parameter min_active is
  introduced which determines the minimum level of concurrency within a pod
  regardless of how the max_active distribution comes out. It is set to the
  smaller of max_active and WQ_DFL_MIN_ACTIVE, which is 8. This can lead to a
  higher effective max_active than configured and also to deadlocks if a
  workqueue was depending on being able to handle chains of interdependent
  work items longer than 8. If either case happens, we'll need to add an
  interface to adjust min_active, and users are required to adjust affinity
  manually.

A higher effective max_active than configured can happen due to:

- uninstalled PWQs: they will be gone once they have finished all their
  pending work items.

- the default PWQ: it is normally dormant unless it is the sole active PWQ.

- rounding up in the division: it can raise the effective max_active above
  the configured value by at most nr_pods-1.

- clamping up to min_active: it can raise the effective max_active to at
  least min_active*nr_pods.

Signed-off-by: Tejun Heo <tj@...nel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@....com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@...group.com>
---
 include/linux/workqueue.h | 34 +++++++++++++++++++++++++++++++---
 kernel/workqueue.c        | 28 ++++++++++++++++++++++++----
 2 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 24b1e5070f4d..4ba2554f71a2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -405,6 +405,13 @@ enum {
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
+
+	/*
+	 * Per-PWQ default cap on min_active. Unless explicitly set, min_active
+	 * is set to min(max_active, WQ_DFL_MIN_ACTIVE). For more details, see
+	 * workqueue_struct->min_active definition.
+	 */
+	WQ_DFL_MIN_ACTIVE	= 8,
 };

 /*
@@ -447,11 +454,32 @@ extern struct workqueue_struct *system_freezable_power_efficient_wq;
 * alloc_workqueue - allocate a workqueue
 * @fmt: printf format for the name of the workqueue
 * @flags: WQ_* flags
- * @max_active: max in-flight work items per CPU, 0 for default
+ * @max_active: max in-flight work items, 0 for default
 * remaining args: args for @fmt
 *
- * Allocate a workqueue with the specified parameters. For detailed
- * information on WQ_* flags, please refer to
+ * For a per-cpu workqueue, @max_active limits the number of in-flight work
+ * items for each CPU. e.g. @max_active of 1 indicates that each CPU can be
+ * executing at most one work item for the workqueue.
+ *
+ * For unbound workqueues, @max_active limits the number of in-flight work items
+ * for the whole system. e.g. @max_active of 16 indicates that there can be
+ * at most 16 work items executing for the workqueue in the whole system.
+ *
+ * As sharing the same active counter for an unbound workqueue across multiple
+ * PWQs can be expensive, @max_active is distributed to each PWQ according
+ * to the proportion of the number of online CPUs and enforced independently.
+ *
+ * Depending on online CPU distribution, a PWQ may end up with assigned
+ * max_active which is significantly lower than @max_active, which can lead to
+ * deadlocks if the concurrency limit is lower than the maximum number
+ * of interdependent work items for the workqueue.
+ *
+ * To guarantee forward progress regardless of online CPU distribution, the
+ * concurrency limit on every PWQ is guaranteed to be equal to or greater than
+ * min_active which is set to min(@max_active, %WQ_DFL_MIN_ACTIVE). This means
+ * that the sum of per-PWQ max_active's may be larger than @max_active.
+ *
+ * For detailed information on %WQ_* flags, please refer to
 * Documentation/core-api/workqueue.rst.
 *
 * RETURNS:
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d1c671597289..382c53f89cb4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -298,7 +298,8 @@ struct workqueue_struct {
 	struct worker		*rescuer;	/* MD: rescue worker */

 	int			nr_drainers;	/* WQ: drain in progress */
-	int			saved_max_active; /* WQ: saved pwq max_active */
+	int			saved_max_active; /* WQ: saved max_active */
+	int			min_active;	/* WQ: pwq min_active */

 	struct workqueue_attrs	*unbound_attrs;	/* PW: only for unbound wqs */
 	struct pool_workqueue	*dfl_pwq;	/* PW: only for unbound wqs */
@@ -4140,10 +4141,15 @@ static void pwq_release_workfn(struct kthread_work *work)
 * pwq_calculate_max_active - Determine max_active to use
 * @pwq: pool_workqueue of interest
 *
- * Determine the max_active @pwq should use.
+ * Determine the max_active @pwq should use based on the proportion of
+ * online CPUs in the @pwq to the total system's online CPU count if
+ * @pwq->wq is unbound.
 */
 static int pwq_calculate_max_active(struct pool_workqueue *pwq)
 {
+	int pwq_nr_online_cpus;
+	int max_active;
+
 	/*
 	 * During [un]freezing, the caller is responsible for ensuring
 	 * that pwq_adjust_max_active() is called at least once after
@@ -4152,7 +4158,18 @@ static int pwq_calculate_max_active(struct pool_workqueue *pwq)
 	if ((pwq->wq->flags & WQ_FREEZABLE) && workqueue_freezing)
 		return 0;

-	return pwq->wq->saved_max_active;
+	if (!(pwq->wq->flags & WQ_UNBOUND))
+		return pwq->wq->saved_max_active;
+
+	pwq_nr_online_cpus = cpumask_weight_and(pwq->pool->attrs->__pod_cpumask, cpu_online_mask);
+	max_active = DIV_ROUND_UP(pwq->wq->saved_max_active * pwq_nr_online_cpus, num_online_cpus());
+
+	/*
+	 * To guarantee forward progress regardless of online CPU distribution,
+	 * the concurrency limit on every pwq is guaranteed to be equal to or
+	 * greater than wq->min_active.
+	 */
+	return clamp(max_active, pwq->wq->min_active, pwq->wq->saved_max_active);
 }

 /**
@@ -4745,6 +4762,7 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 	/* init wq */
 	wq->flags = flags;
 	wq->saved_max_active = max_active;
+	wq->min_active = min(max_active, WQ_DFL_MIN_ACTIVE);
 	mutex_init(&wq->mutex);
 	atomic_set(&wq->nr_pwqs_to_flush, 0);
 	INIT_LIST_HEAD(&wq->pwqs);
@@ -4898,7 +4916,8 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
 * @wq: target workqueue
 * @max_active: new max_active value.
 *
- * Set max_active of @wq to @max_active.
+ * Set max_active of @wq to @max_active. See the alloc_workqueue() function
+ * comment.
 *
 * CONTEXT:
 * Don't call from IRQ context.
@@ -4917,6 +4936,7 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)

 	wq->flags &= ~__WQ_ORDERED;
 	wq->saved_max_active = max_active;
+	wq->min_active = min(wq->min_active, max_active);

 	for_each_pwq(pwq, wq)
 		pwq_adjust_max_active(pwq);
--
2.19.1.6.gb485710b
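For readers who want to see the arithmetic in isolation, below is a minimal
userspace sketch of the distribution performed by pwq_calculate_max_active()
above. It is not part of the patch; the pod sizes, the configured max_active,
and the pod_max_active() helper are all hypothetical, chosen only to show how
each pod's share is rounded up and then clamped to min_active.

/* Standalone illustration (plain userspace C, not kernel code). */
#include <stdio.h>

#define WQ_DFL_MIN_ACTIVE	8

/* Per-pod share: proportional to the pod's online CPUs, rounded up, clamped. */
static int pod_max_active(int saved_max_active, int min_active,
			  int pod_online_cpus, int total_online_cpus)
{
	/* DIV_ROUND_UP(saved_max_active * pod_online_cpus, total_online_cpus) */
	int max_active = (saved_max_active * pod_online_cpus +
			  total_online_cpus - 1) / total_online_cpus;

	/* clamp(max_active, min_active, saved_max_active) */
	if (max_active < min_active)
		max_active = min_active;
	if (max_active > saved_max_active)
		max_active = saved_max_active;
	return max_active;
}

int main(void)
{
	/* hypothetical machine: four pods with 6, 6, 2 and 2 online CPUs */
	int pods[] = { 6, 6, 2, 2 };
	int total_online_cpus = 16;
	int saved_max_active = 20;	/* configured max_active */
	int min_active = saved_max_active < WQ_DFL_MIN_ACTIVE ?
			 saved_max_active : WQ_DFL_MIN_ACTIVE;
	int i, sum = 0;

	for (i = 0; i < (int)(sizeof(pods) / sizeof(pods[0])); i++) {
		int ma = pod_max_active(saved_max_active, min_active,
					pods[i], total_online_cpus);
		printf("pod %d: max_active %d\n", i, ma);
		sum += ma;
	}
	printf("configured max_active %d, sum of per-pod max_active %d\n",
	       saved_max_active, sum);
	return 0;
}

With these made-up numbers every pod's share is clamped up to min_active = 8,
so the per-pod sum is 32 against a configured max_active of 20. That
over-provisioning, bounded by the rounding and min_active effects listed in
the changelog, is the trade-off the patch accepts in exchange for keeping
nr_active per-pod instead of sharing one counter system-wide.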