Message-ID: <91eacde0-df99-4d5c-a980-91046f66e612@samsung.com>
Date: Tue, 30 Jan 2024 23:30:21 +0100
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: Tejun Heo <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>
Cc: linux-kernel@...r.kernel.org, Naohiro.Aota@....com, kernel-team@...a.com
Subject: Re: [PATCH v4 09/10] workqueue: Implement system-wide nr_active
 enforcement for unbound workqueues

Dear All,

On 29.01.2024 19:14, Tejun Heo wrote:
> From: Tejun Heo <tj@...nel.org>
> Date: Mon, 29 Jan 2024 08:11:25 -1000
>
> A pool_workqueue (pwq) represents the connection between a workqueue and a
> worker_pool. One of the roles that a pwq plays is enforcement of the
> max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
> per-cpu workqueues and one per NUMA node for unbound workqueues, which was a
> natural result of per-cpu workqueues being served by per-cpu pools and unbound
> workqueues by per-NUMA pools.
>
> In terms of max_active enforcement, this was, while not perfect, workable.
> For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great:
> NUMA machines would get a max_active multiplied by the number of nodes, but it
> didn't cause huge problems because NUMA machines are relatively rare and
> the node count is usually pretty low.
>
> However, cache layouts are more complex now and sharing a worker pool across
> a whole node didn't really work well for unbound workqueues. Thus, a series
> of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
> affinity scope for unbound workqueues") implemented a more flexible affinity
> mechanism for unbound workqueues which enables using e.g. last-level-cache
> aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues") made unbound workqueues use
> per-cpu pwqs like per-cpu workqueues.
>
> While the change was necessary to enable more flexible affinity scopes, this
> came with the side effect of blowing up the effective max_active for unbound
> workqueues. Before, the effective max_active for unbound workqueues was
> multiplied by the number of nodes. After, by the number of CPUs.
>
> 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
> pool_workqueues") claims that this should generally be okay. It is okay for
> users which self-regulate their concurrency level, which are the vast majority;
> however, there are enough use cases which actually depend on max_active to
> prevent the level of concurrency from going bonkers, including several IO
> handling workqueues that can issue a work item for each in-flight IO. With
> targeted benchmarks, the misbehavior can easily be exposed as reported in
> http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
>
> Unfortunately, there is no way to express what these use cases need using
> per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't want
> to set max_active too low, but as soon as we increase max_active a bit, we
> can end up with an unreasonable number of in-flight work items when many CPUs
> issue IOs at the same time. I.e. the acceptable lowest max_active is higher
> than the acceptable highest max_active.
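
(Illustrating the dilemma with made-up numbers: on a 64-CPU machine, a per-cpu
max_active of 16 may already be too low for the one CPU that happens to issue
most of the IO, yet it still allows up to 64 * 16 = 1024 in-flight work items
if every CPU submits at once.)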
>
> Ideally, max_active for an unbound workqueue should be system-wide so that
> the users can regulate the total level of concurrency regardless of node and
> cache layout. The reasons workqueue hasn't implemented that yet are:
>
> - Once max_active enforcement decouples from pool boundaries, chaining
>    execution after a work item finishes requires inter-pool operations which
>    would require lock dancing, which is nasty.
>
> - Sharing a single nr_active count across the whole system can be pretty
>    expensive on NUMA machines.
>
> - Per-pwq enforcement had been more or less okay while we were using
>    per-node pools.
>
> It looks like we can no longer avoid decoupling max_active enforcement from
> pool boundaries. This patch implements a system-wide nr_active mechanism with
> the following design characteristics:
>
> - To avoid sharing a single counter across multiple nodes, the configured
>    max_active is split across nodes according to the proportion of each
>    workqueue's online effective CPUs per node. e.g. A node with twice as many
>    online effective CPUs will get twice the portion of max_active. (A worked
>    sketch of this split follows below.)
>
> - Workqueue used to be able to process a chain of interdependent work items
>    which is as long as max_active. We can't do this anymore as max_active is
>    distributed across the nodes. Instead, a new parameter min_active is
>    introduced which determines the minimum level of concurrency within a node
>    regardless of how max_active distribution comes out to be.
>
>    It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
>    This can lead to a higher effective max_active than configured and also
>    deadlocks if a workqueue was depending on being able to handle chains of
>    interdependent work items that are longer than 8.
>
>    I believe these should be fine given that the number of CPUs in each NUMA
>    node is usually higher than 8 and a work item chain longer than 8 is pretty
>    unlikely. However, if these assumptions turn out to be wrong, we'll need
>    to add an interface to adjust min_active.
>
> - Each unbound wq has an array of struct wq_node_nr_active which tracks
>    per-node nr_active. When its pwq wants to run a work item, it has to
>    obtain the matching node's nr_active. If over the node's max_active, the
>    pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
>    the completion path round-robins the pending pwqs, activating the first
>    inactive work item of each, which involves some pool lock dancing and
>    kicking other pools. It's not the simplest code but doesn't look too bad.
>    (A rough sketch of the per-node bookkeeping follows below.)
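
To make the proportional split concrete, here is a minimal, self-contained
sketch in plain userspace C. The CPU counts and the exact rounding are
assumptions for illustration only, not code from the patch; it merely mirrors
the behaviour described above: split max_active across nodes by each node's
share of online effective CPUs, then clamp every share to the
[min_active, max_active] range.

#include <stdio.h>

int main(void)
{
	/* hypothetical configuration, not taken from the patch */
	int max_active = 16;
	int min_active = 8;              /* min(max_active, WQ_DFL_MIN_ACTIVE) */
	int node_cpus[] = { 24, 8 };     /* online effective CPUs per node */
	int nr_nodes = 2, total_cpus = 32;

	for (int node = 0; node < nr_nodes; node++) {
		/* proportional share of max_active, rounded up */
		int share = (max_active * node_cpus[node] + total_cpus - 1)
			    / total_cpus;

		/* never drop below min_active, never exceed max_active */
		if (share < min_active)
			share = min_active;
		if (share > max_active)
			share = max_active;

		printf("node %d: max_active share %d\n", node, share);
	}
	return 0;
}

With these made-up numbers the shares come out as 12 and 8, i.e. an effective
total of 20 even though 16 was configured, which is exactly the min_active
side effect described above.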
>
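A rough sketch of the per-node bookkeeping that the description implies is
below. wq_node_nr_active, its max field and pending_pwqs are named in the
text; the remaining fields and the comments are guesses for illustration, not
the actual definition from the patch.

	struct wq_node_nr_active {
		int			max;		/* this node's share of max_active */
		atomic_t		nr;		/* in-flight work items on this node (guessed) */
		raw_spinlock_t		lock;		/* protects pending_pwqs (guessed) */
		struct list_head	pending_pwqs;	/* pwqs waiting for nr to drop below max */
	};
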
> v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
>
>      - wq_adjust_max_active() is now protected by wq->mutex instead of
>        wq_pool_mutex.
>
> v3: - wq_node_max_active() used to calculate per-node max_active on the fly
>        based on system-wide CPU online states. Lai pointed out that this can
>        lead to skewed distributions for workqueues with restricted cpumasks.
>        Update the max_active distribution to use per-workqueue effective
>        online CPU counts instead of system-wide and cache the calculation
>        results in node_nr_active->max.
>
> v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Reported-by: Naohiro Aota <Naohiro.Aota@....com>
> Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
> Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
> Reviewed-by: Lai Jiangshan <jiangshanlai@...il.com>

This patch landed in linux-next (next-20240130) as commit 5797b1c18919
("workqueue: Implement system-wide nr_active enforcement for unbound
workqueues"). Unfortunately, it causes the following regression on the
ARM64 RK3568 SoC based Odroid-M1 board:

Unable to handle kernel paging request at virtual address ffff0002100296e0
Mem abort info:
   ESR = 0x0000000096000005
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
Data abort info:
   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000000255a000
[ffff0002100296e0] pgd=18000001ffff7003, p4d=18000001ffff7003, 
pud=0000000000000000
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-next-20240130+ #14392
Hardware name: Hardkernel ODROID-M1 (DT)
pstate: 600000c9 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
..
Call trace:
  wq_update_node_max_active+0x50/0x1fc
  apply_wqattrs_commit+0xf0/0x114
  apply_workqueue_attrs_locked+0x58/0xa0
  alloc_workqueue+0x5ac/0x774
  workqueue_init_early+0x460/0x540
  start_kernel+0x258/0x684
  __primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

This wasn't trivial to bisect, because next-20240130 suffers from other
regressions:
https://lore.kernel.org/all/30ddedc9-0829-4a99-9cb1-39190937981c@samsung.com/
but reverting this change together with the ones mentioned in that thread
fixes all the issues observed on top of today's linux-next release. Let me
know if there is anything I can do to help fix this issue.

> ---
>   include/linux/workqueue.h |  35 +++-
>   kernel/workqueue.c        | 341 ++++++++++++++++++++++++++++++++++----
>   2 files changed, 341 insertions(+), 35 deletions(-)

 > ...


Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

