Message-ID: <dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3>
Date: Mon, 4 Dec 2023 16:03:47 +0000
From: Naohiro Aota <Naohiro.Aota@....com>
To: Tejun Heo <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-btrfs@...r.kernel.org" <linux-btrfs@...r.kernel.org>
CC: "ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"coreteam@...filter.org" <coreteam@...filter.org>,
"dm-devel@...ts.linux.dev" <dm-devel@...ts.linux.dev>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
"gfs2@...ts.linux.dev" <gfs2@...ts.linux.dev>,
"intel-gfx@...ts.freedesktop.org" <intel-gfx@...ts.freedesktop.org>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>,
"linux-bcachefs@...r.kernel.org" <linux-bcachefs@...r.kernel.org>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
"linux-cachefs@...hat.com" <linux-cachefs@...hat.com>,
"linux-cifs@...r.kernel.org" <linux-cifs@...r.kernel.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"linux-erofs@...ts.ozlabs.org" <linux-erofs@...ts.ozlabs.org>,
"linux-f2fs-devel@...ts.sourceforge.net"
<linux-f2fs-devel@...ts.sourceforge.net>,
"linux-fscrypt@...r.kernel.org" <linux-fscrypt@...r.kernel.org>,
"linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
"linux-mediatek@...ts.infradead.org"
<linux-mediatek@...ts.infradead.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-mmc@...r.kernel.org" <linux-mmc@...r.kernel.org>,
"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
"linux-raid@...r.kernel.org" <linux-raid@...r.kernel.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"linux-remoteproc@...r.kernel.org" <linux-remoteproc@...r.kernel.org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
"linux-trace-kernel@...r.kernel.org"
<linux-trace-kernel@...r.kernel.org>,
"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
"linux-wireless@...r.kernel.org" <linux-wireless@...r.kernel.org>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
"nbd@...er.debian.org" <nbd@...er.debian.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"ntb@...ts.linux.dev" <ntb@...ts.linux.dev>,
"open-iscsi@...glegroups.com" <open-iscsi@...glegroups.com>,
"oss-drivers@...igine.com" <oss-drivers@...igine.com>,
"platform-driver-x86@...r.kernel.org"
<platform-driver-x86@...r.kernel.org>,
"samba-technical@...ts.samba.org" <samba-technical@...ts.samba.org>,
"target-devel@...r.kernel.org" <target-devel@...r.kernel.org>,
"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>,
"wireguard@...ts.zx2c4.com" <wireguard@...ts.zx2c4.com>
Subject: Performance drop due to alloc_workqueue() misuse and recent change
Recently, commit 636b927eba5b ("workqueue: Make unbound workqueues to use
per-cpu pool_workqueues") changed the behavior of WQ_UNBOUND workqueues. It
changed the meaning of alloc_workqueue()'s max_active from an upper limit
imposed per NUMA node to a limit per CPU. As a result, a massive number of
workers can be running at the same time, especially if the workqueue user
assumes max_active is a global limit.

Actually, the documentation already described max_active as a per-CPU limit
before the commit. However, several callers seem to misuse max_active,
presumably thinking it is a global limit, so the commit is an unexpected
behavior change for them.
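To make the change concrete, here is a minimal sketch (not taken from any
real caller; "example_wq" and example_init() are hypothetical names) of how
one and the same alloc_workqueue() call is interpreted before and after the
commit:

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int example_init(void)
{
        /*
         * max_active = 8 means:
         *  - before 636b927eba5b: at most 8 works of this workqueue
         *    run concurrently per NUMA node;
         *  - after 636b927eba5b: at most 8 run concurrently per CPU,
         *    i.e. up to 8 * nr_cpus system-wide.
         */
        example_wq = alloc_workqueue("example_wq", WQ_UNBOUND, 8);
        if (!example_wq)
                return -ENOMEM;
        return 0;
}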
For example, the callers below set max_active = num_online_cpus(), which is
a suspicious value for a per-CPU limit. With this configuration, we can
have nr_cpu * nr_cpu active works running at the same time (see the sketch
after the list).
fs/f2fs/data.c: sbi->post_read_wq = alloc_workqueue("f2fs_post_read_wq",
fs/f2fs/data.c- WQ_UNBOUND | WQ_HIGHPRI,
fs/f2fs/data.c- num_online_cpus());
fs/crypto/crypto.c: fscrypt_read_workqueue = alloc_workqueue("fscrypt_read_queue",
fs/crypto/crypto.c- WQ_UNBOUND | WQ_HIGHPRI,
fs/crypto/crypto.c- num_online_cpus());
fs/verity/verify.c: fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
fs/verity/verify.c- WQ_HIGHPRI,
fs/verity/verify.c- num_online_cpus());
drivers/crypto/hisilicon/qm.c: qm->wq = alloc_workqueue("%s", WQ_HIGHPRI | WQ_MEM_RECLAIM |
drivers/crypto/hisilicon/qm.c- WQ_UNBOUND, num_online_cpus(),
drivers/crypto/hisilicon/qm.c- pci_name(qm->pdev));
block/blk-crypto-fallback.c: blk_crypto_wq = alloc_workqueue("blk_crypto_wq",
block/blk-crypto-fallback.c- WQ_UNBOUND | WQ_HIGHPRI |
block/blk-crypto-fallback.c- WQ_MEM_RECLAIM, num_online_cpus());
drivers/md/dm-crypt.c: cc->crypt_queue = alloc_workqueue("kcryptd/%s",
drivers/md/dm-crypt.c- WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
drivers/md/dm-crypt.c- num_online_cpus(), devname);
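The following hypothetical helper (example_worst_case_concurrency() is not
a real kernel function) just spells out that arithmetic:

#include <linux/cpumask.h>

/*
 * With max_active = num_online_cpus() and the per-CPU semantics after
 * 636b927eba5b, each of the N online CPUs may run N works of such a
 * workqueue at once.
 */
static unsigned int example_worst_case_concurrency(void)
{
        unsigned int n = num_online_cpus();

        return n * n;   /* e.g. 96 CPUs -> up to 9216 concurrent works */
}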
Furthermore, the change hurts performance in certain cases. Btrfs creates
several WQ_UNBOUND workqueues with a default max_active = min(NRCPUS + 2,
8). As my machine has 96 CPUs with NUMA disabled, this max_active setting
now allows over 700 active works (8 per CPU * 96 CPUs = 768). Before the
commit, the limit was 8 in total with NUMA disabled, or 16 with two NUMA
nodes.
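For reference, the btrfs default comes from a computation roughly like the
following (paraphrased from fs/btrfs/disk-io.c; the exact form may differ
between kernel versions):

        fs_info->thread_pool_size =
                min_t(unsigned long, num_online_cpus() + 2, 8);

With 96 CPUs this evaluates to 8, which under the new per-CPU semantics
becomes the effective 768-work limit simulated below.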
I reverted the workqueue code to before the commit, and ran the following
fio command on a btrfs RAID0 filesystem over 6 SSDs:
fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
--rw=write --fallocate=none \
--direct=1 --ioengine=libaio --iodepth=32 \
--filesize=100G \
--blocksize=64k \
--time_based --runtime=300s \
--end_fsync=1 \
--directory=${MNT} \
--name=writer --numjobs=32
The result varies as the workqueue's max_active is changed:
- wq max_active=8 (intended limit by btrfs?)
WRITE: bw=2495MiB/s (2616MB/s), 2495MiB/s-2495MiB/s (2616MB/s-2616MB/s), io=753GiB (808GB), run=308953-308953msec
- wq max_active=16 (the actual limit on a 2-NUMA-node setup before the commit)
WRITE: bw=1736MiB/s (1820MB/s), 1736MiB/s-1736MiB/s (1820MB/s-1820MB/s), io=670GiB (720GB), run=395532-395532msec
- wq max_active=768 (simulating the current limit)
WRITE: bw=1276MiB/s (1338MB/s), 1276MiB/s-1276MiB/s (1338MB/s-1338MB/s), io=375GiB (403GB), run=300984-300984msec
The current limit (max_active=768) is 27% slower than the previous limit
(max_active=16), and 50% slower than the intended limit (max_active=8). The
performance drop might be due to contention among the btrfs-endio-write
works: over 700 kworker instances were created, and about 100 works were in
the 'D' state competing for a lock.
More specifically, I also ran the same workload at the commit and at its
parent.
- At commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
WRITE: bw=1191MiB/s (1249MB/s), 1191MiB/s-1191MiB/s (1249MB/s-1249MB/s), io=350GiB (376GB), run=300714-300714msec
- At the parent commit 4cbfd3de73 ("workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug")
WRITE: bw=1747MiB/s (1832MB/s), 1747MiB/s-1747MiB/s (1832MB/s-1832MB/s), io=748GiB (803GB), run=438134-438134msec
So, the commit alone accounts for a 31.8% performance drop.
In summary, several callers misuse max_active by treating it as a global
limit, and the recent commit introduced a huge performance drop in some
cases. We need to review alloc_workqueue() usages and check whether their
max_active settings are appropriate.