Message-ID: <ZGgAKK-c_DZpvNJB@slm.duckdns.org>
Date:   Fri, 19 May 2023 13:03:04 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     jiangshanlai@...il.com, peterz@...radead.org,
        linux-kernel@...r.kernel.org, kernel-team@...a.com,
        joshdon@...gle.com, brho@...gle.com, briannorris@...omium.org,
        nhuck@...gle.com, agk@...hat.com, snitzer@...nel.org,
        void@...ifault.com
Subject: Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue
 execution locality

Oh, one addition.

Once below saturation, latency and bw are mostly two sides of the same
coin but just to be sure, here are latency results. The single-threaded sync
IO is run with a 1ms interval between IOs.

  taskset 0x8 fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=512 \
	--ioengine=sync --iodepth=1 --runtime=60 --numjobs=1 --time_based \
	--group_reporting --name=iops-test-job --verify=sha512 --thinktime=1ms

SYSTEM

  read: IOPS=480, BW=240KiB/s (246kB/s)(14.1MiB/60001msec)
    clat (usec): min=8, max=401, avg=30.96, stdev= 9.60
     lat (usec): min=8, max=401, avg=31.01, stdev= 9.60
    clat percentiles (usec):
     |  1.00th=[   11],  5.00th=[   13], 10.00th=[   25], 20.00th=[   27],
     | 30.00th=[   28], 40.00th=[   29], 50.00th=[   29], 60.00th=[   30],
     | 70.00th=[   32], 80.00th=[   42], 90.00th=[   44], 95.00th=[   44],
     | 99.00th=[   46], 99.50th=[   46], 99.90th=[   56], 99.95th=[   71],
     | 99.99th=[  253]
   bw (  KiB/s): min=  214, max=  265, per=99.85%, avg=240.29, stdev=11.35, samples=119
   iops        : min=  428, max=  530, avg=480.59, stdev=22.70, samples=119

CPU_STRICT

  read: IOPS=474, BW=237KiB/s (243kB/s)(385KiB/1624msec)
    clat (usec): min=9, max=240, avg=28.00, stdev=11.20
     lat (usec): min=9, max=240, avg=28.05, stdev=11.20
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   26], 10.00th=[   26], 20.00th=[   26],
     | 30.00th=[   27], 40.00th=[   28], 50.00th=[   28], 60.00th=[   28],
     | 70.00th=[   29], 80.00th=[   30], 90.00th=[   31], 95.00th=[   31],
     | 99.00th=[   32], 99.50th=[   50], 99.90th=[  241], 99.95th=[  241],
     | 99.99th=[  241]

CACHE

  read: IOPS=479, BW=240KiB/s (245kB/s)(14.0MiB/60002msec)
    clat (nsec): min=7874, max=75922, avg=13342.34, stdev=6906.53
     lat (nsec): min=7904, max=75952, avg=13386.08, stdev=6906.60
    clat percentiles (nsec):
     |  1.00th=[ 8384],  5.00th=[ 8896], 10.00th=[ 9152], 20.00th=[ 9408],
     | 30.00th=[ 9536], 40.00th=[ 9920], 50.00th=[10432], 60.00th=[10688],
     | 70.00th=[11072], 80.00th=[13632], 90.00th=[27264], 95.00th=[28288],
     | 99.00th=[30592], 99.50th=[30848], 99.90th=[41216], 99.95th=[56064],
     | 99.99th=[74240]
   bw (  KiB/s): min=  204, max=  269, per=99.69%, avg=239.67, stdev=11.02, samples=119
   iops        : min=  408, max=  538, avg=479.34, stdev=22.04, samples=119


It's a bit confusing because fio switched to printing nsecs for CACHE, but
CPU_STRICT (per-cpu)'s average completion latency is, as expected, better
than SYSTEM - 28us vs. 31us - and CACHE's is way better still at 13.3us.
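
To make the comparison less error-prone given the mixed units, here is a
minimal sketch that normalizes the three "clat avg" values to usecs. The
numbers are copied from the fio output above; the helper name and the
dictionary layout are just for illustration and are not part of fio or the
patchset.

```python
# Average completion latencies (clat avg) copied from the fio output above.
# fio reported SYSTEM and CPU_STRICT in usec and CACHE in nsec.
results = {
    "SYSTEM": (30.96, "usec"),
    "CPU_STRICT": (28.00, "usec"),
    "CACHE": (13342.34, "nsec"),
}

# Multipliers to convert each unit to usec.
TO_USEC = {"nsec": 1e-3, "usec": 1.0, "msec": 1e3}


def to_usec(value, unit):
    """Convert a latency value in the given unit to usec (hypothetical helper)."""
    return value * TO_USEC[unit]


normalized = {name: to_usec(v, u) for name, (v, u) in results.items()}
baseline = normalized["SYSTEM"]

for name, usec in normalized.items():
    # Print each scope's latency in usec and its ratio to the SYSTEM scope.
    print(f"{name:10s} {usec:6.2f} usec  ({usec / baseline:.2f}x of SYSTEM)")
```

With everything in the same unit, CACHE's average comes out at well under
half of SYSTEM's.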

Thanks.

-- 
tejun