[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <847D3821-1D92-468C-88C3-34284BA7922E@michaelmarod.com>
Date: Wed, 30 Mar 2022 14:50:45 +0000
From: Michael Marod <michael@...haelmarod.com>
To: linux-kernel@...r.kernel.org
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
Fix: There were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "IO unplugs" and 0 "Timer unplugs" on the Linux 5.4.0 one.
> On Mar 30, 2022, at 10:31 AM, Michael Marod <michael@...haelmarod.com> wrote:
>
> It has been noticed in an AWS Cloud environment that the IO performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than that of an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue and it seems like block level queueing is not working as well (if at all?) on the Linux 5 instance. When I've run performance tests using fio, iostat reports 0.00 average queue size for the Linux 5.4.0 instance but goes up to 0.5 or higher on the 4.4.0 one.
>
> In my setup, I have 2 NVMe devices (/dev/nvme0n1 and /dev/nvme1n1) configured in a raid0 on /dev/md127 which is mounted on /opt. I've tested devices directly and through the RAID controller and it does not seem to make a difference. In the output below I am writing through /opt which exercises the file system and RAID controller. Also, fio was compiled from source with latest version on both instances.
>
> # fio -name=randrw -filename=/opt/testfile -direct=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=1 -group_reporting=1 -runtime=120
>
> Note: these are different versions of iostat which is why the output is different
>
> # Linux 4.4.0-1137-aws:
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> nvme1n1 0.00 0.00 3248.00 3173.00 12992.00 12692.00 8.00 0.44 0.07 0.10 0.03 0.07 44.00
> nvme0n1 0.00 0.00 3229.00 3292.00 12916.00 13168.00 8.00 0.47 0.07 0.10 0.04 0.07 46.80
>
> # Linux 5.4.0-1068-aws:
> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> nvme0n1 2842.57 2789.11 11370.30 11156.44 0.00 0.00 0.00 0.00 0.10 0.04 0.00 4.00 4.00 0.18 99.01
> nvme1n1 2849.50 2748.51 11398.02 10994.06 0.00 0.00 0.00 0.00 0.10 0.04 0.00 4.00 4.00 0.18 99.01
>
> For another (possibly simpler) demonstration -- I ran blktrace/blkparse on both Linux 4.4.0 and 5.5.0 instances with the same hardware (i3.4xlarge). Then I used fio to do a simple stress test that sends 5 write requests.
>
> # fio -name=write -filename=/opt/testfile -direct=1 -thread -rw=write, -ioengine=psync -bs=4k -size=4k -numjobs=5 -group_reporting=1 -runtime=5
>
> The blkparse output shows that the Linux 4.4.0 instance had 5 "Writes Queued" while the Linux 5.4.0 one had 0. Additionally, there were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "Timer unplugs" and 0 "IO unplugs" on the Linux 5.4.0 one.
>
> // On the Bionic machine with Linux 5.4.0-1068-aws #72~18.04.1-Ubuntu SMP Thu Mar 3 08:49:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux:
> # blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
> 259,0 12 1 0.000000000 114067 A WS 401408 + 8 <- (9,127) 274432
> 259,0 12 2 0.000003433 114067 G WS 401408 + 8 [fio]
> 259,0 12 3 0.000004078 114067 P N [fio]
> 259,0 12 4 0.000004548 114067 U N [fio] 1
> 259,0 12 5 0.000005654 114067 D WS 401408 + 8 [fio]
> 259,0 4 1 0.000045379 0 C WS 401408 + 8 [0]
> 259,0 14 1 0.000210927 0 C WS 401408 + 8 [0]
> 259,0 14 2 0.000254591 114065 A WS 401408 + 8 <- (9,127) 274432
> 259,0 14 3 0.000257373 114065 G WS 401408 + 8 [fio]
> 259,0 14 4 0.000257853 114065 P N [fio]
> 259,0 14 5 0.000258340 114065 U N [fio] 1
> 259,0 14 6 0.000259293 114065 D WS 401408 + 8 [fio]
> 259,0 8 2 0.000290583 0 C WS 401408 + 8 [0]
> 259,0 13 1 0.000080964 114068 A WS 401408 + 8 <- (9,127) 274432
> 259,0 13 2 0.000083702 114068 G WS 401408 + 8 [fio]
> 259,0 13 3 0.000084247 114068 P N [fio]
> 259,0 13 4 0.000084654 114068 U N [fio] 1
> 259,0 13 5 0.000085801 114068 D WS 401408 + 8 [fio]
> 259,0 3 1 1266874889.709450585 114066 A WS 401408 + 8 <- (9,127) 274432
> 259,0 3 2 1266874889.709455280 114066 G WS 401408 + 8 [fio]
> 259,0 3 3 1266874889.709456267 114066 P N [fio]
> 259,0 3 4 1266874889.709456843 114066 U N [fio] 1
> 259,0 3 5 1266874889.709458888 114066 D WS 401408 + 8 [fio]
> 259,0 15 1 0.000164738 114069 A WS 401408 + 8 <- (9,127) 274432
> 259,0 15 2 0.000168016 114069 G WS 401408 + 8 [fio]
> 259,0 15 3 0.000168571 114069 P N [fio]
> 259,0 15 4 0.000169037 114069 U N [fio] 1
> 259,0 15 5 0.000169993 114069 D WS 401408 + 8 [fio]
> 259,0 10 1 0.000128797 0 C WS 401408 + 8 [0]
> ^C259,0 8 1 1266874889.709500996 0 C WS 401408 + 8 [0]
> CPU3 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 1 Timer unplugs: 0
> CPU4 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 0 Timer unplugs: 0
> CPU8 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 2, 8KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 0 Timer unplugs: 0
> CPU10 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 0 Timer unplugs: 0
> CPU12 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 1 Timer unplugs: 0
> CPU13 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 1 Timer unplugs: 0
> CPU14 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 1 Timer unplugs: 0
> CPU15 (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 3
> IO unplugs: 1 Timer unplugs: 0
>
> Total (259,0):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 5, 20KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 5, 20KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> IO unplugs: 5 Timer unplugs: 0
>
> Throughput (R/W): 0KiB/s / 0KiB/s
> Events (259,0): 30 entries
> Skips: 0 forward (0 - 0.0%)
>
> // On the Xenial machine with 4.4.0-1137-aws #151-Ubuntu SMP Tue Mar 1 09:47:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> # blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
> 259,1 1 1 0.000000000 82519 Q WS 399368 + 8 [fio]
> 259,1 1 2 0.000000980 82519 G WS 399368 + 8 [fio]
> 259,1 1 3 0.000003468 82519 UT N [fio] 1
> 259,1 1 4 0.000003867 82519 I WS 399368 + 8 [fio]
> 259,1 1 5 0.000011785 1393 D WS 399368 + 8 [kworker/1:1H]
> 259,1 2 1 1266874889.709465021 82517 Q WS 399368 + 8 [fio]
> 259,1 2 2 1266874889.709467133 82517 G WS 399368 + 8 [fio]
> 259,1 2 3 1266874889.709471568 82517 UT N [fio] 1
> 259,1 2 4 1266874889.709471568 82517 I WS 399368 + 8 [fio]
> 259,1 2 5 1266874889.709476205 980 D WS 399368 + 8 [kworker/2:1H]
> 259,1 15 1 0.000051309 0 C WS 399368 + 8 [0]
> 259,1 3 1 0.000079179 82520 Q WS 399368 + 8 [fio]
> 259,1 3 2 0.000080079 82520 G WS 399368 + 8 [fio]
> 259,1 3 3 0.000082140 82520 UT N [fio] 1
> 259,1 3 4 0.000082391 82520 I WS 399368 + 8 [fio]
> 259,1 3 5 0.000087819 975 D WS 399368 + 8 [kworker/3:1H]
> 259,1 12 1 0.000125455 0 C WS 399368 + 8 [0]
> 259,1 5 1 0.000152778 82521 Q WS 399368 + 8 [fio]
> 259,1 5 2 0.000153811 82521 G WS 399368 + 8 [fio]
> 259,1 5 3 0.000155811 82521 UT N [fio] 1
> 259,1 5 4 0.000156314 82521 I WS 399368 + 8 [fio]
> 259,1 5 5 0.000168577 1552 D WS 399368 + 8 [kworker/5:1H]
> 259,1 7 1 0.000206251 0 C WS 399368 + 8 [0]
> 259,1 9 1 0.000234725 82518 Q WS 399368 + 8 [fio]
> 259,1 9 2 0.000235496 82518 G WS 399368 + 8 [fio]
> 259,1 9 3 0.000237147 82518 UT N [fio] 1
> 259,1 9 4 0.000237382 82518 I WS 399368 + 8 [fio]
> 259,1 9 5 0.000242425 2001 D WS 399368 + 8 [kworker/9:1H]
> 259,1 15 2 0.000290161 0 C WS 399368 + 8 [0]
> 259,1 13 1 1266874889.709520875 0 C WS 399368 + 8 [0]
> ^CCPU1 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 1
> CPU2 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 1
> CPU3 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 1
> CPU5 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 1
> CPU7 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 0
> CPU9 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 1
> CPU12 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 0
> CPU13 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 0
> CPU15 (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 2, 8KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 2
> IO unplugs: 0 Timer unplugs: 0
>
> Total (259,1):
> Reads Queued: 0, 0KiB Writes Queued: 5, 20KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 5, 20KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 5, 20KiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> IO unplugs: 0 Timer unplugs: 5
>
> Throughput (R/W): 0KiB/s / 0KiB/s
> Events (259,1): 30 entries
> Skips: 0 forward (0 - 0.0%)
>
> Please CC me directly as I am not subscribed to LKML.
>
> Thanks,
>
> -- Michael Marod (michael@...haelmarod.com)
Powered by blists - more mailing lists