Message-Id: <CCFA457F-E115-47F0-87F1-F64A51BDE96C@flyingcircus.io>
Date: Fri, 11 Oct 2024 15:50:12 +0200
From: Christian Theune <ct@...ingcircus.io>
To: Chris Mason <clm@...a.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
 Dave Chinner <david@...morbit.com>,
 Matthew Wilcox <willy@...radead.org>,
 Jens Axboe <axboe@...nel.dk>,
 linux-mm@...ck.org,
 "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
 linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org,
 Daniel Dao <dqminh@...udflare.com>,
 regressions@...ts.linux.dev,
 regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

Hi,

> On 11. Oct 2024, at 15:06, Chris Mason <clm@...a.com> wrote:
> 
> - It's actually taking the IO a long time to finish.  We can poke at the
> pending requests, how does the device look in the VM?  (virtio, scsi etc).

I _think_ that’s not it. This is a QEMU guest with virtio-block on a Ceph stack with 2x10G networking, fully SSD-backed. Over the last 24 hours operation latency has stayed below 0.016ms, and Ceph’s slow request warning (30s limit) has not triggered in that window.
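
(For what it’s worth, the in-guest view of pending requests can be sampled from sysfs; a rough sketch of the kind of check I mean, assuming the virtio disks show up as /dev/vd*:)

#!/usr/bin/env python3
# Rough sketch: sample in-flight request counts for the virtio disks as
# seen inside the guest. /sys/block/<dev>/inflight reports
# "<reads> <writes>" currently queued or in flight.
import glob
import time

def sample():
    for path in sorted(glob.glob("/sys/block/vd*/inflight")):
        dev = path.split("/")[3]
        with open(path) as f:
            reads, writes = f.read().split()
        print(f"{dev}: {reads} reads, {writes} writes in flight")

while True:
    sample()
    print("---")
    time.sleep(1)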

Also, aside from a VM that was exhausting its QEMU I/O throttling for a minute (and was stuck in completely different tracebacks), the only blocked-task reports from the last 48 hours were for this specific process.

If the storage were misbehaving, I’d expect to see a lot more reports about I/O issues from multiple VMs and multiple workloads at the same time. We did see that in the long, long past with older Ceph versions and spinning rust, so I’m pretty confident (at the moment) that this isn’t a storage issue per se.

Incidentally, this reminds me of a different (or maybe not so different?) issue that I’ve been trying to track down with mdraid/xfs:
https://marc.info/?l=linux-raid&m=172295385102939&w=2

So far this has only been seen on an older kernel (5.15.138), and we ended up seeing IOPS stuck in the md device but not below it. MD isn’t involved here, however; I made the connection because the original traceback also shows the task stuck in “wait_on_page_writeback”, but maybe that’s a red herring:

[Aug 6 09:35] INFO: task .backy-wrapped:2615 blocked for more than 122 seconds.
[ +0.008130] Not tainted 5.15.138 #1-NixOS
[ +0.005194] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.008895] task:.backy-wrapped state:D stack: 0 pid: 2615 ppid: 1 flags:0x00000002
[ +0.000005] Call Trace:
[ +0.000002] <TASK>
[ +0.000004] __schedule+0x373/0x1580
[ +0.000009] ? xlog_cil_commit+0x559/0x880 [xfs]
[ +0.000041] schedule+0x5b/0xe0
[ +0.000001] io_schedule+0x42/0x70
[ +0.000001] wait_on_page_bit_common+0x119/0x380
[ +0.000005] ? __page_cache_alloc+0x80/0x80
[ +0.000002] wait_on_page_writeback+0x22/0x70
[ +0.000001] truncate_inode_pages_range+0x26f/0x6d0
[ +0.000006] evict+0x15f/0x180
[ +0.000003] __dentry_kill+0xde/0x170
[ +0.000001] dput+0x15b/0x330
[ +0.000002] do_renameat2+0x34e/0x5b0
[ +0.000003] __x64_sys_rename+0x3f/0x50
[ +0.000002] do_syscall_64+0x3a/0x90
[ +0.000002] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[ +0.000003] RIP: 0033:0x7fdd1885275b
[ +0.000002] RSP: 002b:00007ffde643ad18 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[ +0.000002] RAX: ffffffffffffffda RBX: 00007ffde643adb0 RCX: 00007fdd1885275b
[ +0.000001] RDX: 0000000000000000 RSI: 00007fdd09a3d3d0 RDI: 00007fdd098549d0
[ +0.000001] RBP: 00007ffde643ad60 R08: 00000000ffffffff R09: 0000000000000000
[ +0.000001] R10: 00007ffde643af90 R11: 0000000000000246 R12: 00000000ffffff9c
[ +0.000000] R13: 00000000ffffff9c R14: 000000000183cab0 R15: 00007fdd0b128810
[ +0.000001] </TASK>
[ +0.000011] INFO: task kworker/u64:0:2380262 blocked for more than 122 seconds.
[ +0.008309] Not tainted 5.15.138 #1-NixOS
[ +0.005190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.008895] task:kworker/u64:0 state:D stack: 0 pid:2380262 ppid: 2 flags:0x00004000
[ +0.000004] Workqueue: kcryptd/253:4 kcryptd_crypt [dm_crypt]
[ +0.000006] Call Trace:
[ +0.000001] <TASK>
[ +0.000001] __schedule+0x373/0x1580
[ +0.000003] schedule+0x5b/0xe0
[ +0.000001] md_bitmap_startwrite+0x177/0x1e0
[ +0.000004] ? finish_wait+0x90/0x90
[ +0.000004] add_stripe_bio+0x449/0x770 [raid456]
[ +0.000005] raid5_make_request+0x1cf/0xbd0 [raid456]
[ +0.000003] ? kmem_cache_alloc_node_trace+0x391/0x3e0
[ +0.000004] ? linear_map+0x44/0x90 [dm_mod]
[ +0.000005] ? finish_wait+0x90/0x90
[ +0.000001] ? __blk_queue_split+0x516/0x580
[ +0.000003] md_handle_request+0x122/0x1b0
[ +0.000003] md_submit_bio+0x6e/0xb0
[ +0.000001] __submit_bio+0x18f/0x220
[ +0.000002] ? crypt_page_alloc+0x46/0x60 [dm_crypt]
[ +0.000002] submit_bio_noacct+0xbe/0x2d0
[ +0.000001] kcryptd_crypt+0x392/0x550 [dm_crypt]
[ +0.000002] process_one_work+0x1d6/0x360
[ +0.000003] worker_thread+0x4d/0x3b0
[ +0.000002] ? process_one_work+0x360/0x360
[ +0.000001] kthread+0x118/0x140
[ +0.000001] ? set_kthread_struct+0x50/0x50
[ +0.000001] ret_from_fork+0x22/0x30
[ +0.000004] </TASK>
…(more md kworker tasks pile up here)
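
(Aside: to catch tasks parked in these paths without waiting for the hung-task detector, one option is to walk /proc for D-state tasks and dump their kernel stacks. A rough sketch, assuming root and CONFIG_STACKTRACE so that /proc/<pid>/stack is readable:)

#!/usr/bin/env python3
# Rough sketch: list uninterruptible (D-state) tasks and their kernel
# stacks, to spot anything parked in wait_on_page_writeback or
# md_bitmap_startwrite. Needs root; /proc/<pid>/stack requires
# CONFIG_STACKTRACE.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # state is the field right after the ")" that closes the comm
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state != "D":
            continue
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        with open(f"/proc/{pid}/stack") as f:
            stack = f.read()
        print(f"pid {pid} ({comm}) in D state:")
        print(stack)
    except (OSError, IndexError):
        continue  # task exited or file not accessible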

Christian

-- 
Christian Theune · ct@...ingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

