Message-ID: <6e010dbb-f125-4f44-9b1a-9e6ac9bb66ff@molgen.mpg.de>
Date: Thu, 21 Mar 2024 10:15:04 +0100
From: Donald Buczek <buczek@...gen.mpg.de>
To: Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, it+linux@...gen.mpg.de
Subject: possible 6.6 regression: Deadlock involving super_lock()
Hi,
we have a set of 6 systems with similar usage patterns which ran on 5.15 kernels for over a year. Only two weeks after we switched one of the systems from a 5.15 kernel to a 6.6 kernel, it went into a deadlock. I'm aware that I don't have enough information for this to be analyzed, but I thought I'd drop it here anyway, because the deadlock seems to involve the locking of a superblock and I've seen that some changes in that area went into 6.6. Maybe someone has an idea or suggestions for further inspection if this happens again.
These systems
- use automounter a lot (many mount/umount events)
- use nfs a lot (most data is on remote filesystems over nfs)
- are used interactively (users occasionally overload any resource like memory, cores or network)
When we noticed the problem, several processes were blocked, including the automounter, which was waiting for a mount that didn't complete:
# # /proc/73777/task/73777: mount.nfs : /sbin/mount.nfs rabies:/amd/rabies/M/MG009/project/avitidata /project/avitidata -s -o rw,nosuid,sec=mariux
# cat /proc/73777/task/73777/stack
[<0>] super_lock+0x40/0x140
[<0>] grab_super+0x29/0xc0
[<0>] grab_super_dead+0x2e/0x140
[<0>] sget_fc+0x1e1/0x2d0
[<0>] nfs_get_tree_common+0x86/0x520 [nfs]
[<0>] vfs_get_tree+0x21/0xb0
[<0>] nfs_do_submount+0x128/0x180 [nfs]
[<0>] nfs4_submount+0x566/0x6d0 [nfsv4]
[<0>] nfs_d_automount+0x16b/0x230 [nfs]
[<0>] __traverse_mounts+0x8f/0x210
[<0>] step_into+0x32a/0x740
[<0>] link_path_walk.part.0.constprop.0+0x246/0x380
[<0>] path_lookupat+0x3e/0x190
[<0>] filename_lookup+0xe8/0x1f0
[<0>] vfs_path_lookup+0x52/0x80
[<0>] mount_subtree+0xa0/0x150
[<0>] do_nfs4_mount+0x269/0x360 [nfsv4]
[<0>] nfs4_try_get_tree+0x48/0xd0 [nfsv4]
[<0>] vfs_get_tree+0x21/0xb0
[<0>] path_mount+0x79e/0xa50
[<0>] __x64_sys_mount+0x11a/0x150
[<0>] do_syscall_64+0x46/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Also, one writeback thread was blocked. I mention that because I don't see how these two threads could depend on each other:
# # /proc/39359/task/39359: kworker/u268:5+flush-0:58 :
# cat /proc/39359/task/39359/stack
[<0>] folio_wait_bit_common+0x135/0x350
[<0>] write_cache_pages+0x1a0/0x3a0
[<0>] nfs_writepages+0x12a/0x1e0 [nfs]
[<0>] do_writepages+0xcf/0x1e0
[<0>] __writeback_single_inode+0x46/0x3a0
[<0>] writeback_sb_inodes+0x1f5/0x4d0
[<0>] __writeback_inodes_wb+0x4c/0xf0
[<0>] wb_writeback+0x1f5/0x320
[<0>] wb_workfn+0x350/0x4f0
[<0>] process_one_work+0x142/0x300
[<0>] worker_thread+0x2f5/0x410
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x34/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
As a result, of course, more and more processes were blocked. A full list of all stack traces and some more info from the system in the blocked state is at https://owww.molgen.mpg.de/~buczek/2024-03-18_mount/info.log
dmesg is not included in that file, but I've reviewed it and there was nothing unusual in it.
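For the next occurrence, per-task dumps like the two shown above (a header line followed by the contents of /proc/<pid>/task/<tid>/stack) could be collected for every task in uninterruptible sleep with something along these lines. This is only a minimal sketch and not necessarily how info.log was produced; the function names parse_stat and blocked_tasks are made up for illustration, and it assumes that it runs as root on a kernel which exposes /proc/<pid>/task/<tid>/stack.

```python
#!/usr/bin/env python3
"""Dump kernel stacks of all tasks in uninterruptible sleep (state D).

Illustrative sketch: the helper names are hypothetical, and reading the
per-task "stack" file normally requires root.
"""
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of a /proc/.../stat file.

    comm is enclosed in parentheses and may itself contain spaces or ')',
    so the fields are split at the last ')' instead of on whitespace.
    """
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def numeric_entries(path):
    """Return the all-digit directory entries of path, sorted numerically."""
    try:
        names = os.listdir(path)
    except OSError:
        return []  # directory vanished (task exited) or is not readable
    return sorted((n for n in names if n.isdigit()), key=int)


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, tid, comm, stack) for every task currently in state D."""
    for pid in numeric_entries(proc_root):
        task_dir = os.path.join(proc_root, pid, "task")
        for tid in numeric_entries(task_dir):
            base = os.path.join(task_dir, tid)
            try:
                with open(os.path.join(base, "stat"), errors="replace") as f:
                    comm, state = parse_stat(f.read())
                if state != "D":
                    continue
                with open(os.path.join(base, "stack"), errors="replace") as f:
                    stack = f.read()
            except (OSError, ValueError):
                continue  # task vanished, stack unreadable, or odd stat line
            yield pid, tid, comm, stack


def main():
    for pid, tid, comm, stack in blocked_tasks():
        print(f"# /proc/{pid}/task/{tid}: {comm}")
        print(stack)


if __name__ == "__main__":
    main()
```

Such a scan is racy by nature, since tasks can exit or change state while it runs, so the sketch simply skips anything that disappears. Running it a few times some seconds apart would help to tell tasks that are permanently stuck from tasks that are only briefly waiting for I/O.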
Thanks
Donald
--
Donald Buczek
buczek@...gen.mpg.de
Tel: +49 30 8413 14