Message-ID: <6e010dbb-f125-4f44-9b1a-9e6ac9bb66ff@molgen.mpg.de>
Date: Thu, 21 Mar 2024 10:15:04 +0100
From: Donald Buczek <buczek@...gen.mpg.de>
To: Alexander Viro <viro@...iv.linux.org.uk>,
 Christian Brauner <brauner@...nel.org>, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, it+linux@...gen.mpg.de
Subject: possible 6.6 regression: Deadlock involving super_lock()

Hi,

we have a set of six systems with similar usage patterns which ran on 5.15 kernels for over a year. Only two weeks after we switched one of these systems from a 5.15 kernel to a 6.6 kernel, it went into a deadlock. I'm aware that I don't have enough information for this to be fully analyzed, but I thought I'd drop it here anyway, because the deadlock seems to involve the locking of a superblock and I've seen that some changes in that area went into 6.6. Maybe someone has an idea, or suggestions for further inspection if this happens the next time.

These systems

- use the automounter a lot (many mount/umount events)
- use NFS a lot (most data is on remote filesystems over NFS)
- are used interactively (users occasionally overload some resource like memory, cores or the network)

When we noticed the problem, several processes were blocked, including the automounter, which was waiting for a mount that didn't complete:

# # /proc/73777/task/73777: mount.nfs : /sbin/mount.nfs rabies:/amd/rabies/M/MG009/project/avitidata /project/avitidata -s -o rw,nosuid,sec=mariux
# cat /proc/73777/task/73777/stack

[<0>] super_lock+0x40/0x140
[<0>] grab_super+0x29/0xc0
[<0>] grab_super_dead+0x2e/0x140
[<0>] sget_fc+0x1e1/0x2d0
[<0>] nfs_get_tree_common+0x86/0x520 [nfs]
[<0>] vfs_get_tree+0x21/0xb0
[<0>] nfs_do_submount+0x128/0x180 [nfs]
[<0>] nfs4_submount+0x566/0x6d0 [nfsv4]
[<0>] nfs_d_automount+0x16b/0x230 [nfs]
[<0>] __traverse_mounts+0x8f/0x210
[<0>] step_into+0x32a/0x740
[<0>] link_path_walk.part.0.constprop.0+0x246/0x380
[<0>] path_lookupat+0x3e/0x190
[<0>] filename_lookup+0xe8/0x1f0
[<0>] vfs_path_lookup+0x52/0x80
[<0>] mount_subtree+0xa0/0x150
[<0>] do_nfs4_mount+0x269/0x360 [nfsv4]
[<0>] nfs4_try_get_tree+0x48/0xd0 [nfsv4]
[<0>] vfs_get_tree+0x21/0xb0
[<0>] path_mount+0x79e/0xa50
[<0>] __x64_sys_mount+0x11a/0x150
[<0>] do_syscall_64+0x46/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
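
If I read the stack right, the mount is waiting in super_lock() for an existing nfs superblock (found by sget_fc()) to either finish being set up or to die. If this happens again, I'll try to correlate that with the client's view of its servers and volumes, along these lines (untested):

# cat /proc/fs/nfsfs/servers
# cat /proc/fs/nfsfs/volumes
# grep nfs /proc/mounts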


Also, one writeback thread was blocked. I mention that because I don't see how these two threads could depend on each other:

# # /proc/39359/task/39359: kworker/u268:5+flush-0:58 : 
# cat /proc/39359/task/39359/stack

[<0>] folio_wait_bit_common+0x135/0x350
[<0>] write_cache_pages+0x1a0/0x3a0
[<0>] nfs_writepages+0x12a/0x1e0 [nfs]
[<0>] do_writepages+0xcf/0x1e0
[<0>] __writeback_single_inode+0x46/0x3a0
[<0>] writeback_sb_inodes+0x1f5/0x4d0
[<0>] __writeback_inodes_wb+0x4c/0xf0
[<0>] wb_writeback+0x1f5/0x320
[<0>] wb_workfn+0x350/0x4f0
[<0>] process_one_work+0x142/0x300
[<0>] worker_thread+0x2f5/0x410
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x34/0x50
[<0>] ret_from_fork_asm+0x1b/0x30

As a result, of course, more and more processes were blocked. A full list of all stack traces and some more info from the system in the blocked state is at https://owww.molgen.mpg.de/~buczek/2024-03-18_mount/info.log

dmesg is not included in that file, but I've reviewed it and there was nothing unusual in it.
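
If it happens again, I'll also try to capture a bit more state before rebooting, roughly along these lines (untested sketch; sysrq-w dumps all blocked tasks to dmesg):

# echo w > /proc/sysrq-trigger
# for p in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do echo "== $p $(cat /proc/$p/comm)"; cat /proc/$p/stack; done
# grep -E 'Dirty|Writeback' /proc/meminfo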


Thanks

  Donald

-- 
Donald Buczek
buczek@...gen.mpg.de
Tel: +49 30 8413 14  
