linux-kernel - Re: [PATCH] vfs: Fix potential circular locking through setxattr() and removexattr()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240723104533.mznf3svde36w6izp@quack3>
Date: Tue, 23 Jul 2024 12:45:33 +0200
From: Jan Kara <jack@...e.cz>
To: David Howells <dhowells@...hat.com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
	Jeff Layton <jlayton@...nel.org>, Gao Xiang <xiang@...nel.org>,
	Matthew Wilcox <willy@...radead.org>, netfs@...ts.linux.dev,
	linux-erofs@...ts.ozlabs.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] vfs: Fix potential circular locking through setxattr()
 and removexattr()

On Tue 23-07-24 09:59:54, David Howells wrote:
> When using cachefiles, lockdep may emit something similar to the circular
> locking dependency notice below.  The problem appears to stem from the
> following:
> 
>  (1) Cachefiles manipulates xattrs on the files in its cache when called
>      from ->writepages().
> 
>  (2) The setxattr() and removexattr() system call handlers get the name
>      (and value) from userspace after taking the sb_writers lock, putting
>      accesses of the vma->vm_lock and mm->mmap_lock inside of that.
> 
>  (3) The afs filesystem uses a per-inode lock to prevent multiple
>      revalidation RPCs and in writeback vs truncate to prevent parallel
>      operations from deadlocking against the server on one side and local
>      page locks on the other.
> 
> Fix this by moving the getting of the name and value in {get,remove}xattr()
> outside of the sb_writers lock.  This also has the minor benefits that we
> don't need to reget these in the event of a retry and we never try to take
> the sb_writers lock in the event we can't pull the name and value into the
> kernel.

Well, it seems like you are trying to get rid of the dependency
sb_writers->mmap_sem. But there are other places where this dependency is
created, in particular write(2) path is a place where it would be very
difficult to get rid of it (you take sb_writers, then do all the work
preparing the write and then you copy user data into page cache which
may require mmap_sem).

But looking at the lockdep splat below:

>  ======================================================
>  WARNING: possible circular locking dependency detected
>  6.10.0-build2+ #956 Not tainted
>  ------------------------------------------------------
>  fsstress/6050 is trying to acquire lock:
>  ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0
> 
>  but task is already holding lock:
>  ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250
> 
>  which lock already depends on the new lock.
> 
>  the existing dependency chain (in reverse order) is:
> 
>  -> #4 (&vma->vm_lock->lock){++++}-{3:3}:
>         __lock_acquire+0xaf0/0xd80
>         lock_acquire.part.0+0x103/0x280
>         down_write+0x3b/0x50
>         vma_start_write+0x6b/0xa0
>         vma_link+0xcc/0x140
>         insert_vm_struct+0xb7/0xf0
>         alloc_bprm+0x2c1/0x390
>         kernel_execve+0x65/0x1a0
>         call_usermodehelper_exec_async+0x14d/0x190
>         ret_from_fork+0x24/0x40
>         ret_from_fork_asm+0x1a/0x30
> 
>  -> #3 (&mm->mmap_lock){++++}-{3:3}:
>         __lock_acquire+0xaf0/0xd80
>         lock_acquire.part.0+0x103/0x280
>         __might_fault+0x7c/0xb0
>         strncpy_from_user+0x25/0x160
>         removexattr+0x7f/0x100
>         __do_sys_fremovexattr+0x7e/0xb0
>         do_syscall_64+0x9f/0x100
>         entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
>  -> #2 (sb_writers#14){.+.+}-{0:0}:
>         __lock_acquire+0xaf0/0xd80
>         lock_acquire.part.0+0x103/0x280
>         percpu_down_read+0x3c/0x90
>         vfs_iocb_iter_write+0xe9/0x1d0
>         __cachefiles_write+0x367/0x430
>         cachefiles_issue_write+0x299/0x2f0
>         netfs_advance_write+0x117/0x140
>         netfs_write_folio.isra.0+0x5ca/0x6e0
>         netfs_writepages+0x230/0x2f0
>         afs_writepages+0x4d/0x70
>         do_writepages+0x1e8/0x3e0
>         filemap_fdatawrite_wbc+0x84/0xa0
>         __filemap_fdatawrite_range+0xa8/0xf0
>         file_write_and_wait_range+0x59/0x90
>         afs_release+0x10f/0x270
>         __fput+0x25f/0x3d0
>         __do_sys_close+0x43/0x70
>         do_syscall_64+0x9f/0x100
>         entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is the problematic step - from quite deep in the locking chain holding
invalidate_lock and having PG_Writeback set you suddently jump to very outer
locking context grabbing sb_writers. Now AFAICT this is not a real deadlock
problem because the locks are actually on different filesystems, just
lockdep isn't able to see this. So I don't think you will get rid of these
lockdep splats unless you somehow manage to convey to lockdep that there's
the "upper" fs (AFS in this case) and the "lower" fs (the one behind
cachefiles) and their locks are different.

>  -> #1 (&vnode->validate_lock){++++}-{3:3}:
>         __lock_acquire+0xaf0/0xd80
>         lock_acquire.part.0+0x103/0x280
>         down_read+0x95/0x200
>         afs_writepages+0x37/0x70
>         do_writepages+0x1e8/0x3e0
>         filemap_fdatawrite_wbc+0x84/0xa0
>         filemap_invalidate_inode+0x167/0x1e0
>         netfs_unbuffered_write_iter+0x1bd/0x2d0
>         vfs_write+0x22e/0x320
>         ksys_write+0xbc/0x130
>         do_syscall_64+0x9f/0x100
>         entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
>  -> #0 (mapping.invalidate_lock#3){++++}-{3:3}:
>         check_noncircular+0x119/0x160
>         check_prev_add+0x195/0x430
>         __lock_acquire+0xaf0/0xd80
>         lock_acquire.part.0+0x103/0x280
>         down_read+0x95/0x200
>         filemap_fault+0x26e/0x8b0
>         __do_fault+0x57/0xd0
>         do_pte_missing+0x23b/0x320
>         __handle_mm_fault+0x2d4/0x320
>         handle_mm_fault+0x14f/0x260
>         do_user_addr_fault+0x2a2/0x500
>         exc_page_fault+0x71/0x90
>         asm_exc_page_fault+0x22/0x30

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR