linux-kernel - Re: [PATCH] vfs: Fix potential circular locking through setxattr() and removexattr()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240723-aberkennen-unruhen-61570127dc6e@brauner>
Date: Tue, 23 Jul 2024 13:11:51 +0200
From: Christian Brauner <brauner@...nel.org>
To: Jan Kara <jack@...e.cz>
Cc: David Howells <dhowells@...hat.com>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Jeff Layton <jlayton@...nel.org>, Gao Xiang <xiang@...nel.org>, 
	Matthew Wilcox <willy@...radead.org>, netfs@...ts.linux.dev, linux-erofs@...ts.ozlabs.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] vfs: Fix potential circular locking through setxattr()
 and removexattr()

On Tue, Jul 23, 2024 at 12:45:33PM GMT, Jan Kara wrote:
> On Tue 23-07-24 09:59:54, David Howells wrote:
> > When using cachefiles, lockdep may emit something similar to the circular
> > locking dependency notice below.  The problem appears to stem from the
> > following:
> > 
> >  (1) Cachefiles manipulates xattrs on the files in its cache when called
> >      from ->writepages().
> > 
> >  (2) The setxattr() and removexattr() system call handlers get the name
> >      (and value) from userspace after taking the sb_writers lock, putting
> >      accesses of the vma->vm_lock and mm->mmap_lock inside of that.
> > 
> >  (3) The afs filesystem uses a per-inode lock to prevent multiple
> >      revalidation RPCs and in writeback vs truncate to prevent parallel
> >      operations from deadlocking against the server on one side and local
> >      page locks on the other.
> > 
> > Fix this by moving the getting of the name and value in {get,remove}xattr()
> > outside of the sb_writers lock.  This also has the minor benefits that we
> > don't need to reget these in the event of a retry and we never try to take
> > the sb_writers lock in the event we can't pull the name and value into the
> > kernel.
> 
> Well, it seems like you are trying to get rid of the dependency
> sb_writers->mmap_sem. But there are other places where this dependency is

Independent of this issue, I think that moving the retrieval of name and
value out of the lock is a good thing. The commit message might need to
get reworded of course.

> created, in particular write(2) path is a place where it would be very
> difficult to get rid of it (you take sb_writers, then do all the work
> preparing the write and then you copy user data into page cache which
> may require mmap_sem).
> 
> But looking at the lockdep splat below:
> 
> >  ======================================================
> >  WARNING: possible circular locking dependency detected
> >  6.10.0-build2+ #956 Not tainted
> >  ------------------------------------------------------
> >  fsstress/6050 is trying to acquire lock:
> >  ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0
> > 
> >  but task is already holding lock:
> >  ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250
> > 
> >  which lock already depends on the new lock.
> > 
> >  the existing dependency chain (in reverse order) is:
> > 
> >  -> #4 (&vma->vm_lock->lock){++++}-{3:3}:
> >         __lock_acquire+0xaf0/0xd80
> >         lock_acquire.part.0+0x103/0x280
> >         down_write+0x3b/0x50
> >         vma_start_write+0x6b/0xa0
> >         vma_link+0xcc/0x140
> >         insert_vm_struct+0xb7/0xf0
> >         alloc_bprm+0x2c1/0x390
> >         kernel_execve+0x65/0x1a0
> >         call_usermodehelper_exec_async+0x14d/0x190
> >         ret_from_fork+0x24/0x40
> >         ret_from_fork_asm+0x1a/0x30
> > 
> >  -> #3 (&mm->mmap_lock){++++}-{3:3}:
> >         __lock_acquire+0xaf0/0xd80
> >         lock_acquire.part.0+0x103/0x280
> >         __might_fault+0x7c/0xb0
> >         strncpy_from_user+0x25/0x160
> >         removexattr+0x7f/0x100
> >         __do_sys_fremovexattr+0x7e/0xb0
> >         do_syscall_64+0x9f/0x100
> >         entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> >  -> #2 (sb_writers#14){.+.+}-{0:0}:
> >         __lock_acquire+0xaf0/0xd80
> >         lock_acquire.part.0+0x103/0x280
> >         percpu_down_read+0x3c/0x90
> >         vfs_iocb_iter_write+0xe9/0x1d0
> >         __cachefiles_write+0x367/0x430
> >         cachefiles_issue_write+0x299/0x2f0
> >         netfs_advance_write+0x117/0x140
> >         netfs_write_folio.isra.0+0x5ca/0x6e0
> >         netfs_writepages+0x230/0x2f0
> >         afs_writepages+0x4d/0x70
> >         do_writepages+0x1e8/0x3e0
> >         filemap_fdatawrite_wbc+0x84/0xa0
> >         __filemap_fdatawrite_range+0xa8/0xf0
> >         file_write_and_wait_range+0x59/0x90
> >         afs_release+0x10f/0x270
> >         __fput+0x25f/0x3d0
> >         __do_sys_close+0x43/0x70
> >         do_syscall_64+0x9f/0x100
> >         entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> This is the problematic step - from quite deep in the locking chain holding
> invalidate_lock and having PG_Writeback set you suddently jump to very outer
> locking context grabbing sb_writers. Now AFAICT this is not a real deadlock
> problem because the locks are actually on different filesystems, just
> lockdep isn't able to see this. So I don't think you will get rid of these
> lockdep splats unless you somehow manage to convey to lockdep that there's
> the "upper" fs (AFS in this case) and the "lower" fs (the one behind
> cachefiles) and their locks are different.
> 
> >  -> #1 (&vnode->validate_lock){++++}-{3:3}:
> >         __lock_acquire+0xaf0/0xd80
> >         lock_acquire.part.0+0x103/0x280
> >         down_read+0x95/0x200
> >         afs_writepages+0x37/0x70
> >         do_writepages+0x1e8/0x3e0
> >         filemap_fdatawrite_wbc+0x84/0xa0
> >         filemap_invalidate_inode+0x167/0x1e0
> >         netfs_unbuffered_write_iter+0x1bd/0x2d0
> >         vfs_write+0x22e/0x320
> >         ksys_write+0xbc/0x130
> >         do_syscall_64+0x9f/0x100
> >         entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> >  -> #0 (mapping.invalidate_lock#3){++++}-{3:3}:
> >         check_noncircular+0x119/0x160
> >         check_prev_add+0x195/0x430
> >         __lock_acquire+0xaf0/0xd80
> >         lock_acquire.part.0+0x103/0x280
> >         down_read+0x95/0x200
> >         filemap_fault+0x26e/0x8b0
> >         __do_fault+0x57/0xd0
> >         do_pte_missing+0x23b/0x320
> >         __handle_mm_fault+0x2d4/0x320
> >         handle_mm_fault+0x14f/0x260
> >         do_user_addr_fault+0x2a2/0x500
> >         exc_page_fault+0x71/0x90
> >         asm_exc_page_fault+0x22/0x30
> 
> 								Honza
> -- 
> Jan Kara <jack@...e.com>
> SUSE Labs, CR