linux-kernel - [PATCH] xfs: Fix false positive lockdep warning with sb_internal & fs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20200102155208.8977-1-longman@redhat.com>
Date:   Thu,  2 Jan 2020 10:52:08 -0500
From:   Waiman Long <longman@...hat.com>
To:     "Darrick J. Wong" <darrick.wong@...cle.com>
Cc:     linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
        Dave Chinner <david@...morbit.com>, Qian Cai <cai@....pw>,
        Eric Sandeen <sandeen@...hat.com>,
        Waiman Long <longman@...hat.com>
Subject: [PATCH] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim

Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:

======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G        W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30

but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

which lock already depends on the new lock.
  :
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

4 locks held by fsfreeze/4346:
 #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

stack backtrace:
Call Trace:
 dump_stack+0xe0/0x19a
 print_circular_bug.isra.10.cold.34+0x2f4/0x435
 check_prev_add.constprop.19+0xca1/0x15f0
 validate_chain.isra.14+0x11af/0x3b50
 __lock_acquire+0x728/0x1200
 lock_acquire+0x269/0x5a0
 fs_reclaim_acquire.part.19+0x29/0x30
 fs_reclaim_acquire+0x19/0x20
 kmem_cache_alloc+0x3e/0x3f0
 kmem_zone_alloc+0x79/0x150
 xfs_trans_alloc+0xfa/0x9d0
 xfs_sync_sb+0x86/0x170
 xfs_log_sbcount+0x10f/0x140
 xfs_quiesce_attr+0x134/0x270
 xfs_fs_freeze+0x4a/0x70
 freeze_super+0x1af/0x290
 do_vfs_ioctl+0xedc/0x16c0
 ksys_ioctl+0x41/0x80
 __x64_sys_ioctl+0x73/0xa9
 do_syscall_64+0x18f/0xd23
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

According to Dave Chinner:

  Freezing the filesystem, after all the data has been cleaned. IOWs
  memory reclaim will never run the above writeback path when
  the freeze process is trying to allocate a transaction here because
  there are no dirty data pages in the filesystem at this point.

  Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
  it /doesn't deadlock/ by taking freeze references for the
  transaction. We've just drained all the transactions
  in progress and written back all the dirty metadata, too, and so the
  filesystem is completely clean and only needs the superblock to be
  updated to complete the freeze process. And to do that, it does not
  take a freeze reference because calling sb_start_intwrite() here
  would deadlock.

  IOWs, this is a false positive, caused by the fact that
  xfs_trans_alloc() is called from both above and below memory reclaim
  as well as within /every level/ of freeze processing. Lockdep is
  unable to describe the staged flush logic in the freeze process that
  prevents deadlocks from occurring, and hence we will pretty much
  always see false positives in the freeze path....

Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
may fix the issue. However, that will greatly complicate the logic and
may not be worth it.

Another way to fix it is to disable the taking of the fs_reclaim
pseudo lock when in the freezing code path as a reclaim on the freezed
filesystem is not possible as stated above. This patch takes this
approach by setting the __GFP_NOLOCKDEP flag in the slab memory
allocation calls when the filesystem has been freezed.

Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.

 # fsfreeze -f /home
 # fsfreeze --unfreeze /home
 # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal

After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.

Signed-off-by: Waiman Long <longman@...hat.com>
---
 fs/xfs/kmem.h      | 11 ++++++++++-
 fs/xfs/xfs_log.c   |  3 ++-
 fs/xfs/xfs_trans.c |  8 +++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 6143117770e9..901189dcc474 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -20,6 +20,12 @@ typedef unsigned __bitwise xfs_km_flags_t;
 #define KM_MAYFAIL	((__force xfs_km_flags_t)0x0008u)
 #define KM_ZERO		((__force xfs_km_flags_t)0x0010u)
 
+#ifdef CONFIG_LOCKDEP
+#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0x0020u)
+#else
+#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0)
+#endif
+
 /*
  * We use a special process flag to avoid recursive callbacks into
  * the filesystem during transactions.  We will also issue our own
@@ -30,7 +36,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 {
 	gfp_t	lflags;
 
-	BUG_ON(flags & ~(KM_NOFS|KM_MAYFAIL|KM_ZERO));
+	BUG_ON(flags & ~(KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
 
 	lflags = GFP_KERNEL | __GFP_NOWARN;
 	if (flags & KM_NOFS)
@@ -49,6 +55,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
 	if (flags & KM_ZERO)
 		lflags |= __GFP_ZERO;
 
+	if (flags & KM_NOLOCKDEP)
+		lflags |= __GFP_NOLOCKDEP;
+
 	return lflags;
 }
 
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f6006d94a581..b1997649ecd8 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -454,7 +454,8 @@ xfs_log_reserve(
 	XFS_STATS_INC(mp, xs_try_logspace);
 
 	ASSERT(*ticp == NULL);
-	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0);
+	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
+			mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
 	*ticp = tic;
 
 	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..c0e42e4f5b77 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -262,8 +262,14 @@ xfs_trans_alloc(
 	 * Allocate the handle before we do our freeze accounting and setting up
 	 * GFP_NOFS allocation context so that we avoid lockdep false positives
 	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
+	 *
+	 * To prevent false positive lockdep warning of circular locking
+	 * dependency between sb_internal and fs_reclaim, disable the
+	 * acquisition of the fs_reclaim pseudo-lock when the superblock
+	 * has been frozen or in the process of being frozen.
 	 */
-	tp = kmem_zone_zalloc(xfs_trans_zone, 0);
+	tp = kmem_zone_zalloc(xfs_trans_zone,
+		mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_start_intwrite(mp->m_super);
 
-- 
2.18.1