linux-kernel - Re: [PATCH 0/4] change sb_writers to use percpu_rw

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150810145942.GF3768@quack.suse.cz>
Date:	Mon, 10 Aug 2015 16:59:42 +0200
From:	Jan Kara <jack@...e.cz>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Al Viro <viro@...iv.linux.org.uk>,
	Dave Chinner <david@...morbit.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	Jan Kara <jack@...e.cz>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore

  Hello,

On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> I'll try to re-check/re-test, but so far I think that this and the
> previous series are correct. However 3/4 from the previous series
> (attached at the end just in case) uncovers (I think) some problems
> in xfs locking.
> 
> What should I do now?
> 
> On 07/22, Oleg Nesterov wrote:
> >
> > Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> > ioctl(FITHAW) seems to wors "as expected" on my testing machine
> > with ext3. So probably this needs more testing.
> 
> Finally I got the testing machine and ran xfstests, I did
> 
> 	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
> 	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000
> 
> 	losetup --find --show TEST.img
> 	losetup --find --show SCRATCH.img
> 
> 	mkfs.xfs -f /dev/loop0
> 	mkfs.xfs -f /dev/loop1
> 
> 	mkdir -p TEST SCRATCH
> 
> 	mount /dev/loop0 TEST
> 	mount /dev/loop1 SCRATCH
> 
> 	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
> 	./check `grep -il freeze tests/*/???`
> 
> several times and every time the result looks like below:
> 
> 	FSTYP         -- xfs (non-debug)
> 	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
> 	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
> 	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH
> 
> 	generic/068 59s ... 61s
> 	generic/085 23s ... 26s
> 	generic/280 2s ... 3s
> 	generic/311 169s ... 167s
> 	xfs/011 21s ... 20s
> 	xfs/119 32s ... 21s
> 	xfs/297 455s ... 301s
> 	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
> 	Passed all 7 tests

Hum, I had a look at the lockdep report below and it's one of the
peculiarities of the freeze protection. For record let me repeat the full
argument for why I don't think there's a possibility for a real deadlock.
Feel free to skip to the end if you get bored.  

One would like to construct the lock chain as:

CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
process Y		process X, thread 0	process X, thread 1

			get ILOCK for dir
gets freeze protection
starts transaction in xfs_setattr_nonsize
waits to get ILOCK on 'dir'
						get mmap_sem for X
			wait for mmap_sem for process X
			  in filldir()
						wait for freeze protection in
						  xfs_page_mkwrite

and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
finish it's freeze-protected section. But this cannot happen. The reason is
that we block writers level-by-level and thus while there are writers at
level X, we do not block writers at level X+1. So in this particular case
freeze_super() will block waiting for CPU0 to finish its freeze protected
section while CPU2 is free to continue.

In general we have a chain like

freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
                A                                          |
                \------------------------------------------/

But since ILOCK is always acquired with freeze protection at L0 and we can
block at L1 only after there are no writers at L0, this loop can never
happen.

Note that if we use the property of freezing that lock at level X+1 cannot
block when we hold lock at level X, we can as well simplify the dependency
graph and track in it only the lowest level of freeze lock that is
currently acquired (since the levels above it cannot block and do not in
any way influence blocking of other processes either and thus are
irrelevant for the purpose of deadlock detection). Then the dependency
graph we'd get would be:

freeze L0 -> ILOCK -> mmap_sem -> freeze L1

and we have a nice acyclic graph we like to see... So probably we have to
hack the lockdep instrumentation some more and just don't tell lockdep
about freeze locks at higher levels if we already hold a lock at lower
level. Thoughts?

								Honza

> But with CONFIG_LOCKDEP=y generic/068 triggers the warning:
> 
> 	[   66.092831] ======================================================
> 	[   66.099726] [ INFO: possible circular locking dependency detected ]
> 	[   66.106719] 4.1.0+ #2 Not tainted
> 	[   66.110414] -------------------------------------------------------
> 	[   66.117405] fsstress/2210 is trying to acquire lock:
> 	[   66.122944]  (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
> 	[   66.131637]
> 		       but task is already holding lock:
> 	[   66.138146]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.148022]
> 		       which lock already depends on the new lock.
> 
> 	[   66.157141]
> 		       the existing dependency chain (in reverse order) is:
> 	[   66.165490]
> 		       -> #4 (&xfs_dir_ilock_class){++++..}:
> 	[   66.170974]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.177596]        [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
> 	[   66.184605]        [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
> 	[   66.191645]        [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
> 	[   66.199638]        [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
> 	[   66.206950]        [<ffffffff81279411>] notify_change+0x271/0x3a0
> 	[   66.213762]        [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
> 	[   66.221251]        [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
> 	[   66.227576]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 	[   66.234878]
> 		       -> #3 (sb_internal#2){++++++}:
> 	[   66.239698]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.246316]        [<ffffffff81824256>] down_write+0x36/0x70
> 	[   66.252644]        [<ffffffff81100979>] percpu_down_write+0x39/0x110
> 	[   66.259760]        [<ffffffff8125901d>] freeze_super+0xbd/0x190
> 	[   66.266369]        [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
> 	[   66.273082]        [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
> 	[   66.279311]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 	[   66.286610]
> 		       -> #2 (sb_pagefaults#2){++++++}:
> 	[   66.291623]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.298237]        [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
> 	[   66.305144]        [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
> 	[   66.312140]        [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
> 	[   66.319245]        [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
> 	[   66.327533]        [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
> 	[   66.334433]        [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
> 	[   66.341533]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
> 	[   66.348542]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
> 	[   66.355172]        [<ffffffff818286e8>] page_fault+0x28/0x30
> 	[   66.361493]
> 		       -> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
> 	[   66.367659]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.374275]        [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
> 	[   66.381185]        [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
> 	[   66.388212]        [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
> 	[   66.395817]        [<ffffffff81200a9e>] __do_fault+0x4e/0x100
> 	[   66.402239]        [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
> 	[   66.409340]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
> 	[   66.416344]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
> 	[   66.422959]        [<ffffffff818286e8>] page_fault+0x28/0x30
> 	[   66.429283]
> 		       -> #0 (&mm->mmap_sem){++++++}:
> 	[   66.434093]        [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> 	[   66.441194]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.447808]        [<ffffffff8120058f>] might_fault+0x6f/0xa0
> 	[   66.454235]        [<ffffffff8126e05a>] filldir+0x9a/0x130
> 	[   66.460373]        [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> 	[   66.469233]        [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> 	[   66.476449]        [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> 	[   66.483954]        [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> 	[   66.490473]        [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> 	[   66.497091]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 	[   66.504381]
> 		       other info that might help us debug this:
> 
> 	[   66.513316] Chain exists of:
> 			 &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class
> 
> 	[   66.522619]  Possible unsafe locking scenario:
> 
> 	[   66.529222]        CPU0                    CPU1
> 	[   66.534275]        ----                    ----
> 	[   66.539327]   lock(&xfs_dir_ilock_class);
> 	[   66.543823]                                lock(sb_internal#2);
> 	[   66.550465]                                lock(&xfs_dir_ilock_class);
> 	[   66.557767]   lock(&mm->mmap_sem);
> 	[   66.561580]
> 			*** DEADLOCK ***
> 
> 	[   66.568186] 2 locks held by fsstress/2210:
> 	[   66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
> 	[   66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.593457]
> 		       stack backtrace:
> 	[   66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
> 	[   66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> 	[   66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
> 	[   66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
> 	[   66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
> 	[   66.641249] Call Trace:
> 	[   66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
> 	[   66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
> 	[   66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> 	[   66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
> 	[   66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> 	[   66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
> 	[   66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> 	[   66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
> 	[   66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
> 	[   66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> 	[   66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> 	[   66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
> 	[   66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> 	[   66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> 	[   66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
> 	[   66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> 	[   66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
> 	[   66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 
> The chain reported by lockdep is not exactly the same every time,
> but similar.
> 
> Once again, I'll recheck. but the patch below still looks correct
> to me, and I think that it is actually a fix.
> 
> Before this patch freeze_super() calls sync_filesystem() and
> s_op->freeze_fs(sb) without ->s_writers locks (as it seen by
> lockdep) and this is wrong.
> 
> After this patch lockdep knows about ->s_writers locks held and this
> is what we want, but apparently this way lockdep can notice the new
> dependencies.
> 
> Oleg.
> 
> ------------------------------------------------------------------------------
> Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()
> 
> Move the "fool the lockdep" code from sb_wait_write() into the new
> simple helper, sb_lockdep_release(), called once before return from
> freeze_super().
> 
> This is preparation, but imo this makes sense in any case. This way
> we do not hide from lockdep the "write" locks we hold when we call
> s_op->freeze_fs(sb).
> 
> Signed-off-by: Oleg Nesterov <oleg@...hat.com>
> ---
>  fs/super.c |   16 ++++++++++------
>  1 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d0fdd49..e7ea9f1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	s64 writers;
>  
> -	/*
> -	 * We just cycle-through lockdep here so that it does not complain
> -	 * about returning with lock to userspace
> -	 */
>  	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
>  
>  	do {
>  		DEFINE_WAIT(wait);
> -
>  		/*
>  		 * We use a barrier in prepare_to_wait() to separate setting
>  		 * of frozen and checking of the counter
> @@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
>  	} while (writers);
>  }
>  
> +static void sb_freeze_release(struct super_block *sb)
> +{
> +	int level;
> +	/* Avoid the warning from lockdep_sys_exit() */
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
> @@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
>  			sb->s_writers.frozen = SB_UNFROZEN;
>  			smp_wmb();
>  			wake_up(&sb->s_writers.wait_unfrozen);
> +			sb_freeze_release(sb);
>  			deactivate_locked_super(sb);
>  			return ret;
>  		}
> @@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> +	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> -- 
> 1.5.5.1
> 
> 
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/