linux-kernel - Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160217205721.GE14140@quack.suse.cz>
Date:	Wed, 17 Feb 2016 21:57:21 +0100
From:	Jan Kara <jack@...e.cz>
To:	Tejun Heo <tj@...nel.org>
Cc:	Jens Axboe <axboe@...nel.dk>, Tahsin Erdogan <tahsin@...gle.com>,
	cgroups@...r.kernel.org, Theodore Ts'o <tytso@....edu>,
	Nauman Rafique <nauman@...gle.com>,
	linux-kernel@...r.kernel.org, Jan Kara <jack@...e.com>
Subject: Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned
 during cgroup writeback association switches

On Tue 16-02-16 13:24:57, Tejun Heo wrote:
> From 586afaa034bec88934bad4eb6ab38ba07031ec5a Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@...nel.org>
> Date: Tue, 16 Feb 2016 13:14:35 -0500
> 
> If cgroup writeback is in use, an inode is associated with a cgroup
> for writeback.  If the inode's main dirtier changes to another cgroup,
> the association gets updated asynchronously.  Nothing was pinning the
> superblock while such switches are in progress and superblock could go
> away while async switching is pending or in progress leading to
> crashes like the following.
> 
>  kernel BUG at fs/jbd2/transaction.c:319!
>  invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>  CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
>  Hardware name: Google Google, BIOS Google 01/01/2011
>  Workqueue: events inode_switch_wbs_work_fn
>  task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
>  RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
>  RSP: 0018:ffff880209267c30  EFLAGS: 00010202
>  ...
>  Call Trace:
>   [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
>   [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
>   [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
>   [<ffffffff8035338b>] evict+0xbb/0x190
>   [<ffffffff80354190>] iput+0x130/0x190
>   [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
>   [<ffffffff80279819>] process_one_work+0x129/0x300
>   [<ffffffff80279b16>] worker_thread+0x126/0x480
>   [<ffffffff8027ed14>] kthread+0xc4/0xe0
>   [<ffffffff809771df>] ret_from_fork+0x3f/0x70
> 
> Fix it by bumping s_active while cgroup association switching is in
> flight.
> 
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Reported-and-tested-by: Tahsin Erdogan <tahsin@...gle.com>
> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
> Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
> Cc: stable@...r.kernel.org #v4.5+

Well, but this has the side-effect that trying to umount a filesystem while
migrations are happening will result in EBUSY error. Without obvious reason
why that happens. As an admin I would be rather upset when umount sometimes
returns EBUSY without apparent reason and you have to basically implement a
loop around umount to make it reliable. So a nack from me for this patch.

Traditionally, we have used sb->s_count and sb->s_umount semaphore to pin
superblock while writeback code was working on it. That makes umount block
until we can safely unmount the filesystem and thus doesn't result in these
spurious EBUSY errors. But from a quick look this can be problematic for the
cgroup setting.

Alternatively, you could either cancel all the switching work when
unmounting filesystem or maybe just handle I_WB_SWITCH similarly to I_SYNC
- don't grab inode reference when switching is going on, just make
I_WB_SWITCH pin the inode and wait in evict() for it to be clear (similarly
as we call inode_wait_for_writeback() there).

								Honza
> ---
>  fs/fs-writeback.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6915c95..1f76d89 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -317,6 +317,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
>  	struct inode_switch_wbs_context *isw =
>  		container_of(work, struct inode_switch_wbs_context, work);
>  	struct inode *inode = isw->inode;
> +	struct super_block *sb = inode->i_sb;
>  	struct address_space *mapping = inode->i_mapping;
>  	struct bdi_writeback *old_wb = inode->i_wb;
>  	struct bdi_writeback *new_wb = isw->new_wb;
> @@ -423,6 +424,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
>  	wb_put(new_wb);
>  
>  	iput(inode);
> +	deactivate_super(sb);
>  	kfree(isw);
>  }
>  
> @@ -469,11 +471,14 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
>  
>  	/* while holding I_WB_SWITCH, no one else can update the association */
>  	spin_lock(&inode->i_lock);
> +
>  	if (inode->i_state & (I_WB_SWITCH | I_FREEING) ||
> -	    inode_to_wb(inode) == isw->new_wb) {
> -		spin_unlock(&inode->i_lock);
> -		goto out_free;
> -	}
> +	    inode_to_wb(inode) == isw->new_wb)
> +		goto out_unlock;
> +
> +	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
> +		goto out_unlock;
> +
>  	inode->i_state |= I_WB_SWITCH;
>  	spin_unlock(&inode->i_lock);
>  
> @@ -489,6 +494,8 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
>  	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
>  	return;
>  
> +out_unlock:
> +	spin_unlock(&inode->i_lock);
>  out_free:
>  	if (isw->new_wb)
>  		wb_put(isw->new_wb);
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR