linux-kernel - Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <YJ57u0v5K9MXQxoP@carbon.DHCP.thefacebook.com>
Date:   Fri, 14 May 2021 06:31:39 -0700
From:   Roman Gushchin <guro@...com>
To:     Jan Kara <jack@...e.cz>
CC:     Tejun Heo <tj@...nel.org>, <linux-fsdevel@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Dennis Zhou <dennis@...nel.org>,
        Dave Chinner <dchinner@...hat.com>, <cgroups@...r.kernel.org>
Subject: Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying
 memory cgroups

On Fri, May 14, 2021 at 01:23:20PM +0200, Jan Kara wrote:
> On Thu 13-05-21 11:12:16, Roman Gushchin wrote:
> > On Thu, May 13, 2021 at 12:12:39PM +0200, Jan Kara wrote:
> > > On Wed 12-05-21 17:42:58, Roman Gushchin wrote:
> > > > +			WARN_ON_ONCE(inode->i_wb != wb);
> > > > +			inode->i_wb = NULL;
> > > > +			wb_put(wb);
> > > > +			list_del_init(&inode->i_io_list);
> > > 
> > > So I was thinking about this and I'm still a bit nervous that setting i_wb
> > > to NULL is going to cause subtle crashes somewhere. Granted you are very
> > > careful when not to touch the inode but still, even stuff like
> > > inode_to_bdi() is not safe to call with inode->i_wb is NULL. So I'm afraid
> > > that some place in the writeback code will be looking at i_wb without
> > > having any of those bits set and boom. E.g. inode_to_wb() call in
> > > test_clear_page_writeback() - what protects that one?
> > 
> > I assume that if the page is dirty/under the writeback, the inode must be
> > dirty too, so we can't zero inode->i_wb.
> 
> If page is under writeback, the inode can be already clean. You could check
>   !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)
> but see how fragile it is... For every place using inode_to_wb() you have
> to come up with a test excluding it...
> 
> > > I forgot what possibilities did we already discuss in the past but cannot
> > > we just switch inode->i_wb to inode_to_bdi(inode)->wb (i.e., root cgroup
> > > writeback structure)? That would be certainly safer...
> > 
> > I am/was nervous too. I had several BUG_ON()'s in all such places and ran
> > the test kernel for about a day on my dev desktop (even updated to Fedora
> > 34 lol), and haven't seen any panics. And certainly I can give it a
> > comprehensive test in a more production environment.
> 
> I appreciate the testing but it is also about how likely this is to break
> sometime in future because someone unaware of this corner-case will add new
> inode_to_wb() call not excluded by one of your conditions.

Ok, you convinced me, will change in the next version.

> 
> > Re switching to the root wb: it's certainly a possibility too, however
> > zeroing has an advantage: the next potential writeback will be accounted
> > to the right cgroup without a need in an additional switch.
> > I'd try to go with zeroing if it's possible, and keep the switching to the
> > root wb as plab B.
> 
> Yes, there will be the cost of an additional switch. But inodes attached to
> dying cgroups shouldn't be the fast path anyway so does it matter?
> 
> > > > @@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> > > >  	mutex_unlock(&bdi->cgwb_release_mutex);
> > > >  }
> > > >  
> > > > +/**
> > > > + * cleanup_offline_cgwbs - try to release dying cgwbs
> > > > + *
> > > > + * Try to release dying cgwbs by switching attached inodes to the wb
> > > > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > > > + * end of the list to guarantee the forward progress.
> > > > + *
> > > > + * Should be called with the acquired cgwb_lock lock, which might
> > > > + * be released and re-acquired in the process.
> > > > + */
> > > > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > > > +{
> > > > +	struct bdi_writeback *wb;
> > > > +	LIST_HEAD(processed);
> > > > +
> > > > +	spin_lock_irq(&cgwb_lock);
> > > > +
> > > > +	while (!list_empty(&offline_cgwbs)) {
> > > > +		wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > > > +				      offline_node);
> > > > +
> > > > +		list_move(&wb->offline_node, &processed);
> > > > +
> > > > +		if (wb_has_dirty_io(wb))
> > > > +			continue;
> > > > +
> > > > +		if (!percpu_ref_tryget(&wb->refcnt))
> > > > +			continue;
> > > > +
> > > > +		spin_unlock_irq(&cgwb_lock);
> > > > +		cleanup_offline_wb(wb);
> > > > +		spin_lock_irq(&cgwb_lock);
> > > > +
> > > > +		wb_put(wb);
> > > > +	}
> > > > +
> > > > +	if (!list_empty(&processed))
> > > > +		list_splice_tail(&processed, &offline_cgwbs);
> > > > +
> > > > +	spin_unlock_irq(&cgwb_lock);
> > > 
> > > Shouldn't we reschedule this work with some delay if offline_cgwbs is
> > > non-empty? Otherwise we can end up with non-empty &offline_cgwbs and no
> > > cleaning scheduled...
> > 
> > I'm not sure. In general it's not a big problem to have few outstanding
> > wb structures, we just need to make sure we don't pile them.
> > I'm scheduling the cleanup from the memcg offlining path, so if new cgroups
> > are regularly created and destroyed, some pressure will be applied.
> > 
> > To reschedule based on time we need to come up with some heuristic how to
> > calculate the required delay and I don't have any specific ideas. If you do,
> > I'm totally fine to incorporate them.
> 
> Well, I'd pick e.g. dirty_expire_interval (30s by default) as a time after
> which we try again. Because after that time writeback has likely already
> happened. But I don't insist here. If you think leaving inodes attached to
> dead cgroups for potentially long time in some cases isn't a problem, then
> we can leave this as is. If we find it's a problem in the future, we can
> always add the time-based retry.

Yes, I agree, we can add it later.

Thank you!

Roman