linux-kernel - Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1489056471.2791.2.camel@redhat.com>
Date:   Thu, 09 Mar 2017 05:47:51 -0500
From:   Jeff Layton <jlayton@...hat.com>
To:     Jan Kara <jack@...e.cz>, Theodore Ts'o <tytso@....edu>
Cc:     Ross Zwisler <ross.zwisler@...ux.intel.com>,
        viro@...iv.linux.org.uk, konishi.ryusuke@....ntt.co.jp,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, linux-nilfs@...r.kernel.org,
        NeilBrown <neilb@...e.com>
Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting
 business

On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > On a more general note (DAX is actually fine here), I find the current
> > > practice of clearing page dirty bits on error and reporting it just once
> > > problematic. It keeps the system running but data is lost and possibly
> > > without getting the error anywhere where it is useful. We get away with
> > > this because it is a rare event but it seems like a problematic behavior.
> > > But this is more for the discussion at LSF.
> > 
> > I'm actually running into this in the last day or two because some MM
> > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > ext4 (at least when we are holding some mutex/semaphore like
> > i_data_sem) because otherwise it's possible for the OOM killer to be
> > unable to kill processes because they are holding on to locks that
> > ext4 is holding.
> > 
> > I've done some initial investigation, and while it's not that hard to
> > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > is where we had been are running into problems), a really, REALLY big
> > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > silent data loss, because the pages are marked clean, and so data
> > written using buffered writeback goes *poof*.
> > 
> > I confirmed this by creating a test kernel with a simple patch such
> > that if the ext4 file system is mounted with -o debug, there was a 1
> > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > (and printk the inode number, so I knew which inodes had gotten the
> > ENOMEM treatment).  The result was **NOT** pretty.
> > 
> > What I think we should strongly consider is at the very least, special
> > case ENOMEM being returned by writepages() during background
> > writeback, and *not* mark the pages clean, and make sure the inode
> > stays on the dirty inode list, so we can retry the write later.  This
> > is especially important since the process that issued the write may
> > have gone away, so there might not even be a userspace process to
> > complain to.  By converting certain page allocations (most notably in
> > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > release the i_data_sem lock and return an error.  This should allow
> > allow the OOM killer to do its dirty deed, and hopefully we can retry
> > the writepages() for that inode later.
> 
> Yeah, so if we can hope the error is transient, keeping pages dirty and
> retrying the write is definitely better option. For start we can say that
> ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> there's no hope of getting data to disk and so we just discard them. It
> will be somewhat rough distinction but probably better than what we have
> now.
> 
> 								Honza

I'm not sure about ENOSPC there. That's a return code that is
specifically expected to be returned by fsync. It seems like that ought
not be considered a transient error?

-- 
Jeff Layton <jlayton@...hat.com>