linux-kernel - Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120826222607.GD19235@dastard>
Date:	Mon, 27 Aug 2012 08:26:07 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
Cc:	Andi Kleen <andi.kleen@...el.com>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Tony Luck <tony.luck@...el.com>,
	Rik van Riel <riel@...hat.com>,
	Jun'ichi Nomura <j-nomura@...jp.nec.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep
 AS_HWPOISON sticky

On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > > > > where we have possibilities of data lost. This is because in this fix
> > > > > AS_HWPOISON is cleared when the inode cache is dropped.
> > ....
> > > > > --- v3.6-rc1.orig/mm/truncate.c
> > > > > +++ v3.6-rc1/mm/truncate.c
> > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
> > > > >  
> > > > >  	oldsize = inode->i_size;
> > > > >  	i_size_write(inode, newsize);
> > > > > +	if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > > > +		mapping_clear_hwpoison(inode->i_mapping);
> > > > 
> > > > So only a truncate to zero size will clear the poison flag?
> > > 
> > > Yes, this is because we only know if the file is affected by hwpoison,
> > > but not where the hwpoisoned page is in the file. We could remember it,
> > > but I did not do it for simplicity.
> > 
> > Surely the page has flags on it to say it is poisoned? That is,
> > after truncating the page cache, if the address space is poisoned,
> > then you can do a pass across the mapping tree checking if there are
> > any poisoned pages left? Or perhaps adding a new mapping tree tag so
> > that the poisoned status is automatically determined by the presence
> > of the poisoned page in the mapping tree?
> 
> The answer for the first question is yes. And for the second question,
> I don't think it's easy because the mapping tree has no reference to
> the error page (I explain more about this below, please see also it,)
> and it can cost too much to search poisoned pages over page cache in
> each request.

Which is my point about a radix tree tag - that's very efficient.

> And for the third question, I think we could do this, but to do it
> we need an additional space (8 bytes) in struct radix_tree_node.
> Considering that this new tag is not used so frequently, so I'm not
> sure that it's worth the cost.

A radix tree node is currently 560 bytes on x86_64, packed 7 to a
page. i.e. using 3920 bytes. We can add another 8 bytes to it
without increasing memory usage at all. So, no cost at all.

> > > > What happens if it is the last page in the mapping that is poisoned,
> > > > and we truncate that away? Shouldn't that clear the poisoned bit?
> > > 
> > > When we handle the hwpoisoned inode, the error page should already
> > > be removed from pagecache, so the remaining pages are not related
> > > to the error and we need not care about them when we consider bit
> > > clearing.
> > 
> > Sorry, I don't follow. What removed the page from the page cache?
> > The truncate_page_cache() call that follows the above code hunk is
> > what does that, so I don't see how it can already be removed from
> > the page cache....
> 
> Memory error handler (memory_failure() in mm/memory-failure.c) has
> removed the error page from the page cache.
> And truncate_page_cache() that follows this hunk removes all pages
> belonging to the page cache of the poisoned file (where the error
> page itself is not included in them.)
> 
> Let me explain more to clarify my whole scenario. If a memory error
> hits on a dirty pagecache, kernel works like below:
> 
>   1. handles a MCE interrupt (logging MCE events,)
>   2. calls memory error handler (doing 3 to 6,)
>   3. sets PageHWPoison flag on the error page,
>   4. unmaps all mappings to processes' virtual addresses,

So nothing in userspace sees the bad page after this.

>   5. sets AS_HWPOISON on mappings to which the error page belongs
>   6. invalidates the error page (unlinks it from LRU list and removes
>      it from pagecache,)
>   (memory error handler finished)

Ok, so the moment a memory error is handled, the page has been
removed from the inode's mapping, and it will never be seen by
aplications again. It's a transient error....

>   7. later accesses to the file returns -EIO,
>   8. AS_HWPOISON is cleared when the file is removed or completely
>      truncated.

.... so why do we have to keep an EIO on the inode forever?

If the page is not dirty, then just tossing it from the cache (as
is already done) and rereading it from disk next time it is accessed
removes the need for any error to be reported at all. It's
effectively a transient error at this point, and as such no errors
should be visible from userspace.

If the page is dirty, then it needs to be treated just like any
other failed page write - the page is invalidated and the address
space is marked with AS_EIO, and that is reported to the next
operation that waits on IO on that file (i.e. fsync)

If you have a second application that reads the files that depends
on a guarantee of good data, then the first step in that process is
that application that writes it needs to use fsync to check the data
was written correctly. That ensures that you only have clean pages
in the cache before the writer closes the file, and any h/w error
then devolves to the above transient clean page invalidation case.

Hence I fail to see why this type of IO error needs to be sticky.
The error on the mapping is transient - it is gone as soon as the
page is removed from the mapping. Hence the error can be dropped as
soon as it is reported to userspace because the mapping is now error
free.

> You may think it strange that the condition of clearing AS_HWPOISON
> is checked with file granularity.

I don't think it is strange, I think it is *wrong*.

> This is because currently userspace
> applications know the memory errors only with file granularity for
> simplicity, when they access via read(), write() and/or fsync().

Trying to report this error to every potential future access
regardless of whether the error still exists or the access is to the
poisoned range with magical sticky errors on inode address spaces
and hacks to memory the reclaim subsystems smells to me like a really
bad hack to work around applications that use bad data integrity
practices.

As such, I think you probably need to rethink the approach you are
taking to handling this error. The error is transient w.r.t. to the
mapping and page cache, and needs to be addressed consistently
compared to other transient IO errors that are reported through the
mapping....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/