Message-ID: <20090602125134.GA20462@localhost>
Date:	Tue, 2 Jun 2009 20:51:34 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Nick Piggin <npiggin@...e.de>
Cc:	Andi Kleen <andi@...stfloor.org>,
	"hugh@...itas.com" <hugh@...itas.com>,
	"riel@...hat.com" <riel@...hat.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"chris.mason@...cle.com" <chris.mason@...cle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler
	in the VM v3

On Tue, Jun 02, 2009 at 08:19:40PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > > But you just said that you try to intercept the IO. So the underlying
> > > data is not necessarily corrupt. And even if it was then what if it
> > > was reinitialized to something else in the meantime (such as filesystem
> > > metadata blocks?) You'd just be introducing worse possibilities for
> > > corruption.
> > 
> > The IO interception will be based on PFN instead of file offset, so it
> > won't affect innocent pages such as your example of reinitialized data.
> 
> OK, if you could intercept the IO so it never happens at all, yes
> of course that could work.
> 
> > poisoned dirty page == corrupt data      => process shall be killed
> > poisoned clean page == recoverable data  => process shall survive
> > 
> > In the case of dirty hwpoison page, if we reload the on disk old data
> > and let application proceed with it, it may lead to *silent* data
> > corruption/inconsistency, because the application will first see v2
> > then v1, which is illogical and hence may mess up its internal data
> > structure.
> 
> Right, but how do you prevent that? There is no way to reconstruct the
> most up-to-date data because it was destroyed.

By killing the application ruthlessly, rather than allowing it to go rotten quietly.
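The dirty-vs-clean rule above can be sketched as a tiny decision helper (the names here are illustrative, not the real HWPOISON interface):

```c
#include <stdbool.h>

/* Hypothetical sketch of the kill-vs-survive policy discussed above. */
enum hwpoison_action {
    HWPOISON_KILL,      /* dirty page: the only up-to-date copy is lost */
    HWPOISON_RECOVER,   /* clean page: the disk still holds valid data */
};

static enum hwpoison_action hwpoison_policy(bool page_dirty)
{
    /* A dirty poisoned page means the latest data (v2) is gone; quietly
     * reloading the stale on-disk copy (v1) would feed the application
     * inconsistent state, so killing it is the safer outcome. */
    return page_dirty ? HWPOISON_KILL : HWPOISON_RECOVER;
}
```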

> > > You will need to demonstrate a *big* advantage before doing crazy things
> > > with writeback ;)
> > 
> > OK. We can do two things about poisoned writeback pages:
> > 
> > 1) to stop IO for them, thus avoid corrupted data to hit disk and/or
> >    trigger further machine checks
> 
> 1b) At which point, you invoke the end-io handlers, and the page is
> no longer writeback.
> 
> > 2) to isolate them from page cache, thus preventing possible
> >    references in the writeback time window
> 
> And then this is possible because you aren't violating mm
> assumptions due to 1b. This proceeds just as the existing
> pagecache mce error handler case which exists now.

Yeah, that's a good scheme - we are talking about two interception
schemes: mine is a passive one and yours is an active one.

passive: check hwpoison pages at __generic_make_request()/elv_next_request() 
         (the code will be enabled by an mce_bad_io_pages counter)

active:  iterate all queued requests for hwpoison pages

Each has its merits and complexities.
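As a rough illustration of the passive check, here is a sketch of what the test at submission time might look like; all structures and names below are simplified stand-ins, not the actual kernel bio/elevator API:

```c
#include <stdbool.h>

/* Simplified stand-ins for struct page / struct bio; the real check
 * would sit in __generic_make_request()/elv_next_request(). */
#define PG_HWPOISON (1u << 0)

struct page { unsigned long flags; };

struct bio_sketch {
    struct page *pages[8];
    int nr_pages;
};

/* Count of known-poisoned pages; while it is zero the scan below is
 * skipped entirely, so the common fast path stays cheap. */
static int mce_bad_io_pages;

/* Passive interception at IO submission: refuse any request touching a
 * poisoned page, so corrupt data never hits the disk and the hardware
 * never re-touches the bad physical frame. */
static bool bio_has_hwpoison(const struct bio_sketch *bio)
{
    if (mce_bad_io_pages == 0)
        return false;
    for (int i = 0; i < bio->nr_pages; i++)
        if (bio->pages[i]->flags & PG_HWPOISON)
            return true;
    return false;
}
```

Gating the scan on the counter mirrors the "enabled by an mce_bad_io_pages counter" idea: the per-request loop only runs after a machine check has actually flagged a page.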

I'll list the merits (+) and complexities (-) of the passive approach;
its complexities are, conversely, where the active one wins:

+ works in generic code and doesn't have to touch the deadline/as/cfq elevators
- the wait_on_page_writeback() puzzle because of the writeback time window

+ could also intercept the "cannot de-dirty for now" pages when they
  eventually go to writeback IO
- have to avoid filesystem references to PG_hwpoison pages, e.g.:
  - zeroing partial EOF page when i_size is not page aligned
  - calculating checksums


> > > > Now it's obvious that reusing more code than truncate_complete_page()
> > > > is not easy (or natural).
> > > 
> > > Just lock the page and wait for writeback, then do the truncate
> > > work in another function. In your case if you've already unmapped
> > > the page then it won't try to unmap again so no problem.
> > > 
> > > Truncating from pagecache does not change ->index so you can
> > > move the loop logic out.
> > 
> > Right. So effectively the reusable function is exactly
> > truncate_complete_page(). As I said, this reuse is not a big gain.
> 
> Anyway, we don't have to argue about it. I already sent a patch
> because it was so hard to do, so let's move past this ;)
> 
> 
> > > > Yes it's kind of insane.  I'm interested in reasoning it out though.
> 
> Well with the IO interception (I missed this point), then it seems
> maybe no longer so insane. We could see how it looks.

OK.

Thanks,
Fengguang

