Message-ID: <20090812074611.GC28848@basil.fritz.box>
Date: Wed, 12 Aug 2009 09:46:11 +0200
From: Andi Kleen <andi@...stfloor.org>
To: Hidehiro Kawai <hidehiro.kawai.ez@...achi.com>
Cc: Andi Kleen <andi@...stfloor.org>, tytso@....edu, hch@...radead.org,
mfasheh@...e.com, aia21@...tab.net, hugh.dickins@...cali.co.uk,
swhiteho@...hat.com, akpm@...ux-foundation.org, npiggin@...e.de,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
fengguang.wu@...el.com,
Satoshi OSHIMA <satoshi.oshima.fk@...achi.com>,
Taketoshi Sakuraba <taketoshi.sakuraba.hc@...achi.com>
Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for
migration aware file systems
On Wed, Aug 12, 2009 at 11:49:56AM +0900, Hidehiro Kawai wrote:
> > I don't think there's much we can do if the application doesn't
> > check for IO errors properly. What would you do if it doesn't
> > check for IO errors at all? If it checks for IO errors it simply
> > has to check for them on all IO operations -- if they do
> > they will detect hwpoison errors correctly too.
>
> I believe it's not uncommon for applications to do buffered write
> and then exit without fsync(). And I think it's difficult to
> preclude such applications and commands from the system perfectly.
That's true, but for anything mission critical you would expect them
to use some transactional mechanism, either with O_SYNC or fsync().
Otherwise they always risk data loss anyways.
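
To make the point above concrete, here is a minimal userspace sketch of a
write path that checks every IO step. write_durably() is a hypothetical
helper name, not an existing API, and the sketch assumes, as stated earlier
in this thread, that an application checking all of its IO operations would
also see hwpoison-induced errors through the same return values.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * write_durably() is a hypothetical helper, not part of any library.
 * It returns 0 only if every byte was written and both fsync() and
 * close() succeeded; otherwise it returns -1 with errno set.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the write */
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * A buffered write() can succeed while the data only sits in the
     * page cache. Errors hit later, during writeback, are typically
     * reported by fsync(), so its return value must be checked.
     */
    if (fsync(fd) < 0)
        goto fail;

    return close(fd);

fail:
    saved = errno;              /* keep the original error code */
    close(fd);
    errno = saved;
    return -1;
}
```

An application that exits after write() without the fsync() step never
reaches the point where such an error would be reported to it.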
> > It's unclear to me this special mode is really desirable.
> > Does it bring enough value to the user to justify the complexity
> > of another exotic option? The case is relatively exotic,
> > as in dirty write cache that is mapped to a file.
> >
> > Try to explain it in documentation and you see how ridiculous it sounds;
> > it simply doesn't have clean semantics.
> >
> > ("In case you have applications with broken error IO handling on
> > your mission critical system ...")
>
> Generally, dropping unwritten dirty page caches is considered to be
> risky. So the "panic on IO error" policy has been common practice
> on some systems. I just suggested that we adopt this policy for
> machine check errors.
Hmm, what we could possibly do -- as follow-on patches -- would be to
let error_remove_page check the per file system panic-on-io-error
super block setting for dirty pages and panic in this case too.
Unfortunately this setting is currently per file system, not generic,
so it would need to be a fs specific check (or the flag would need
to be moved into a generic fs superblock field first).
I think that would be relatively clean semantics-wise. Would you be
interested in working on patches for that?
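
As a rough illustration of the semantics proposed above, the decision could
be modelled in userspace like this. Everything here is a hypothetical
stand-in: the types, enum values and function name are not kernel
definitions, and the "drop the page and flag an IO error" fallback for dirty
pages is an assumption about the default behaviour, not something taken from
the patch itself.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
enum fs_error_policy {
    FS_ERRORS_CONTINUE,
    FS_ERRORS_REMOUNT_RO,
    FS_ERRORS_PANIC         /* models a per-fs panic-on-io-error setting */
};

struct model_sb {
    enum fs_error_policy on_error;
};

struct model_page {
    bool dirty;
    const struct model_sb *sb;
};

enum poison_action {
    POISON_DROP_PAGE,           /* clean page: data is still on disk */
    POISON_DROP_AND_FLAG_EIO,   /* dirty page: data lost, report an error */
    POISON_PANIC                /* dirty page on a panic-on-error fs */
};

/* Decide what to do with a page cache page hit by a memory error. */
enum poison_action poisoned_page_action(const struct model_page *page)
{
    if (!page->dirty)
        return POISON_DROP_PAGE;

    /* Only dirty pages lose data, so only they consult the fs policy. */
    if (page->sb->on_error == FS_ERRORS_PANIC)
        return POISON_PANIC;

    return POISON_DROP_AND_FLAG_EIO;
}
```

In a real implementation the policy lookup would have to be fs specific
until the flag lives in a generic superblock field, as noted above.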
> Another option is to introduce an "ignore all" policy instead of
> panicking at the beginning of memory_failure(). Perhaps it eventually
> causes an SRAR machine check, and then the kernel will panic or a
> process will be killed. Anyway, this is a topic for the next stage.
The problem is memory_failure() would then need to start distinguishing
between AR=1 and AR=0 which it doesn't today.
It could be done, but would need some more work.
> > If you want to have improved IO error handling feel free to
> > submit it separately. I agree this area could use some work.
> > But it probably needs more design work first.
>
> Well, this patch set itself looks good to me.
> I also looked into the other patches and couldn't find any
> problems (although I'm not a good judge when it comes to reviewing).
>
> Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@...achi.com>
Thanks for your review and your comments.
-Andi
--
ak@...ux.intel.com -- Speaking for myself only.