linux-kernel - Re: [PATCH] [16/19] HWPOISON: Enable .remove_error

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20090812080540.GA32342@wotan.suse.de>
Date:	Wed, 12 Aug 2009 10:05:40 +0200
From:	Nick Piggin <npiggin@...e.de>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	Hidehiro Kawai <hidehiro.kawai.ez@...achi.com>, tytso@....edu,
	hch@...radead.org, mfasheh@...e.com, aia21@...tab.net,
	hugh.dickins@...cali.co.uk, swhiteho@...hat.com,
	akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, fengguang.wu@...el.com,
	Satoshi OSHIMA <satoshi.oshima.fk@...achi.com>,
	Taketoshi Sakuraba <taketoshi.sakuraba.hc@...achi.com>
Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems

On Tue, Aug 11, 2009 at 09:17:56AM +0200, Andi Kleen wrote:
> On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote:
> > > And application
> > > that doesn't handle current IO errors correctly will also
> > > not necessarily handle hwpoison correctly (it's not better and not worse)
> > 
> > This is my main concern.  I'd like to prevent re-corruption even if
> > applications don't have good manners.
> 
> I don't think there's much we can do if the application doesn't
> check for IO errors properly. What would you do if it doesn't
> check for IO errors at all? If it checks for IO errors it simply
> has to check for them on all IO operations -- if they do 
> they will detect hwpoison errors correctly too.

But will quite possibly do the wrong thing: ie. try to re-sync the
same page again, or try to write the page to a new location, etc.

This is the whole problem with -EIO semantics I brought up.

> > That is why I suggested this:
> > >>(2) merge this patch with new panic_on_dirty_page_cache_corruption
> 
> You probably mean panic_on_non_anonymous_dirty_page_cache
> Normally anonymous memory is dirty.
> 
> > >>    sysctl
> 
> It's unclear to me this special mode is really desirable.
> Does it bring enough value to the user to justify the complexity
> of another exotic option?  The case is relatively exotic,
> as in dirty write cache that is mapped to a file.
> 
> Try to explain it in documentation and you see how ridiculous it sounds; u
> it simply doesn't have clean semantics
> 
> ("In case you have applications with broken error IO handling on
> your mission critical system ...") 

Not broken error handling. It is very simple: if the application is
assuming EIO is an error with dirty data being sent to disk, rather
than an error with the data itself (which I think may be a common
assumption). Then it could have a problem.

If a database for example tries to write the data to another location
in response to EIO and then record it in a list of failed IOs before
halting the database. Then if it restarts it might try to again try
writing out these failed IOs (eg. give the administrator a chance to
fix IO devices). Completely made up scenario but it is not outlandish
and it would cause bad data corruption.

A mission critical server will *definitely* want to panic on dirty
page corruption, IMO, because by definition they should be able to
tolerate panic. But if they do not know about this change to -EIO
semantics, then it is quite possible to cause problems.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/