linux-kernel - Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable


Date:	Fri, 12 Apr 2013 22:38:43 +0900
From:	Mitsuhiro Tanino <mitsuhiro.tanino.gm@...achi.com>
To:	Andi Kleen <andi@...stfloor.org>,
	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
Cc:	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at
 memory error on dirty cache selectable

(2013/04/12 3:10), Andi Kleen wrote:
> On Thu, Apr 11, 2013 at 11:23:08AM -0400, Naoya Horiguchi wrote:
>> On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
>>>> As a result, if the dirty cache includes user data, the data is lost,
>>>> and data corruption occurs if an application uses old data.
>>>
>>> The application cannot use old data, the kernel code kills it if it
>>> would do that. And if it's IO data there is an EIO triggered.
>>>
>>> iirc the only concern in the past was that the application may miss
>>> the asynchronous EIO because it's cleared on any fd access. 
>>>
>>> This is a general problem not specific to memory error handling, 
>>> as these asynchronous IO errors can happen due to other reason
>>> (bad disk etc.) 
>>>
>>> If you're really concerned about this case I think the solution
>>> is to make the EIO more sticky so that there is a higher chance
>>> than it gets returned.  This will make your data much more safe,
>>> as it will cover all kinds of IO errors, not just the obscure memory
>>> errors.

I agree with Andi. We need to handle both memory errors and asynchronous
I/O errors.
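
To make the application side of this concrete, here is a minimal user-space
sketch (not kernel code) of a writer that checks every step at which an
asynchronous EIO could be reported. The helper name write_durably is
hypothetical; only the standard os.open, os.write, os.fsync and os.close
calls are used. An application that skips the fsync or close check is the
kind of "does not check everything" program discussed in this thread.

```python
import errno
import os


def write_durably(path, data):
    """Write data to path and flush it; return False if EIO was reported.

    Hypothetical helper for illustration only. Any error other than EIO
    is re-raised to the caller unchanged.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    ok = True
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # A deferred writeback error is typically reported here.
        os.fsync(fd)
    except OSError as exc:
        if exc.errno != errno.EIO:
            raise
        ok = False
    finally:
        try:
            # close() can also return an I/O error, so check it as well.
            os.close(fd)
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise
            ok = False
    return ok
```

A caller would treat a False return as "the data may not be on disk" and
retry or abort rather than continue with stale data.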

>> I'm interested in this topic, and in previous discussion, what I was said
>> is that we can't expect user applications to change their behaviors when
>> they get EIO, so globally changing EIO's stickiness is not a great approach.
> 
> Not sure. Some of the current behavior may be dubious and it may 
> be possible to change it. But would need more analysis.
> 
> I don't think we're concerned that much about "correct" applications,
> but more ones that do not check everything. So returning more
> errors should be safer.
> 
> For example you could have a sysctl that enables always stick
> IO error -- that keeps erroring until it is closed.
> 
>> I'm working on a new pagecache tag based mechanism to solve this.
>> But it needs time and more discussions.
>> So I guess Tanino-san suggests giving up on dirty pagecache errors
>> as a quick solution.
> 
> A quick solution would be enabling panic for any asynchronous IO error.
> I don't think the memory error code is the right point to hook into.

Yes. I think both a short-term and a long-term solution are necessary
in order to enable the hwpoison feature for Linux as a KVM hypervisor.

So my proposal is as follows:
  Short-term solution, to handle both memory errors and I/O errors:
    - I will resend a panic knob to handle data loss in the dirty cache
      caused by a memory error or an I/O error.

  Long-term solution:
    - Andi's proposal or Horiguchi-san's new pagecache-tag-based mechanism.
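
As a rough illustration of the difference between the behavior described in
this thread (the error is cleared once it has been reported on any fd
access) and the "sticky" behavior Andi suggests (the error keeps being
returned until the file is closed), here is a toy model. It is not kernel
code; the class and function names are hypothetical and exist only for this
sketch.

```python
class WritebackErrorState:
    """Toy model of a per-file asynchronous I/O error flag (hypothetical)."""

    def __init__(self, sticky):
        self.sticky = sticky    # True: keep reporting the error until close
        self.pending = False    # True once a writeback failure was recorded

    def record_failure(self):
        # Models a dirty page being lost to a memory error or a bad disk.
        self.pending = True

    def check(self):
        """Return True if this access is told about the error."""
        reported = self.pending
        if not self.sticky:
            # Non-sticky policy: the first report clears the flag.
            self.pending = False
        return reported

    def close(self):
        # Closing the file drops the error state under either policy.
        self.pending = False


def application_sees_error(sticky):
    """An unrelated access happens first; does the application's check still
    see the error afterwards?"""
    state = WritebackErrorState(sticky)
    state.record_failure()
    state.check()           # some other access consumes the report
    return state.check()    # the application's own check
```

In this model application_sees_error(sticky=False) returns False (the error
is missed), while application_sees_error(sticky=True) returns True.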

Regards,
Mitsuhiro Tanino
