linux-kernel - Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

Message-ID: <f316ca3b-6f09-c51d-9661-66171f14ee33@redhat.com>
Date:   Fri, 26 Mar 2021 15:22:49 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Aili Yao <yaoaili@...gsoft.com>,
        Matthew Wilcox <willy@...radead.org>,
        akpm@...ux-foundation.org, naoya.horiguchi@....com
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        yangfeng1@...gsoft.com, sunhao2@...gsoft.com,
        Oscar Salvador <osalvador@...e.de>,
        Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On 26.03.21 15:09, David Hildenbrand wrote:
> On 22.03.21 12:33, Aili Yao wrote:
>> When we dump core for a user process killed by a signal, that signal may be
>> a SIGBUS with code BUS_MCEERR_AR or BUS_MCEERR_AO, which means it resulted
>> from an ECC memory failure such as SRAR or SRAO. We expect the memory
>> recovery work to have finished correctly, in which case get_dump_page() will
>> not return the error page, because memory_failure() has set the process's
>> pte invalid.
>>
>> But memory_failure() may fail, and the process's related pte may not be
>> correctly set invalid. With the current code, we then return the poisoned
>> page and dump it, which leads to a system panic as this happens in kernel code.
>>
>> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
>>
>> There may be other scenarios where it is also better to check the hwpoison
>> status and not panic, so make a wrapper for this check. Thanks to David
>> (<david@...hat.com>) for the suggestion.
>>
>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
>> Signed-off-by: Aili Yao <yaoaili@...gsoft.com>
>> Cc: David Hildenbrand <david@...hat.com>
>> Cc: Matthew Wilcox <willy@...radead.org>
>> Cc: Naoya Horiguchi <naoya.horiguchi@....com>
>> Cc: Oscar Salvador <osalvador@...e.de>
>> Cc: Mike Kravetz <mike.kravetz@...cle.com>
>> Cc: Aili Yao <yaoaili@...gsoft.com>
>> Cc: stable@...r.kernel.org
>> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
>> ---
>>    mm/gup.c      |  4 ++++
>>    mm/internal.h | 20 ++++++++++++++++++++
>>    2 files changed, 24 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index e4c224c..6f7e1aa 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
>>    				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>>    	if (locked)
>>    		mmap_read_unlock(mm);
> 
> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> when stumbling over a hwpoisoned page?
> 
> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> 
> handle_mm_fault()->vm_fault_to_errno(), which translates
> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> 
> ?

Or does that not happen in the case you describe ("But memory_failure() may
fail, and the process's related pte may not be correctly set invalid")? If
so, why does that happen?
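
To make the translation quoted above concrete, here is a rough user-space
model of it. This is only a sketch: the MODEL_* names and their numeric
values are invented for illustration and are not the kernel's definitions,
and the function only mirrors the behaviour described above (hwpoison fault
becomes -EFAULT unless the caller passed FOLL_HWPOISON). Note that it only
applies when the fault path is actually taken, i.e. when the PTE is not
present.

```c
#include <errno.h>

/* Stand-in flag values for this model only; NOT the kernel's values. */
#define MODEL_VM_FAULT_OOM       0x1u
#define MODEL_VM_FAULT_SIGBUS    0x2u
#define MODEL_VM_FAULT_HWPOISON  0x4u
#define MODEL_FOLL_HWPOISON      0x100u

/* EHWPOISON is Linux-specific; fall back to the asm-generic value. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/*
 * Model of the fault-code-to-errno translation described above:
 * a hwpoison fault is reported as -EFAULT, unless the caller asked
 * for hwpoison reporting, in which case it is -EHWPOISON.
 */
int model_vm_fault_to_errno(unsigned int vm_fault, unsigned int foll_flags)
{
	if (vm_fault & MODEL_VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & MODEL_VM_FAULT_HWPOISON)
		return (foll_flags & MODEL_FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & MODEL_VM_FAULT_SIGBUS)
		return -EFAULT;
	return 0;	/* no error bits set */
}
```

If this path were reached for the poisoned page, get_dump_page() would see
an error rather than a page, which is why the question is whether the fault
is triggered at all.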

On a related note, shouldn't get_user_pages() never return a page that has
HWPoison set? E.g., also check, for already-present PTEs, whether the page
is hwpoisoned?
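
As a thought experiment, the kind of check meant here could be modelled in
user space like this. Everything in it is hypothetical: struct model_page
and the model_* helpers are stand-ins I made up, not kernel types or
functions, and the real code would have to deal with locking, huge pages
and so on. It only shows the idea of refusing a page that was found through
a still-present PTE but is marked poisoned.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the state this sketch needs. */
struct model_page {
	bool hwpoison;	/* models the HWPoison page flag */
	int refcount;	/* models the page reference count */
};

/*
 * Models a lookup through an already-present PTE: no fault is taken,
 * so no hwpoison fault code is ever generated. A reference is taken,
 * as a FOLL_GET lookup would do.
 */
static struct model_page *model_follow_present_pte(struct model_page *mapped)
{
	mapped->refcount++;
	return mapped;
}

/* Models a dump-page lookup with the extra hwpoison check added. */
struct model_page *model_get_dump_page(struct model_page *mapped)
{
	struct model_page *page = model_follow_present_pte(mapped);

	if (page->hwpoison) {
		page->refcount--;	/* drop the reference taken above */
		return NULL;		/* caller treats this as "no page" */
	}
	return page;
}
```

With such a check the caller never touches the poisoned page's contents,
even when memory_failure() did not manage to invalidate the PTE.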

@Naoya, Oscar

-- 
Thanks,

David / dhildenb