Date:   Mon, 6 Jun 2022 15:20:27 +0800
From:   zhenwei pi <pizhenwei@...edance.com>
To:     HORIGUCHI NAOYA(堀口 直也) 
        <naoya.horiguchi@....com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Tony Luck <tony.luck@...el.com>,
        Wu Fengguang <fengguang.wu@...el.com>
Subject: Re: Re: Re: [PATCH] mm/memory-failure: don't allow to unpoison hw
 corrupted page



On 6/6/22 12:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Jun 05, 2022 at 12:24:24PM +0800, zhenwei pi wrote:
>>
>>
>> On 6/5/22 02:56, Andrew Morton wrote:
>>> On Sat,  4 Jun 2022 18:32:29 +0800 zhenwei pi <pizhenwei@...edance.com> wrote:
>>>
>>>> Currently unpoison_memory(unsigned long pfn) is designed for soft
>>>> poison (hwpoison-inject) only. Unpoisoning a hardware corrupted page
>>>> only puts the page back into the buddy allocator, which leads to a BUG
>>>> when the page is later accessed through the corrupted KPTE.
> 
> Thank you for the patch. I think this will be helpful for integration testing.
> 
> You mention "hardware corrupted page" as the condition of this bug, and I
> think that it means a real hardware error, but this BUG seems to be
> triggered when we use mce-inject or APEI (these are also software injection
> without corrupting the memory physically). So the actual condition is
> "when memory_failure() is called by MCE handler"?
> 

Yes, I use QEMU to emulate a 'real hardware error' with the following command:
virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61234000 0x8c

>>>>
>>>> Do not allow unpoisoning a hardware corrupted page in unpoison_memory(),
>>>> to avoid a BUG like this:
>>>>
>>>>    Unpoison: Software-unpoisoned page 0x61234
>>>>    BUG: unable to handle page fault for address: ffff888061234000
>>>
>>> Thanks.
>>>
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -2090,6 +2090,7 @@ int unpoison_memory(unsigned long pfn)
>>>>    {
>>>>    	struct page *page;
>>>>    	struct page *p;
>>>> +	pte_t *kpte;
>>>>    	int ret = -EBUSY;
>>>>    	int freeit = 0;
>>>>    	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
>>>> @@ -2101,6 +2102,13 @@ int unpoison_memory(unsigned long pfn)
>>>>    	p = pfn_to_page(pfn);
>>>>    	page = compound_head(p);
>>>> +	kpte = virt_to_kpte((unsigned long)page_to_virt(p));
>>>> +	if (kpte && !pte_present(*kpte)) {
>>>> +		unpoison_pr_info("Unpoison: Page was hardware poisoned %#lx\n",
>>>> +				 pfn, &unpoison_rs);
> 
> This can prevent unpoisoning of hwpoisoned 4kB pages, but not of hugetlb pages,
> where I see a similar BUG as follows (even with your patch applied):
> 
>    [  917.806712] BUG: unable to handle page fault for address: ffff9f7bb3201000
>    [  917.810144] #PF: supervisor write access in kernel mode
>    [  917.812588] #PF: error_code(0x0002) - not-present page
>    [  917.815007] PGD 104801067 P4D 104801067 PUD 10006b063 PMD 1052d0063 PTE 800ffffeccdfe062
>    [  917.818768] Oops: 0002 [#1] PREEMPT SMP PTI
>    [  917.820759] CPU: 0 PID: 7774 Comm: test_alloc_gene Tainted: G   M       OE     5.18.0-v5.18-220606-0942-029-ge4dcc+ #47
>    [  917.825720] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
>    [  917.829762] RIP: 0010:clear_page_erms+0x7/0x10
>    [  917.831867] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc cc cc 48 85 ff 0f 84 d3 00 00 00 0f b6 0f 4c
>    [  917.840540] RSP: 0000:ffffab49c25ebdf0 EFLAGS: 00010246
>    [  917.842839] RAX: 0000000000000000 RBX: ffffd538c4cc8000 RCX: 0000000000001000
>    [  917.845835] RDX: 0000000080000000 RSI: 00007f2aeb600000 RDI: ffff9f7bb3201000
>    [  917.848687] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>    [  917.851377] R10: 0000000000000002 R11: ffff9f7b87e3a2a0 R12: 0000000000000000
>    [  917.854035] R13: 0000000000000001 R14: ffffd538c4cc8000 R15: ffff9f7bc002a5d8
>    [  917.856539] FS:  00007f2aebad3740(0000) GS:ffff9f7bbbc00000(0000) knlGS:0000000000000000
>    [  917.859229] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [  917.861149] CR2: ffff9f7bb3201000 CR3: 0000000107726003 CR4: 0000000000170ef0
>    [  917.863433] Call Trace:
>    [  917.864266]  <TASK>
>    [  917.864961]  clear_huge_page+0x147/0x270
>    [  917.866236]  hugetlb_fault+0x440/0xad0
>    [  917.867366]  handle_mm_fault+0x270/0x290
>    [  917.868532]  do_user_addr_fault+0x1c3/0x680
>    [  917.869768]  exc_page_fault+0x6c/0x160
>    [  917.870912]  ? asm_exc_page_fault+0x8/0x30
>    [  917.872082]  asm_exc_page_fault+0x1e/0x30
>    [  917.873220] RIP: 0033:0x7f2aeb8ba367
> 
> I can't think of a workaround for this now ...
> 

Could you please tell me how to reproduce this issue?

>>>> +		return -EPERM;
> 
> Is -EOPNOTSUPP a better error code?
> 

OK!

>>>> +	}
>>>> +
>>>>    	mutex_lock(&mf_mutex);
>>>>    	if (!PageHWPoison(p)) {
>>>
>>> I guess we don't want to let fault injection crash the kernel, so a
>>> cc:stable seems appropriate here.
>>>
>>> Can we think up a suitable Fixes: commit?  I'm suspecting this bug has
>>> been there for a long time?
>>>
>>
>> Sure!
>>
>> On 2009-Dec-16, hwpoison_unpoison() was introduced into Linux in commit:
>> 847ce401df392("HWPOISON: Add unpoisoning support")
>> ...
>> There is no hardware level unpoisioning, so this cannot be used for real
>> memory errors, only for software injected errors.
>> ...
>>
>> Both the commit log and the comment in the source code say that this
>> function should be used for software-level unpoisoning only. Unfortunately,
>> there is no such check in the function hwpoison_unpoison().
>>
>>
>> 2020-May-20, 17fae1294ad9d("x86/{mce,mm}: Unmap the entire page if the whole
>> page is affected and poisoned")
>>
>> This clears the KPTE, and leads to the BUG (described in this patch) when
>> unpoisoning the hardware corrupted page.
>>
>>
>> Fixes: 847ce401df392("HWPOISON: Add unpoisoning support")
>> Fixes: 17fae1294ad9d("x86/{mce,mm}: Unmap the entire page if the whole page
>> is affected and poisoned")
>>
>> Cc: Wu Fengguang <fengguang.wu@...el.com>
>> Cc: Tony Luck <tony.luck@...el.com>
> 
> Thanks for checking the history, I agree with sending to stable.
> 
> Thanks,
> Naoya Horiguchi

-- 
zhenwei pi
