linux-kernel - Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4e13bef2-7402-4f75-8f0c-4a3cc210c5a6@linux.alibaba.com>
Date: Fri, 21 Feb 2025 14:05:28 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: "Luck, Tony" <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>
Cc: "nao.horiguchi@...il.com" <nao.horiguchi@...il.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "mingo@...hat.com" <mingo@...hat.com>,
 "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
 "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
 "linmiaohe@...wei.com" <linmiaohe@...wei.com>,
 "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
 "peterz@...radead.org" <peterz@...radead.org>,
 "jpoimboe@...nel.org" <jpoimboe@...nel.org>,
 "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "baolin.wang@...ux.alibaba.com" <baolin.wang@...ux.alibaba.com>,
 "tianruidong@...ux.alibaba.com" <tianruidong@...ux.alibaba.com>
Subject: Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure
 handling



在 2025/2/21 01:50, Luck, Tony 写道:
>>> We could, but I don't like it much. By taking the page offline from the relatively
>>> kind environment of a regular interrupt, we often avoid taking a machine check
>>> (which is an unfriendly environment for software).
>>
>> Right.
>>
>>> We could make the action in uc_decode_notifier() configurable. Default=off
>>> but with a command line option to enable for systems that are stuck with
>>> broadcast machine checks.
>>
>> So we can figure that out during boot - no need for yet another cmdline
>> option.
> 
> Yup. I think the boot time test might be something like:
> 
> 	// Enable UCNA offline for systems with broadcast machine check
> 	if (!(AMD || LMCE))
> 		mce_register_decode_chain(&mce_uc_nb);
>>
>> It still doesn't fix the race and I'd like to fix that instead, in the optimal
>> case.
>>
>> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
>> need to fix the commit message to explain why this thing even happens.
>>
>> I.e., basically what you wrote and Shuai could use that explanation to write
>> a commit message explaining what the situation is along with the background so
>> that when we go back to this later, we will actually know what is going on.
> 
> Agreed. Shaui needs to harvest this thread to fill out the details in the commit
> messages.

Sure, I'd like to add more backgroud details with Tony's explanation.

> 
>>
>> But looking at
>>
>>    046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>
>> That thing was trying to fix the same reporting fail. Why didn't it do that?
>>
>> Ooooh, now I see what the issue is. He doesn't want to kill the process which
>> gets the wrong SIGBUS. Maybe the commit title should've said that:
>>
>>    mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>>
>> or so.
>>
>> But how/why is that ok?
>>
>> Are we confident that
>>
>> +        * ret = 0 when poison page is a clean page and it's dropped, no
>> +        * SIGBUS is needed.
>>
>> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
>> CMCI has won the race?
> 
> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in
> a shared library).
> 
>> Can memory poison return 0 there too, for another reason and we end up *not
>> killing* a process which we should have?
>>
>> Hmmm.
> 
> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).
> 

Emmm, if two threaded processes consume a poisond data, there may three CPUs
race, two of which take local #MC on the same page and one take CMCI. For,
example:

#perf script
kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

It seems to complicate the issue further.

IMHO, we should focus on three main points:

- kill_accessing_process() is only called when the flags are set to
   MF_ACTION_REQUIRED, which means it is in the MCE path.
- Whether the page is clean determines the behavior of try_to_unmap. For a
   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
   and simply unmaps the PTE.
- When does walk_page_range() with hwpoison_walk_ops return 1?
   1. If the poison page still exists, we should of course kill the current
      process.
   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
      it is a dirty page, we should also kill the current process, too.
   3. Otherwise, it returns 0, which means the page is clean.


Thanks.
Shuai