Message-ID: <9d9f1afe-c981-4df9-f012-89c4cb783cc3@amd.com>
Date: Tue, 15 Nov 2022 12:15:52 -0600
From: "Kalra, Ashish" <ashish.kalra@....com>
To: Vlastimil Babka <vbabka@...e.cz>, Borislav Petkov <bp@...en8.de>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
linux-coco@...ts.linux.dev, linux-mm@...ck.org,
linux-crypto@...r.kernel.org, tglx@...utronix.de, mingo@...hat.com,
jroedel@...e.de, thomas.lendacky@....com, hpa@...or.com,
ardb@...nel.org, pbonzini@...hat.com, seanjc@...gle.com,
vkuznets@...hat.com, jmattson@...gle.com, luto@...nel.org,
dave.hansen@...ux.intel.com, slp@...hat.com, pgonda@...gle.com,
peterz@...radead.org, srinivas.pandruvada@...ux.intel.com,
rientjes@...gle.com, dovmurik@...ux.ibm.com, tobin@....com,
michael.roth@....com, kirill@...temov.name, ak@...ux.intel.com,
tony.luck@...el.com, marcorr@...gle.com,
sathyanarayanan.kuppuswamy@...ux.intel.com, alpergun@...gle.com,
dgilbert@...hat.com, jarkko@...nel.org,
"Kaplan, David" <David.Kaplan@....com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Miaohe Lin <linmiaohe@...wei.com>,
Oscar Salvador <osalvador@...e.de>
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR
allocation when SNP is enabled
On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
> Hello Vlastimil,
>
> On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
>> Cc'ing memory failure folks, the beginning of this subthread is here:
>>
>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>
>>
>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>> Hello Boris,
>>>
>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>> if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>> do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>> return RMP_PF_RETRY;
>>>>
>>>> Does this issue some halfway understandable error message why the
>>>> process got killed?
>>>>
>>>>> Will look at adding our own recovery function for the same, but
>>>>> that will
>>>>> again mark the pages as poisoned, right ?
>>>>
>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree
>>>> upon.
>>>> Semantically, it'll be handled the same way, ofc.
>>>
>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>
>> One thing is, there's not enough page flags to be adding more (except
>> aliases for existing) for cases that can avoid it, but as Boris says, if
>> using alias to PG_hwpoison it depends what will become confused with the
>> actual hwpoison.
>>
>>> But there is still added complexity of handling hugepages as part of
>>> reclamation failures (both HugeTLB and transparent hugepages) and that
>>> means calling more static functions in mm/memory-failure.c
>>>
>>> There is probably a more appropriate handler in mm/memory-failure.c:
>>>
>>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>>> handling for hugepages. And we can avoid adding a new page flag too.
>>>
>>> soft_offline_page - Soft offline a page.
>>> Soft offline a page, by migration or invalidation, without killing
>>> anything.
>>>
>>> So, this looks like a good option to call
>>> soft_offline_page() instead of memory_failure() in case of
>>> failure to transition the page back to HV/shared state via
>>> SNP_RECLAIM_CMD
>>> and/or RMPUPDATE instruction.
>>
>> So it's a bit unclear to me what exact situation we are handling here.
>> The
>> original patch here seems to me to be just leaking back pages that are
>> unsafe for further use. soft_offline_page() seems to fit that scenario
>> of a
>> graceful leak before something is irreparably corrupt and we page
>> fault on it.
>> But then in the thread you discuss PF handling and killing. So what is the
>> case here? If we detect this need to call snp_leak_pages() does it mean:
>>
>> a) nobody that could page fault at them (the guest?) is running
>> anymore, we
>> are tearing it down, we just can't reuse the pages further on the host
>
> The host can page fault on them, if anything on the host tries to write
> to these pages. Host reads will return garbage data.
>
>> - seems like soft_offline_page() could work, but maybe we could just
>> put the
>> pages on some leaked lists without special page? The only thing that
>> should
>> matter is not to free the pages to the page allocator so they would be
>> reused by something else.
>>
>> b) something can still page fault at them (what?) - AFAIU can't be
>> resolved
>> without killing something, memory_failure() might limit the damage
>
> As I mentioned above, host writes will cause RMP violation page fault.
>
And to add here, if it's a guest private page, then the above fault
cannot be resolved, so the faulting process is terminated.
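
For illustration only, option (a) above (keep pages whose reclaim failed
away from the allocator instead of freeing them) can be modeled in plain
userspace C. This is a rough sketch and not kernel code: struct page_pool,
pool_alloc(), pool_release() and the reclaim callbacks are all hypothetical
names, and the callback merely stands in for the result of SNP_RECLAIM_CMD
and/or RMPUPDATE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical page states for this toy model. */
enum page_state { PAGE_FREE, PAGE_IN_USE, PAGE_LEAKED };

struct page_pool {
	enum page_state state[NPAGES];
	size_t nr_leaked;	/* pages that will never be handed out again */
};

/* Stand-in for "did the transition back to shared state succeed?" */
typedef bool (*reclaim_fn)(size_t pfn);

bool reclaim_ok(size_t pfn)    { (void)pfn; return true; }
bool reclaim_fails(size_t pfn) { (void)pfn; return false; }

void pool_init(struct page_pool *pool)
{
	assert(pool != NULL);
	for (size_t i = 0; i < NPAGES; i++)
		pool->state[i] = PAGE_FREE;
	pool->nr_leaked = 0;
}

/* Return the lowest free page index, or -1 if none is available. */
long pool_alloc(struct page_pool *pool)
{
	for (size_t i = 0; i < NPAGES; i++) {
		if (pool->state[i] == PAGE_FREE) {
			pool->state[i] = PAGE_IN_USE;
			return (long)i;
		}
	}
	return -1;
}

/*
 * Release a page. If reclaim fails, the page is marked leaked rather
 * than freed, so pool_alloc() can never return it again.
 */
void pool_release(struct page_pool *pool, size_t pfn, reclaim_fn reclaim)
{
	assert(pfn < NPAGES && pool->state[pfn] == PAGE_IN_USE);
	if (reclaim(pfn)) {
		pool->state[pfn] = PAGE_FREE;
	} else {
		pool->state[pfn] = PAGE_LEAKED;
		pool->nr_leaked++;
	}
}
```

The only property the model tries to capture is that a leaked page is
never reused by anything else; it says nothing about hugepage handling or
about what happens if something still touches the page.
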
Thanks,
Ashish
>
>>>
>>>>
>>>>> Still waiting for some/more feedback from mm folks on the same.
>>>>
>>>> Just send the patch and they'll give it.
>>>>
>>>> Thx.
>>>>
>>