Message-ID: <CAPcxDJ5q8=pwqsNV4ydSPJWp35f886n1TB7dWOx9cst=cb2myA@mail.gmail.com>
Date: Wed, 14 Apr 2021 07:46:49 -0700
From: Jue Wang <juew@...gle.com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, luto@...nel.org,
HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@....com>, "Luck, Tony" <tony.luck@...el.com>,
x86 <x86@...nel.org>, yaoaili@...gsoft.com
Subject: Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user
hits poison
On Wed, Apr 14, 2021 at 6:10 AM Borislav Petkov <bp@...en8.de> wrote:
>
> On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> > This path is when EPT #PF finds accesses to a hwpoisoned page and
> > sends SIGBUS to user space (KVM exits into user space) with the same
> > semantic as if regular #PF found access to a hwpoisoned page.
> >
> > The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.
> >
> > We are in process to launch a product with MCE recovery capability in
> > a KVM based virtualization product and plan to expand the scope of the
> > application of it in the near future.
>
> Any pointers to code or is this all non-public? Any text on what that
> product does with the MCEs?
These are non-public at this point.
User-facing docs and a blog post are expected to be released around the
launch (i.e., 3-4 months from now).
>
> > The in-memory database and analytical domain are definitely using it.
> > A couple examples:
> > SAP HANA - as we've tested and planned to launch as a strategic
> > enterprise use case with MCE recovery capability in our product
> > SQL server - https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors
>
> Aha, so they register callbacks for the processes to exec on a memory
> error. Good to know, thanks for those.
My other 2 cents:
I can see this being useful in other domains as well, e.g., on multi-tenant
cloud servers where many VMs are colocated on the same host. With proper
recovery plus live migration, a single MCE would affect at most a single VM.
Another generic use case may be services that can tolerate an abrupt crash,
i.e., services that periodically save checkpoints to persistent storage, or
that are stateless in nature, and that are managed by a process manager which
automatically restarts them so they resume from where the work left off when
they crashed.
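To make the checkpoint-and-restart pattern a bit more concrete, here is a
minimal Python sketch. It is only an illustration under stated assumptions:
the function names, the JSON state layout, and the `crash_at` knob (which
simulates the abrupt crash) are all hypothetical and not taken from any of the
products mentioned in this thread.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a torn checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to persistent storage
        os.replace(tmp, path)     # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def run(path, items, checkpoint_every=2, crash_at=None):
    """Sum items, resuming from the last checkpoint after a restart.

    crash_at is a hypothetical test hook: the worker dies just before
    processing that index, losing any progress not yet checkpointed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated abrupt crash")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

A process manager would simply call `run()` again after the crash; work done
since the last checkpoint is redone, and everything before it is skipped.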
Thanks,
-Jue
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette