linux-kernel - Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery


Message-ID: <CAPcxDJ6bB7GEhTq9fkHuT4chRTUk_s-crci=nh+COCwAzMP8Yw@mail.gmail.com>
Date:   Thu, 22 Jul 2021 16:30:44 -0700
From:   Jue Wang <juew@...gle.com>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     Borislav Petkov <bp@...en8.de>, dinghui@...gfor.com.cn,
        huangcun@...gfor.com.cn, linux-edac@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        HORIGUCHI NAOYA(堀口 直也) 
        <naoya.horiguchi@....com>, Oscar Salvador <osalvador@...e.de>,
        x86 <x86@...nel.org>, "Song, Youquan" <youquan.song@...el.com>
Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

I think the challenge is that uncorrectable errors are essentially random.
It's just a matter of time before more than one UC error shows up in
sequential kernel accesses.

It's easy to create such cases with artificial error injections.

I suspect we want to design this part of the kernel to be able to handle generic
cases?

Thanks,
-Jue

On Thu, Jul 22, 2021 at 8:19 AM Luck, Tony <tony.luck@...el.com> wrote:
>
> On Thu, Jul 22, 2021 at 06:54:37AM -0700, Jue Wang wrote:
> > This patch assumes the UC error consumed in kernel is always the same UC.
> >
> > Yet it's possible two UCs on different pages are consumed in a row.
> > The patch below will panic on the 2nd MCE. How can we make the code work
> > on multiple UC errors?
> >
> >
> > > +	int count = ++current->mce_count;
> > > +
> > > +	/* First call, save all the details */
> > > +	if (count == 1) {
> > > +		current->mce_addr = m->addr;
> > > +		current->mce_kflags = m->kflags;
> > > +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > > +		current->mce_whole_page = whole_page(m);
> > > +		current->mce_kill_me.func = func;
> > > +	}
> > > ......
> > > +	/* Second or later call, make sure page address matches the one from first call */
> > > +	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > > +		mce_panic("Machine checks to different user pages", m, msg);
>
> The issue is getting the information about the location
> of the error from the machine check handler to the "task_work"
> function that processes it. Currently there is a single place
> to store the address of the error in the task structure:
>
>         current->mce_addr = m->addr;
>
> Plausibly that could be made into an array, indexed by
> current->mce_count to save multiple addresses (perhaps
> also need mce_kflags, mce_ripv, etc. to also be arrays).
>
> But I don't want to pre-emptively make such a change without
> some data to show that situations arise with multiple errors
> to different addresses:
> 1) Actually occur
> 2) Would be recovered if we made the change.
>
> The first would be indicated by seeing the:
>
>         "Machine checks to different user pages"
>
> panic. You'd have to code up the change to have arrays
> to confirm that would fix the problem.
>
> -Tony
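
For illustration only, the array idea suggested above might look roughly like
the userspace sketch below. It is not kernel code and not part of the posted
patch: struct task_mce_state, mce_record(), the nr_pages field and the
MCE_MAX_PENDING cap are all hypothetical names, and PAGE_SHIFT is hard-coded to
12 on the assumption of 4 KiB pages. The sketch also assumes that a repeated
machine check on an already-recorded page should reuse its slot, so that only
errors on distinct pages consume array entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT      12  /* assumes 4 KiB pages */
#define MCE_MAX_PENDING 4   /* hypothetical cap, not from the patch */

/* Hypothetical stand-in for the per-task fields discussed in the thread. */
struct task_mce_state {
	int      mce_count;                  /* machine checks seen so far */
	int      nr_pages;                   /* distinct pages recorded */
	uint64_t mce_addr[MCE_MAX_PENDING];  /* one saved address per page */
};

/*
 * Record the address of one machine check.
 *
 * Returns true if the address was saved, or if its page was already
 * recorded. Returns false when every slot is taken by a different page;
 * in this model that is the point where a real handler would have no
 * option left but to panic.
 */
static bool mce_record(struct task_mce_state *t, uint64_t addr)
{
	assert(t != NULL);

	t->mce_count++;

	/* A repeat hit on a known page needs no new slot. */
	for (int i = 0; i < t->nr_pages; i++) {
		if ((t->mce_addr[i] >> PAGE_SHIFT) == (addr >> PAGE_SHIFT))
			return true;
	}

	if (t->nr_pages == MCE_MAX_PENDING)
		return false;

	t->mce_addr[t->nr_pages++] = addr;
	return true;
}
```

In this model a second error on a different page takes a new slot instead of
hitting the "Machine checks to different user pages" panic; whether that
actually leads to recovery in a real kernel is exactly the open question raised
in the reply above.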