linux-kernel - Re: [patch 01/10] x86/fpu/signal: Clarify exception handling in restore_fpregs_from

Message-ID: <CAHk-=wh57tMaJxcH=kWE4xdKLjayKSDEVvMwHG4fKZ5tUHF6mg@mail.gmail.com>
Date:   Mon, 30 Aug 2021 15:00:06 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Al Viro <viro@...iv.linux.org.uk>
Cc:     Dan Williams <dan.j.williams@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Borislav Petkov <bp@...en8.de>,
        LKML <linux-kernel@...r.kernel.org>,
        "the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: [patch 01/10] x86/fpu/signal: Clarify exception handling in restore_fpregs_from_user()

On Mon, Aug 30, 2021 at 2:33 PM Al Viro <viro@...iv.linux.org.uk> wrote:
>
> There's a place where we care about #PF vs. #MC (see upthread)...

Interestingly (or perhaps not), that case is a problem case in general
for "fault_in_pages_readable()".

That function will only access data every PAGE_SIZE bytes, but if we
have other exceptions that can happen at a cacheline granularity, the
whole "retry after faulting pages in" may fail.

So that kind of

 - try to copy from user space

 - if that fails, do fault_in_pages_readable() and retry

loop can loop forever.
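
A minimal userspace sketch of that loop, not kernel code. Everything
prefixed sim_ is a hypothetical stand-in: pages carry a "present" bit
that a fault can fix, cachelines carry a "poisoned" bit that nothing
can fix, and the fault-in helper only touches one byte per page plus
the last byte of the range. The retry cap exists only so the sketch
terminates; the loop described above has none.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_CACHELINE   64u
#define SIM_PAGES        2u
#define SIM_LINES (SIM_PAGES * SIM_PAGE_SIZE / SIM_CACHELINE)

/* Hypothetical model of a user buffer (assumption, not a kernel type). */
struct sim_mem {
    bool present[SIM_PAGES];   /* cleared: access faults, fault-in fixes it */
    bool poisoned[SIM_LINES];  /* set: access always fails, nothing fixes it */
};

/* Stand-in for a user copy: returns the number of bytes NOT copied. */
static size_t sim_copy_from_user(const struct sim_mem *m, size_t off, size_t len)
{
    assert(off + len <= SIM_PAGES * SIM_PAGE_SIZE);
    for (size_t i = off; i < off + len; i++) {
        if (!m->present[i / SIM_PAGE_SIZE] || m->poisoned[i / SIM_CACHELINE])
            return off + len - i;
    }
    return 0;
}

/* Touch one byte: the page gets faulted in, only poison makes it fail. */
static int sim_probe(struct sim_mem *m, size_t addr)
{
    m->present[addr / SIM_PAGE_SIZE] = true;
    return m->poisoned[addr / SIM_CACHELINE] ? -1 : 0;
}

/* Stand-in for the fault-in helper: one probe per page, plus the last byte. */
static int sim_fault_in_readable(struct sim_mem *m, size_t off, size_t len)
{
    if (len == 0)
        return 0;
    size_t last = off + len - 1;
    for (size_t a = off; a <= last; a += SIM_PAGE_SIZE) {
        if (sim_probe(m, a) != 0)
            return -1;
    }
    return sim_probe(m, last);
}

/*
 * The copy/fault-in/retry loop.
 * Returns the attempt number on success, -1 if the fault-in reported an
 * error, and 0 if max_tries ran out (an uncapped loop would spin forever).
 */
static int sim_copy_with_retry(struct sim_mem *m, size_t off, size_t len,
                               int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (sim_copy_from_user(m, off, len) == 0)
            return attempt;
        if (sim_fault_in_readable(m, off, len) != 0)
            return -1;
    }
    return 0;
}
```

With only a missing page, the second attempt succeeds. With poison in
a cacheline that the per-page probes never touch, the fault-in keeps
reporting success and the copy keeps failing.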

restore_fpregs_from_user() is odd and special in trying to deal with
it by looking at the error code. I'm not convinced it's the right
thing to do, since it just means that all the other places we do this
can be problematic.

But since the Intel machine check stuff is so misdesigned and doesn't
work on any normal machines, most people can't test any of this, none
of this matters, and it's only broken on those "serious enterprise
machines" setups that people think are better, but are actually just
almost entirely untested and thus don't work right.

I'm not sure what the right model here is. We might need to make
fault_in_pages_readable() do things a cacheline at a time, at which
point those repeat loops start working, and the error code thing
becomes pointless.
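
A rough userspace sketch of what probing at cacheline granularity
would change. The sim_ names and the poison map are assumptions made
up for illustration; this is not the kernel helper and not a proposed
patch. The only difference between the two behaviours is the stride.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_CACHELINE   64u
#define SIM_BUF_SIZE  (2u * SIM_PAGE_SIZE)

/* Hypothetical poison map: one flag per cacheline of a simulated buffer. */
static bool sim_line_bad[SIM_BUF_SIZE / SIM_CACHELINE];

/*
 * Touch one byte every `stride` bytes, plus the last byte of the range.
 * Returns 0 if every touched byte was readable, -1 otherwise.
 * A stride of SIM_CACHELINE reaches every cacheline that overlaps the
 * range; a stride of SIM_PAGE_SIZE skips most of them.
 */
static int sim_fault_in_stride(size_t off, size_t len, size_t stride)
{
    assert(stride > 0 && off + len <= SIM_BUF_SIZE);
    if (len == 0)
        return 0;
    size_t last = off + len - 1;
    for (size_t a = off; a <= last; a += stride) {
        if (sim_line_bad[a / SIM_CACHELINE])
            return -1;
    }
    return sim_line_bad[last / SIM_CACHELINE] ? -1 : 0;
}
```

If cacheline 5 is marked bad, sim_fault_in_stride(0, 1024, SIM_PAGE_SIZE)
misses it and reports success, while the same call with SIM_CACHELINE
reports the failure, so a retry loop built on it would stop instead of
spinning.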

What I _am_ sure about is that the error code model doesn't work. It
may work in that one special case, but that just means that all the
non-special cases are broken.

So I'll argue that it's a fundamentally broken model, and that
_ASM_EXTABLE_FAULT thing is not just confusing, but actively hurtful.

            Linus