linux-kernel - Re: [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls

Message-Id: <b8e86bd9-f8a1-4f37-8f1a-ae0b6209d922@app.fastmail.com>
Date:   Tue, 04 Oct 2022 10:43:47 -0700
From:   "Andy Lutomirski" <luto@...nel.org>
To:     "Ali Raza" <aliraza@...edu>,
        "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>
Cc:     "Jonathan Corbet" <corbet@....net>, masahiroy@...nel.org,
        michal.lkml@...kovi.net,
        "Nick Desaulniers" <ndesaulniers@...gle.com>,
        "Thomas Gleixner" <tglx@...utronix.de>,
        "Ingo Molnar" <mingo@...hat.com>, "Borislav Petkov" <bp@...en8.de>,
        "Dave Hansen" <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        "Kees Cook" <keescook@...omium.org>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        "Al Viro" <viro@...iv.linux.org.uk>,
        "Arnd Bergmann" <arnd@...db.de>, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        "Steven Rostedt" <rostedt@...dmis.org>,
        "Ben Segall" <bsegall@...gle.com>, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com,
        "Paolo Bonzini" <pbonzini@...hat.com>, jpoimboe@...nel.org,
        linux-doc@...r.kernel.org, linux-kbuild@...r.kernel.org,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        linux-arch@...r.kernel.org,
        "the arch/x86 maintainers" <x86@...nel.org>, rjones@...hat.com,
        munsoner@...edu, tommyu@...edu, drepper@...hat.com,
        lwoodman@...hat.com, mboydmcse@...il.com, okrieg@...edu,
        rmancuso@...edu, "Daniel Bristot de Oliveira" <bristot@...nel.org>
Subject: Re: [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls



On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> If a UKL application makes a system call, it won't use the syscall
> assembly instruction. Instead, the application will use the call
> instruction to go to the kernel entry point. Instead of adding checks to
> the normal entry_SYSCALL_64 to see if we came here from a UKL task or a
> normal application task, we create a totally new entry point called
> ukl_entry_SYSCALL_64. This allows the normal entry point to be unchanged
> and simplifies the UKL specific code as well.
>
> ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64 except that it has to
> populate %rcx with the return address manually (the syscall instruction
> does that automatically for normal application tasks). This allows the
> pt_regs to be correct. Also, we have to push the flags onto the user stack,
> because on the return path, we first switch to the user stack, then pop the
> flags and then return. Popping the flags would re-enable interrupts, so we
> don't want to be stuck on the kernel stack when an interrupt hits. All this
> can be done with an iret instruction, but a call/iret pair performs way
> slower than a call/ret pair.
>
> Also, on the entry path, we make sure the context flag, i.e., in_user, is
> set to 1 to indicate we are now in kernel context, so any new interrupts
> don't have to go through kernel entry code again. This is normally done
> with the CS value on stack, but in the UKL case that will always be a
> kernel value. On the way back, in_user is switched back to 2 to indicate
> that application context is now being entered. All non-UKL tasks have the
> in_user value set to 0.


>
> The UKL application uses a slightly different value for CS: instead of
> 0x33, we use 0xC3. As most of the tests compare only the least significant
> nibble, they behave as expected. The C value in the second nibble allows us
> to distinguish between user space and UKL application code.

My intuition would be to try this the other way around.  Use an actual honest CS (specifically __KERNEL_CS) for pt_regs->cs.  Translate at the user ABI boundary instead.  After all, a UKL task is essentially just a kernel thread that happens to have a pt_regs area.
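
For reference, the selector arithmetic in the quoted description can be sketched in C. This is only an illustration of the bit tests being discussed, not code from the patch: the values 0x33 and 0xC3 come from the quoted text, while the macro and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the quoted patch description. */
#define USER_CS_SEL 0x33u /* CS of an ordinary 64-bit user task */
#define UKL_CS_SEL  0xC3u /* CS value the patch records for UKL application code */

/* Hypothetical helper: the requested privilege level is the low two bits. */
static inline unsigned int sel_rpl(uint16_t cs)
{
	return cs & 0x3u;
}

/* Hypothetical helper: a test that looks only at the least significant nibble. */
static inline bool low_nibble_matches_user(uint16_t cs)
{
	return (cs & 0xFu) == (USER_CS_SEL & 0xFu);
}

/* Hypothetical helper: the second nibble tells UKL code apart from user space. */
static inline bool is_ukl_cs(uint16_t cs)
{
	return (cs & 0xF0u) == (UKL_CS_SEL & 0xF0u);
}

/* Small self-check showing how the two values compare. */
void selector_sketch_selftest(void)
{
	assert(sel_rpl(USER_CS_SEL) == sel_rpl(UKL_CS_SEL));
	assert(low_nibble_matches_user(UKL_CS_SEL));
	assert(is_ukl_cs(UKL_CS_SEL) && !is_ukl_cs(USER_CS_SEL));
}
```

Both values look identical to any check that masks the low bits, which is why existing tests keep working in the quoted scheme, and also why pt_regs->cs no longer describes the privilege level the code really ran at.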


>
> The rest of the code makes sure the above-mentioned in_user context
> tracking is done for all entry and exit cases, i.e., for interrupts,
> exceptions etc. For a UKL task, if the in_user value is 2, we treat it as
> an application task, and if it is 1, we treat it as coming from kernel
> context. We skip these checks if in_user is 0.

By "context tracking" are you referring to RCU?  Since a UKL task is essentially a kernel thread, what "entry" is there other than setting up pt_regs?
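
To make the in_user convention from the quoted text concrete, here is a minimal C sketch of the three states and the transitions described. The values 0, 1 and 2 are from the quoted description; the enum, struct and function names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three in_user values from the quoted description. */
enum ukl_in_user {
	UKL_NOT_UKL   = 0, /* non-UKL task: the extra checks are skipped */
	UKL_IN_KERNEL = 1, /* UKL task currently in kernel context */
	UKL_IN_APP    = 2, /* UKL task currently in application context */
};

/* Hypothetical stand-in for the per-task state. */
struct task_sketch {
	enum ukl_in_user in_user;
};

/* Entry path: a UKL task records that it is now in kernel context. */
void ukl_enter_kernel(struct task_sketch *t)
{
	assert(t != NULL);
	if (t->in_user == UKL_NOT_UKL)
		return; /* non-UKL tasks keep in_user == 0 */
	t->in_user = UKL_IN_KERNEL;
}

/* Return path: a UKL task records that application context is being entered. */
void ukl_exit_to_app(struct task_sketch *t)
{
	assert(t != NULL);
	if (t->in_user == UKL_NOT_UKL)
		return;
	t->in_user = UKL_IN_APP;
}

/*
 * Decide whether an interrupt or exception arrived from application context.
 * For non-UKL tasks the saved CS answers this (passed in as cs_is_user);
 * for UKL tasks CS is always a kernel value, so in_user is consulted instead.
 */
bool treat_as_application(const struct task_sketch *t, bool cs_is_user)
{
	assert(t != NULL);
	if (t->in_user == UKL_NOT_UKL)
		return cs_is_user;
	return t->in_user == UKL_IN_APP;
}
```

In this sketch the flag carries exactly the information that the saved CS carries for ordinary tasks, which is the part of the design the question above is probing.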

>
> swapgs_restore_regs_and_return_to_usermode changes also make sure that
> in_user is correct and then we iret back.
>
> Double fault handling is a special case. Normally, if a user stack suffers a
> page fault, hardware switches to a kernel stack and pushes a frame onto the
> kernel stack. This switch only happens if the execution was in user
> privilege level when the page fault occurred. For UKL, execution is always
> in kernel level, so when the user stack suffers a page fault, no switch to
> a pinned kernel stack happens, and hardware tries to push state on the
> already faulting user stack. This generates a double fault. So we handle
> this case in the double fault handler by assuming any double fault is
> actually a user stack page fault. This can also be fixed by making all page
> faults go through a pinned stack using the IST mechanism. We have tried and
> tested that, but in the interest of touching as little code as possible, we
> chose this option instead.

Eww.  I guess this is a real problem, but eww.
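
The stack-switch behavior behind this can be modeled with a toy C sketch. It is a simplification of what the quoted text describes, not kernel or hardware code: all names are hypothetical, and the model only captures the rule that the CPU moves to a known-good stack on a privilege change or when the vector uses IST.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fault_result {
	PAGE_FAULT_DELIVERED,     /* exception frame pushed successfully */
	ESCALATED_TO_DOUBLE_FAULT /* pushing the frame itself faulted */
};

/* Hypothetical, highly simplified CPU state at the time of the page fault. */
struct cpu_sketch {
	int  cpl;                  /* 3 = user privilege, 0 = kernel privilege */
	bool current_stack_mapped; /* false if the stack page itself is missing */
	bool page_fault_uses_ist;  /* true if #PF is routed through an IST stack */
};

enum fault_result deliver_page_fault(const struct cpu_sketch *c)
{
	assert(c != NULL);
	assert(c->cpl == 0 || c->cpl == 3);

	/* A privilege change or an IST entry moves to a pinned kernel stack. */
	bool switches_stack = (c->cpl == 3) || c->page_fault_uses_ist;
	if (switches_stack)
		return PAGE_FAULT_DELIVERED;

	/* No switch: the frame goes onto the stack that may have just faulted. */
	return c->current_stack_mapped ? PAGE_FAULT_DELIVERED
				       : ESCALATED_TO_DOUBLE_FAULT;
}
```

With cpl set to 0 and an unmapped stack, the model reports a double fault, which is the UKL case in the quoted text. Setting page_fault_uses_ist corresponds to the IST alternative the patch description says was tried and tested.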