[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <70e08bd5-187a-daee-2822-1d9a437a9cff@intel.com>
Date: Fri, 11 Mar 2022 12:59:17 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Nadav Amit <namit@...are.com>
Cc: Linux-MM <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Andrew Cooper <andrew.cooper3@...rix.com>,
Andy Lutomirski <luto@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Peter Xu <peterx@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Nick Piggin <npiggin@...il.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
On 3/11/22 12:38, Nadav Amit wrote:
>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@...el.com> wrote:
...
>> Can any sane code trigger this?
>
> Well, regarding this question and the previous one: I do not think that
> this scenario is possible today since mprotect() holds the mmap_lock
> for write. There is no other code that I am aware of that toggles
> the NX bit on a present entry.
>
> But I will not bet my life on it. That’s the reason for the somewhat
> vague phrasing that I used.
>From the userspace perspective, mmap(MAP_FIXED) can do this too. But,
sane userspace can't rely on the syscall to have done any work and the
TLB flushing is currently done before the syscall returns.
I'd put it this way:
Today, it is possible for a thread to end up in access_error()
for a PF_INSN fault and observe a VM_EXEC VMA. If you are
generous, this could be considered a spurious fault.
However, the faulting thread would have had to race with the
thread which was changing the PTE and the VMA and is currently
*in* mprotect() (or some other syscall). In other words, the
faulting thread can encounter this situation, but it never had
any assurance from the kernel that it wouldn't fault. This is
because the faulting thread never had a chance to observe the
syscall return.
There is no evidence that the existing behavior can cause any
issues with sane userspace.
>>> index d0074c6ed31a..ad0ef0a6087a 100644
>>> --- a/arch/x86/mm/fault.c
>>> +++ b/arch/x86/mm/fault.c
>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>>> (error_code & X86_PF_INSTR), foreign))
>>> return 1;
>>>
>>> - if (error_code & X86_PF_WRITE) {
>>> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>>> + /*
>>> + * CPUs are not expected to set the two error code bits
>>> + * together, but to ensure that hypervisors do not misbehave,
>>> + * run an additional sanity check.
>>> + */
>>> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>>> + (X86_PF_WRITE|X86_PF_INSTR)) {
>>> + WARN_ON_ONCE(1);
>>> + return 1;
>>> + }
>>
>> access_error() is only used on the do_user_addr_fault() side of things.
>> Can we stick this check somewhere that also works for kernel address
>> faults? This is a generic sanity check. It can also be in a separate
>> patch.
>
> I can wrap it in a different function and also call it from
> do_kern_addr_fault() or spurious_kernel_fault().
>
> Anyhow, spurious_kernel_fault() should handle spurious faults on
> executable code correctly.
This is really about checking the sanity of the "hardware"-provided
error code. Let's just do it in handle_page_fault(), maybe hidden in a
function like:
void check_error_code_sanity(unsigned long error_code)
{
WARN_ON_ONCE(...);
}
You can leave the X86_PF_PK check in place for now. It's probably going
away soon anyway.
>> Also, we should *probably* stop talking about CPUs here. If there's
>> ever something wonky with error code bits, I'd put my money on a weird
>> hypervisor before any kind of CPU issue.
>
> I thought I manage to convey exactly that in the comment. Can you provide
> a better phrasing?
Maybe:
/*
* X86_PF_INSTR for instruction _fetches_. Fetches never write.
* X86_PF_WRITE should never be set with X86_PF_INSTR.
*
* This is most likely due to a buggy hypervisor.
*/
Powered by blists - more mailing lists