linux-kernel - Re: [PATCH] x86: Use __fpu_invalidate_fpregs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ebae9a75-c465-9385-b502-39e5a7eee436@intel.com>
Date:   Wed, 23 Aug 2023 19:04:29 -0500
From:   Lijun Pan <lijun.pan@...el.com>
To:     Rick Edgecombe <rick.p.edgecombe@...el.com>, <tglx@...utronix.de>,
        <mingo@...hat.com>, <bp@...en8.de>, <x86@...nel.org>,
        <hpa@...or.com>, <dave.hansen@...ux.intel.com>, <luto@...nel.org>,
        <peterz@...radead.org>
CC:     <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: Use __fpu_invalidate_fpregs_state() in exec



On 8/18/2023 2:35 PM, Lijun Pan wrote:
> Hi Rick,
> 
> On 8/18/2023 12:03 PM, Rick Edgecombe wrote:
>> The thread flag TIF_NEED_FPU_LOAD indicates that the FPU saved state is
>> valid and should be reloaded when returning to userspace. However, the
>> kernel will skip doing this if the FPU registers are already valid as
>> determined by fpregs_state_valid(). The logic embedded there considers
>> the state valid if two cases are both true:
>>    1: fpu_fpregs_owner_ctx points to the current tasks FPU state
>>    2: the last CPU the registers were live in was the current CPU.
>>
>> This is usually correct logic. A CPU’s fpu_fpregs_owner_ctx is set to
>> the current FPU during the fpregs_restore_userregs() operation, so it
>> indicates that the registers have been restored on this CPU. But this
>> alone doesn’t preclude that the task hasn’t been rescheduled to a
>> different CPU, where the registers were modified, and then back to the
>> current CPU. To verify that this was not the case the logic relies on the
>> second condition. So the assumption is that if the registers have been
>> restored, AND they haven’t had the chance to be modified (by being
>> loaded on another CPU), then they MUST be valid on the current CPU.
>>
>> Besides the lazy FPU optimizations, the other cases where the FPU
>> registers might not be valid are when the kernel modifies the FPU 
>> register
>> state or the FPU saved buffer. In this case the operation modifying the
>> FPU state needs to let the kernel know the correspondence has been
>> broken. The comment in “arch/x86/kernel/fpu/context.h” has:
>> /*
>> ...
>>   * If the FPU register state is valid, the kernel can skip restoring the
>>   * FPU state from memory.
>>   *
>>   * Any code that clobbers the FPU registers or updates the in-memory
>>   * FPU state for a task MUST let the rest of the kernel know that the
>>   * FPU registers are no longer valid for this task.
>>   *
>>   * Either one of these invalidation functions is enough. Invalidate
>>   * a resource you control: CPU if using the CPU for something else
>>   * (with preemption disabled), FPU for the current task, or a task that
>>   * is prevented from running by the current task.
>>   */
>>
>> However, this is not completely true. When the kernel modifies the
>> registers or saved FPU state, it can only rely on
>> __fpu_invalidate_fpregs_state(), which wipes the FPU’s last_cpu
>> tracking. The exec path instead relies on fpregs_deactivate(), which sets
>> the CPU’s FPU context to NULL. This was observed to fail to restore the
>> reset FPU state to the registers when returning to userspace in the
>> following scenario:
>>
>> 1. A task is executing in userspace on CPU0
>>     - CPU0’s FPU context points to tasks
>>     - fpu->last_cpu=CPU0
>> 2. The task exec()’s
>> 3. While in the kernel the task reschedules to CPU1
>>     - CPU0 gets a thread executing in the kernel (such that no other
>>         FPU context is activated)
>>     - Scheduler sets task’s fpu->last_cpu=CPU0
>> 4. Continuing the exec, the task gets to
>>     fpu_flush_thread()->fpu_reset_fpregs()
>>     - Sets CPU1’s fpu context to NULL
>>     - Copies the init state to the task’s FPU buffer
>>     - Sets TIF_NEED_FPU_LOAD on the task
>> 5. The task reschedules back to CPU0 before completing the exec and
>>     returning to userspace
>>     - During the reschedule, scheduler finds TIF_NEED_FPU_LOAD is set
>>     - Skips saving the registers and updating task’s fpu→last_cpu,
>>         because TIF_NEED_FPU_LOAD is the canonical source.
>> 6. Now CPU0’s FPU context is still pointing to the task’s, and
>>     fpu->last_cpu is still CPU0. So fpregs_state_valid() returns true 
>> even
>>     though the reset FPU state has not been restored.
>>
>> So the root cause is that exec() is doing the wrong kind of 
>> invalidate. It
>> should reset fpu->last_cpu via __fpu_invalidate_fpregs_state(). Further,
>> fpu__drop() doesn't really seem appropriate as the task (and FPU) are not
>> going away, they are just getting reset as part of an exec. So switch to
>> __fpu_invalidate_fpregs_state().
>>
>> Also, delete the misleading comment that says that either kind of
>> invalidate will be enough, because it’s not always the case.
>>
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
>>
>> ---
>> Hi,
>>
>> This was spotted on a specific pre-production setup running an
>> out-of-tree glibc and the x86/shstk tip branch. The symptom observed
>> was a shadow stack segfault in ld-linux. The test case was a kernel
>> build with a high number of threads and it was able to generate the
>> segfault relatively reliably.
>>
>> I was surprised to find that the root cause was not related to supervisor
>> xsave and instead seems to be a general FPU bug where the FPU state will
>> not be reset during exec if rescheduling happens twice in certain points
>> during the operation. It seems to be so old that I had a hard time
>> figuring which commit to blame.
>>
>> A guess at how this was able to lurk so long is the combination of two
>> factors. One is that this specific test environment and workload seemed
>> to like to generate this specific pattern of scheduling for some reason.
>> So the fact it was reliably reproduced there could be not be 
>> indicative of
>> the typical case. The other factor is that CET features will rather 
>> loudly
>> alert to any corrupted FPU state due to the enforcement nature of that
>> state. So maybe this FPU reset miss during exec happened less commonly in
>> the wild, but most existing apps can survive it silently.
>>
>> But since it's still a bit surprising, I would appreciate some extra
>> scrutiny on the reasoning. I verified the FPU state was not getting reset
>> during exec’s that experienced rescheduling to another CPU and back at
>> times as described in the commit log. Then following the logic in the
>> code, failing to restore the FPU would be expected. And fixing that logic
>> fixed the observed issue. But still surprised this wasn't seen before 
>> now.
>>
>> Thanks,
>> Rick
>> ---
>>   arch/x86/kernel/fpu/context.h | 3 +--
>>   arch/x86/kernel/fpu/core.c    | 2 +-
>>   2 files changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kernel/fpu/context.h 
>> b/arch/x86/kernel/fpu/context.h
>> index af5cbdd9bd29..f6d856bd50bc 100644
>> --- a/arch/x86/kernel/fpu/context.h
>> +++ b/arch/x86/kernel/fpu/context.h
>> @@ -19,8 +19,7 @@
>>    * FPU state for a task MUST let the rest of the kernel know that the
>>    * FPU registers are no longer valid for this task.
>>    *
>> - * Either one of these invalidation functions is enough. Invalidate
>> - * a resource you control: CPU if using the CPU for something else
>> + * Invalidate a resource you control: CPU if using the CPU for 
>> something else
>>    * (with preemption disabled), FPU for the current task, or a task that
>>    * is prevented from running by the current task.
>>    */
>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>> index e03b6b107b20..a86d37052a64 100644
>> --- a/arch/x86/kernel/fpu/core.c
>> +++ b/arch/x86/kernel/fpu/core.c
>> @@ -713,7 +713,7 @@ static void fpu_reset_fpregs(void)
>>       struct fpu *fpu = &current->thread.fpu;
>>       fpregs_lock();
>> -    fpu__drop(fpu);
>> +    __fpu_invalidate_fpregs_state(fpu);
>>       /*
>>        * This does not change the actual hardware registers. It just
>>        * resets the memory image and sets TIF_NEED_FPU_LOAD so a
> 
> 
> Thanks for the patch. Let me get back to my server (offline now) next 
> Monday and will add a "Test-by: Lijun Pan <lijun.pan@...el.com>" if it 
> passes.

I have run some relevant tests (compiling kernel with high thread amount 
repeatedly, make -j 256),
glibc tests (https://gitlab.com/x86-glibc/glibc/-/blob/master/INSTALL),
fpu & cet (ibt, shadow stack) test (https://github.com/intel/lkvs),
and not yet found regressions.

Tested-by: Lijun Pan <lijun.pan@...el.com>
Acked-by: Lijun Pan <lijun.pan@...el.com>

Since this problem was first reported by Lei Wang, I suggest adding:
Reported-by: Lei Wang <lei4.wang@...el.com>

Thanks,
Lijun

> 
> In our bug case, probably just switching to
> __fpu_invalidate_fpregs_state(fpu) from fpu__drop(fpu) is enough.
> 
> Maybe there are some other cases that need
> __this_cpu_write(fpu_fpregs_owner_ctx, NULL) through fpu__drop() call, 
> which I don't know.
> 
> Here is the excerpt of the call sequence:
> fpu__drop() -> fpregs_deactivate() -> 
> __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
> arch/x86/kernel/fpu/context.h has
> static inline void __cpu_invalidate_fpregs_state(void)
> {
>      __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
> }
> static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
> {
>      fpu->last_cpu = -1;
> }
> static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
> {
>     return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == 
> fpu->last_cpu;
> }
> 
> So, I am thinking if it is more rigorous to have both 
> (__cpu_invalidate_fpregs_state, __fpu_invalidate_fpregs_state) 
> invalidated, similarly as fpregs_state_valid checks both conditions.
> 
> code changes like below:
> diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
> index 958accf2ccf0..fd3304bed0a2 100644
> --- a/arch/x86/kernel/fpu/context.h
> +++ b/arch/x86/kernel/fpu/context.h
> @@ -19,8 +19,8 @@
>    * FPU state for a task MUST let the rest of the kernel know that the
>    * FPU registers are no longer valid for this task.
>    *
> - * Either one of these invalidation functions is enough. Invalidate
> - * a resource you control: CPU if using the CPU for something else
> + * To be more rigorous and to prevent from future corner case, Invalidate
> + * both resources you control: CPU if using the CPU for something else
>    * (with preemption disabled), FPU for the current task, or a task that
>    * is prevented from running by the current task.
>    */
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 97a899bf98b9..08b9cef0e076 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -725,6 +725,7 @@ static void fpu_reset_fpregs(void)
> 
>          fpregs_lock();
>          fpu__drop(fpu);
> +       __fpu_invalidate_fpregs_state(fpu);
>          /*
>           * This does not change the actual hardware registers. It just
>           * resets the memory image and sets TIF_NEED_FPU_LOAD so a
> 
> Thanks,
> Lijun
>