[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ebae9a75-c465-9385-b502-39e5a7eee436@intel.com>
Date: Wed, 23 Aug 2023 19:04:29 -0500
From: Lijun Pan <lijun.pan@...el.com>
To: Rick Edgecombe <rick.p.edgecombe@...el.com>, <tglx@...utronix.de>,
<mingo@...hat.com>, <bp@...en8.de>, <x86@...nel.org>,
<hpa@...or.com>, <dave.hansen@...ux.intel.com>, <luto@...nel.org>,
<peterz@...radead.org>
CC: <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: Use __fpu_invalidate_fpregs_state() in exec
On 8/18/2023 2:35 PM, Lijun Pan wrote:
> Hi Rick,
>
> On 8/18/2023 12:03 PM, Rick Edgecombe wrote:
>> The thread flag TIF_NEED_FPU_LOAD indicates that the FPU saved state is
>> valid and should be reloaded when returning to userspace. However, the
>> kernel will skip doing this if the FPU registers are already valid as
>> determined by fpregs_state_valid(). The logic embedded there considers
>> the state valid if two cases are both true:
>> 1: fpu_fpregs_owner_ctx points to the current tasks FPU state
>> 2: the last CPU the registers were live in was the current CPU.
>>
>> This is usually correct logic. A CPU’s fpu_fpregs_owner_ctx is set to
>> the current FPU during the fpregs_restore_userregs() operation, so it
>> indicates that the registers have been restored on this CPU. But this
>> alone doesn’t preclude that the task hasn’t been rescheduled to a
>> different CPU, where the registers were modified, and then back to the
>> current CPU. To verify that this was not the case the logic relies on the
>> second condition. So the assumption is that if the registers have been
>> restored, AND they haven’t had the chance to be modified (by being
>> loaded on another CPU), then they MUST be valid on the current CPU.
>>
>> Besides the lazy FPU optimizations, the other cases where the FPU
>> registers might not be valid are when the kernel modifies the FPU
>> register
>> state or the FPU saved buffer. In this case the operation modifying the
>> FPU state needs to let the kernel know the correspondence has been
>> broken. The comment in “arch/x86/kernel/fpu/context.h” has:
>> /*
>> ...
>> * If the FPU register state is valid, the kernel can skip restoring the
>> * FPU state from memory.
>> *
>> * Any code that clobbers the FPU registers or updates the in-memory
>> * FPU state for a task MUST let the rest of the kernel know that the
>> * FPU registers are no longer valid for this task.
>> *
>> * Either one of these invalidation functions is enough. Invalidate
>> * a resource you control: CPU if using the CPU for something else
>> * (with preemption disabled), FPU for the current task, or a task that
>> * is prevented from running by the current task.
>> */
>>
>> However, this is not completely true. When the kernel modifies the
>> registers or saved FPU state, it can only rely on
>> __fpu_invalidate_fpregs_state(), which wipes the FPU’s last_cpu
>> tracking. The exec path instead relies on fpregs_deactivate(), which sets
>> the CPU’s FPU context to NULL. This was observed to fail to restore the
>> reset FPU state to the registers when returning to userspace in the
>> following scenario:
>>
>> 1. A task is executing in userspace on CPU0
>> - CPU0’s FPU context points to tasks
>> - fpu->last_cpu=CPU0
>> 2. The task exec()’s
>> 3. While in the kernel the task reschedules to CPU1
>> - CPU0 gets a thread executing in the kernel (such that no other
>> FPU context is activated)
>> - Scheduler sets task’s fpu->last_cpu=CPU0
>> 4. Continuing the exec, the task gets to
>> fpu_flush_thread()->fpu_reset_fpregs()
>> - Sets CPU1’s fpu context to NULL
>> - Copies the init state to the task’s FPU buffer
>> - Sets TIF_NEED_FPU_LOAD on the task
>> 5. The task reschedules back to CPU0 before completing the exec and
>> returning to userspace
>> - During the reschedule, scheduler finds TIF_NEED_FPU_LOAD is set
>> - Skips saving the registers and updating task’s fpu→last_cpu,
>> because TIF_NEED_FPU_LOAD is the canonical source.
>> 6. Now CPU0’s FPU context is still pointing to the task’s, and
>> fpu->last_cpu is still CPU0. So fpregs_state_valid() returns true
>> even
>> though the reset FPU state has not been restored.
>>
>> So the root cause is that exec() is doing the wrong kind of
>> invalidate. It
>> should reset fpu->last_cpu via __fpu_invalidate_fpregs_state(). Further,
>> fpu__drop() doesn't really seem appropriate as the task (and FPU) are not
>> going away, they are just getting reset as part of an exec. So switch to
>> __fpu_invalidate_fpregs_state().
>>
>> Also, delete the misleading comment that says that either kind of
>> invalidate will be enough, because it’s not always the case.
>>
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
>>
>> ---
>> Hi,
>>
>> This was spotted on a specific pre-production setup running an
>> out-of-tree glibc and the x86/shstk tip branch. The symptom observed
>> was a shadow stack segfault in ld-linux. The test case was a kernel
>> build with a high number of threads and it was able to generate the
>> segfault relatively reliably.
>>
>> I was surprised to find that the root cause was not related to supervisor
>> xsave and instead seems to be a general FPU bug where the FPU state will
>> not be reset during exec if rescheduling happens twice in certain points
>> during the operation. It seems to be so old that I had a hard time
>> figuring which commit to blame.
>>
>> A guess at how this was able to lurk so long is the combination of two
>> factors. One is that this specific test environment and workload seemed
>> to like to generate this specific pattern of scheduling for some reason.
>> So the fact it was reliably reproduced there could be not be
>> indicative of
>> the typical case. The other factor is that CET features will rather
>> loudly
>> alert to any corrupted FPU state due to the enforcement nature of that
>> state. So maybe this FPU reset miss during exec happened less commonly in
>> the wild, but most existing apps can survive it silently.
>>
>> But since it's still a bit surprising, I would appreciate some extra
>> scrutiny on the reasoning. I verified the FPU state was not getting reset
>> during exec’s that experienced rescheduling to another CPU and back at
>> times as described in the commit log. Then following the logic in the
>> code, failing to restore the FPU would be expected. And fixing that logic
>> fixed the observed issue. But still surprised this wasn't seen before
>> now.
>>
>> Thanks,
>> Rick
>> ---
>> arch/x86/kernel/fpu/context.h | 3 +--
>> arch/x86/kernel/fpu/core.c | 2 +-
>> 2 files changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kernel/fpu/context.h
>> b/arch/x86/kernel/fpu/context.h
>> index af5cbdd9bd29..f6d856bd50bc 100644
>> --- a/arch/x86/kernel/fpu/context.h
>> +++ b/arch/x86/kernel/fpu/context.h
>> @@ -19,8 +19,7 @@
>> * FPU state for a task MUST let the rest of the kernel know that the
>> * FPU registers are no longer valid for this task.
>> *
>> - * Either one of these invalidation functions is enough. Invalidate
>> - * a resource you control: CPU if using the CPU for something else
>> + * Invalidate a resource you control: CPU if using the CPU for
>> something else
>> * (with preemption disabled), FPU for the current task, or a task that
>> * is prevented from running by the current task.
>> */
>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>> index e03b6b107b20..a86d37052a64 100644
>> --- a/arch/x86/kernel/fpu/core.c
>> +++ b/arch/x86/kernel/fpu/core.c
>> @@ -713,7 +713,7 @@ static void fpu_reset_fpregs(void)
>> struct fpu *fpu = ¤t->thread.fpu;
>> fpregs_lock();
>> - fpu__drop(fpu);
>> + __fpu_invalidate_fpregs_state(fpu);
>> /*
>> * This does not change the actual hardware registers. It just
>> * resets the memory image and sets TIF_NEED_FPU_LOAD so a
>
>
> Thanks for the patch. Let me get back to my server (offline now) next
> Monday and will add a "Test-by: Lijun Pan <lijun.pan@...el.com>" if it
> passes.
I have run some relevant tests (compiling kernel with high thread amount
repeatedly, make -j 256),
glibc tests (https://gitlab.com/x86-glibc/glibc/-/blob/master/INSTALL),
fpu & cet (ibt, shadow stack) test (https://github.com/intel/lkvs),
and not yet found regressions.
Tested-by: Lijun Pan <lijun.pan@...el.com>
Acked-by: Lijun Pan <lijun.pan@...el.com>
Since this problem was first reported by Lei Wang, I suggest adding:
Reported-by: Lei Wang <lei4.wang@...el.com>
Thanks,
Lijun
>
> In our bug case, probably just switching to
> __fpu_invalidate_fpregs_state(fpu) from fpu__drop(fpu) is enough.
>
> Maybe there are some other cases that need
> __this_cpu_write(fpu_fpregs_owner_ctx, NULL) through fpu__drop() call,
> which I don't know.
>
> Here is the excerpt of the call sequence:
> fpu__drop() -> fpregs_deactivate() ->
> __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
> arch/x86/kernel/fpu/context.h has
> static inline void __cpu_invalidate_fpregs_state(void)
> {
> __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
> }
> static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
> {
> fpu->last_cpu = -1;
> }
> static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
> {
> return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu ==
> fpu->last_cpu;
> }
>
> So, I am thinking if it is more rigorous to have both
> (__cpu_invalidate_fpregs_state, __fpu_invalidate_fpregs_state)
> invalidated, similarly as fpregs_state_valid checks both conditions.
>
> code changes like below:
> diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
> index 958accf2ccf0..fd3304bed0a2 100644
> --- a/arch/x86/kernel/fpu/context.h
> +++ b/arch/x86/kernel/fpu/context.h
> @@ -19,8 +19,8 @@
> * FPU state for a task MUST let the rest of the kernel know that the
> * FPU registers are no longer valid for this task.
> *
> - * Either one of these invalidation functions is enough. Invalidate
> - * a resource you control: CPU if using the CPU for something else
> + * To be more rigorous and to prevent from future corner case, Invalidate
> + * both resources you control: CPU if using the CPU for something else
> * (with preemption disabled), FPU for the current task, or a task that
> * is prevented from running by the current task.
> */
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 97a899bf98b9..08b9cef0e076 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -725,6 +725,7 @@ static void fpu_reset_fpregs(void)
>
> fpregs_lock();
> fpu__drop(fpu);
> + __fpu_invalidate_fpregs_state(fpu);
> /*
> * This does not change the actual hardware registers. It just
> * resets the memory image and sets TIF_NEED_FPU_LOAD so a
>
> Thanks,
> Lijun
>
Powered by blists - more mailing lists