[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fa4e5e3d-431c-4dcb-9ffc-b20e6ee66e43@intel.com>
Date: Mon, 21 Jul 2025 18:11:12 -0700
From: Sohil Mehta <sohil.mehta@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, Dave Hansen
<dave.hansen@...ux.intel.com>, <x86@...nel.org>
CC: Borislav Petkov <bp@...en8.de>, "H . Peter Anvin" <hpa@...or.com>, "Thomas
Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Sean
Christopherson" <seanjc@...gle.com>, Peter Zijlstra <peterz@...radead.org>,
Vignesh Balasubramanian <vigbalas@....com>, Rick Edgecombe
<rick.p.edgecombe@...el.com>, Oleg Nesterov <oleg@...hat.com>, "Chang S .
Bae" <chang.seok.bae@...el.com>, Brian Gerst <brgerst@...il.com>, "Eric
Biggers" <ebiggers@...gle.com>, Kees Cook <kees@...nel.org>, Chao Gao
<chao.gao@...el.com>, Fushuai Wang <wangfushuai@...du.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 1/2] x86/fpu: Fix NULL dereference in avx512_status()
On 7/21/2025 3:33 PM, Dave Hansen wrote:
> On 7/21/25 14:53, Sohil Mehta wrote:
>> From: Fushuai Wang <wangfushuai@...du.com>
>>
>> When CONFIG_X86_DEBUG_FPU is set, reading /proc/[kthread]/arch_status
>> causes a NULL pointer dereference.
>>
>> For Kthreads tasks:
>> proc_pid_arch_status()
>> avx512_status()
>> x86_task_fpu() => returns NULL when CONFIG_X86_DEBUG_FPU=y
>> x86_task_fpu()->avx512_timestamp => NULL dereference
>
> This seems to imply that CONFIG_X86_DEBUG_FPU _triggers_ the bug.
>
> It certainly makes it obvious, but isn't there a bug with or without
> CONFIG_X86_DEBUG_FPU? Even without it, I think it'll access out of the
> init_task bounds.
>
Yeah, there might be a bug without CONFIG_X86_DEBUG_FPU as well. I am
not sure. The NULL dereference only happens with CONFIG_X86_DEBUG_FPU
because it explicitly passes a NULL pointer.
Maybe we can just focus on the WARN_ON_ONCE(task->flags & PF_KTHREAD)
instead of the NULL dereference. For the init_task (PID 0) we never
follow this path since it doesn't show up in /proc.
>> Kernel threads aren't expected to access FPU state directly. However,
>> avx512_timestamp resides within struct fpu which lead to this unique
>> situation.
>
> What does this mean? Most kernel threads have a 'struct fpu', right?
>
I meant to say that even though kernel threads have a struct fpu, they
never access it directly using x86_task_fpu(). Kernel threads typically
use kernel_fpu_begin()/_end but reading the avx512_timestamp value is
the only unique reason, a pointer to struct fpu is needed.
>> It is uncertain whether kernel threads use AVX-512 in a meaningful way
>> that needs userspace reporting. For now, avoid reporting AVX-512 usage
>> for kernel threads.
>
> It would be idea if this was more explicit about how this changes the
> ABI for kernel threads.
>
> Could we make this more precise, please?
>
> Report "AVX512_elapsed_ms: -1" for kernel tasks, whether they
> use AVX-512 or not. This is the same value that is reported for
> user tasks which have not been detected using AVX-512.
>
Sure, will do.
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index 9aa9ac8399ae..a75077c645b6 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -1855,19 +1855,18 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
>> #ifdef CONFIG_PROC_PID_ARCH_STATUS
>> /*
>> * Report the amount of time elapsed in millisecond since last AVX512
>> - * use in the task.
>> + * use in the task. Report -1 if no AVX-512 usage.
>> */
>> static void avx512_status(struct seq_file *m, struct task_struct *task)
>> {
>> - unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
>> - long delta;
>> + unsigned long timestamp = 0;
>> + long delta = -1;
>>
>> - if (!timestamp) {
>> - /*
>> - * Report -1 if no AVX512 usage
>> - */
>> - delta = -1;
>> - } else {
>> + /* Do not report AVX-512 usage for kernel threads */
>> + if (!(task->flags & (PF_KTHREAD | PF_USER_WORKER)))
>> + timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
>> +
>> + if (timestamp) {
>> delta = (long)(jiffies - timestamp);
>> /*
>> * Cap to LONG_MAX if time difference > LONG_MAX
>
> Can we just do:
>
> unsigned long timestamp;
> long delta;
>
> if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
> return;
>
I considered this, but it seemed like a bigger ABI change than the one
proposed.
1) We are already reporting -1 as the AVX512_elapsed_ms for most, if not
all kernel threads. On an SPR system with Ubuntu, I couldn't find any
kernel thread that showed anything other than -1. (But I wasn't running
any workloads either.)
2) Even if there are a few kernel threads that have AVX-512 usage,
reporting -1 instead of a valid number would only lead to slightly
suboptimal scheduler decisions.
But reporting AVX512_elapsed_ms only for some threads might cause
userspace to break unexpectedly if it isn't written to handle that.
> timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
>
> ...
>
> for now, please? That way, there's no made up value for kernel threads.
> The value just isn't present in the file.
I felt the current one is the lesser of the two evils. I am fine with
your approach if you still prefer it.
I have now realized that there are too many unknowns for me to make a
reliable call. This patch was a result of sunk-cost fallacy :)
Powered by blists - more mailing lists