lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fa4e5e3d-431c-4dcb-9ffc-b20e6ee66e43@intel.com>
Date: Mon, 21 Jul 2025 18:11:12 -0700
From: Sohil Mehta <sohil.mehta@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, Dave Hansen
	<dave.hansen@...ux.intel.com>, <x86@...nel.org>
CC: Borislav Petkov <bp@...en8.de>, "H . Peter Anvin" <hpa@...or.com>, "Thomas
 Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Sean
 Christopherson" <seanjc@...gle.com>, Peter Zijlstra <peterz@...radead.org>,
	Vignesh Balasubramanian <vigbalas@....com>, Rick Edgecombe
	<rick.p.edgecombe@...el.com>, Oleg Nesterov <oleg@...hat.com>, "Chang S .
 Bae" <chang.seok.bae@...el.com>, Brian Gerst <brgerst@...il.com>, "Eric
 Biggers" <ebiggers@...gle.com>, Kees Cook <kees@...nel.org>, Chao Gao
	<chao.gao@...el.com>, Fushuai Wang <wangfushuai@...du.com>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 1/2] x86/fpu: Fix NULL dereference in avx512_status()

On 7/21/2025 3:33 PM, Dave Hansen wrote:
> On 7/21/25 14:53, Sohil Mehta wrote:
>> From: Fushuai Wang <wangfushuai@...du.com>
>>
>> When CONFIG_X86_DEBUG_FPU is set, reading /proc/[kthread]/arch_status
>> causes a NULL pointer dereference.
>>
>> For Kthreads tasks:
>>   proc_pid_arch_status()
>>     avx512_status()
>>       x86_task_fpu() => returns NULL when CONFIG_X86_DEBUG_FPU=y
>>       x86_task_fpu()->avx512_timestamp => NULL dereference
> 
> This seems to imply that CONFIG_X86_DEBUG_FPU _triggers_ the bug.
> 
> It certainly makes it obvious, but isn't there a bug with or without
> CONFIG_X86_DEBUG_FPU? Even without it, I think it'll access out of the
> init_task bounds.
> 

Yeah, there might be a bug without CONFIG_X86_DEBUG_FPU as well. I am
not sure. The NULL dereference only happens with CONFIG_X86_DEBUG_FPU
because it explicitly passes a NULL pointer.

Maybe we can just focus on the WARN_ON_ONCE(task->flags & PF_KTHREAD)
instead of the NULL dereference. For the init_task (PID 0) we never
follow this path since it doesn't show up in /proc.

>> Kernel threads aren't expected to access FPU state directly. However,
>> avx512_timestamp resides within struct fpu which lead to this unique
>> situation.
> 
> What does this mean? Most kernel threads have a 'struct fpu', right?
> 

I meant to say that even though kernel threads have a struct fpu, they
never access it directly using x86_task_fpu(). Kernel threads typically
use kernel_fpu_begin()/_end but reading the avx512_timestamp value is
the only unique reason, a pointer to struct fpu is needed.

>> It is uncertain whether kernel threads use AVX-512 in a meaningful way
>> that needs userspace reporting. For now, avoid reporting AVX-512 usage
>> for kernel threads.
> 
> It would be idea if this was more explicit about how this changes the
> ABI for kernel threads.
> 
> Could we make this more precise, please?
> 
> 	Report "AVX512_elapsed_ms: -1" for kernel tasks, whether they
> 	use AVX-512 or not. This is the same value that is reported for
> 	user tasks which have not been detected using AVX-512.
> 

Sure, will do.

>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index 9aa9ac8399ae..a75077c645b6 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -1855,19 +1855,18 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
>>  #ifdef CONFIG_PROC_PID_ARCH_STATUS
>>  /*
>>   * Report the amount of time elapsed in millisecond since last AVX512
>> - * use in the task.
>> + * use in the task. Report -1 if no AVX-512 usage.
>>   */
>>  static void avx512_status(struct seq_file *m, struct task_struct *task)
>>  {
>> -	unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
>> -	long delta;
>> +	unsigned long timestamp = 0;
>> +	long delta = -1;
>>  
>> -	if (!timestamp) {
>> -		/*
>> -		 * Report -1 if no AVX512 usage
>> -		 */
>> -		delta = -1;
>> -	} else {
>> +	/* Do not report AVX-512 usage for kernel threads */
>> +	if (!(task->flags & (PF_KTHREAD | PF_USER_WORKER)))
>> +		timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
>> +
>> +	if (timestamp) {
>>  		delta = (long)(jiffies - timestamp);
>>  		/*
>>  		 * Cap to LONG_MAX if time difference > LONG_MAX
> 
> Can we just do:
> 
>         unsigned long timestamp;
>         long delta;
> 
> 	if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
> 		return;
> 

I considered this, but it seemed like a bigger ABI change than the one
proposed.

1) We are already reporting -1 as the AVX512_elapsed_ms for most, if not
all kernel threads. On an SPR system with Ubuntu, I couldn't find any
kernel thread that showed anything other than -1. (But I wasn't running
any workloads either.)

2) Even if there are a few kernel threads that have AVX-512 usage,
reporting -1 instead of a valid number would only lead to slightly
suboptimal scheduler decisions.
But reporting AVX512_elapsed_ms only for some threads might cause
userspace to break unexpectedly if it isn't written to handle that.


> 	timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
> 
> 	...
> 
> for now, please? That way, there's no made up value for kernel threads.
> The value just isn't present in the file.

I felt the current one is the lesser of the two evils. I am fine with
your approach if you still prefer it.

I have now realized that there are too many unknowns for me to make a
reliable call. This patch was a result of sunk-cost fallacy :)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ