lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b11a5599-fda4-8747-8878-e743b8a501e0@intel.com>
Date:   Mon, 12 Nov 2018 07:46:25 -0800
From:   Dave Hansen <dave.hansen@...el.com>
To:     "Li, Aubrey" <aubrey.li@...ux.intel.com>,
        Aubrey Li <aubrey.li@...el.com>, tglx@...utronix.de,
        mingo@...hat.com, peterz@...radead.org, hpa@...or.com
Cc:     ak@...ux.intel.com, tim.c.chen@...ux.intel.com,
        arjan@...ux.intel.com, linux-kernel@...r.kernel.org
Subject: [RFC PATCH v2 1/2] x86/fpu: detect AVX task

On 11/11/18 9:38 PM, Li, Aubrey wrote:
> If there is a valid state in the AVX registers, we can say the tasks contains
> AVX instructions, can't we?

XRSTOR, for instance, can take XSAVE state out of the init state, but it
is not necessarily an AVX instruction.  In fact, we had a kernel bug
along the way that was accidentally doing this on systems without AVX.

>>> +#define	AVX_STATE_DECAY_COUNT	3
>>
>> How was this number chosen?  What does this mean?
>>
>> It appears that this is saying that after 3 non-AVX-512-using context
>> switches, the task is not considered to be using AVX512 any more.  That
>> seems a bit goofy because the context switch rate is highly dependent on
>> HZ, and on how often the task yields.
>>
>> Do we want this, or do we want something more time-based?
>>
> This counter is introduced here to solve the race of context switch and
> VZEROUPPER. 3 context switches mean the same thread is on-off CPU 3 times.
> Due to scheduling latency, 3 jiffies could only happen AVX task on-off just
> 1 time. So IMHO the context switches number is better here.

Imagine we have a HZ=1000 system where AVX_STATE_DECAY_COUNT=3.  That
means that a task can be marked as a non-AVX-512-user after not using it
for ~3 ms.  But, with HZ=250, that's ~12ms.

Also, don't forget that we have context switches from the timer
interrupt, but also from normal old operations that sleep.

Let's say our AVX-512 app was doing:

	while (foo) {
		do_avx_512();
		read(pipe, buf, len);
		read(pipe, buf, len);
		read(pipe, buf, len);
	}

And all three pipe reads context-switched the task.  That loop could
finish in way under 3HZ, but still end up in do_avx_512() each time with
fpu...avx->state=0.

BTW, I don't have a great solution for this.  I was just pointing out
one of the pitfalls from using context switch counts so strictly.

>> I'd really like if this looked like this:
>>
>> 	if (!have_avx_state_detect()) {
>> 		avx->state = 0;
>> 		return;
>> 	}
>>
>> Then, in have_avx_state_detect(), explain what XGETBV1 does.  BTW, I
>> don't think we *totally* need to duplicate the SDM definitions in kernel
>> code for each instruction.  It's fine to just say that it set 1 for
>> features not in the init state.
>>
> 
> Thomas suggested open this inline since this is only usage of xgetbv1. So I'll
> use static_cpu_has() here directly. Does it sound good?

Yeah, that's OK.  But, I'd still limit the comments to what it is doing
instead of defining the instruction.

>>> +	/*
>>> +	 * XINUSE is dynamic to track component state because VZEROUPPER
>>> +	 * happens on every function end and reset the bitmap to the
>>> +	 * initial configuration.
>>
>> This is confusing to me because VZEROUPPER is not apparently involved
>> here.  What is this trying to say?
>>
> VZERROUPPER instruction in the task resets the bitmap.

I think this is too much detail for a kernel comment.  Let's just say
that the task itself can get back to the 'init state'.

>>> +	 * State decay is introduced to solve the race condition between
>>> +	 * context switch and a function end. State is aggressively set
>>> +	 * once it's detected but need to be cleared by decay 3 context
>>> +	 * switches
>>> +	 */
>>
>> I'd probably say:
>>
>> 	AVX512-using processes frequently set AVX512 back to the init 	
>> 	state themselves.  Thus, this detection mechanism can miss.
>> 	The decay ensures that false-negatives do not immediately make
>> 	a task be considered as not using AVX512.
> 
> Thanks, will refine it in the next version.
> And yes, AVX512-using processoes set AVX512 back to the init state themselves
> by VZEROUPPER instructions.

... or XRSTOR, or an in-kernel XRSTOR from a signal entry.

The point is, I'd probably not actually mention VZEROUPPER.  It's
certainly the common way to init the state, but let's not the *only* way
we have to be concerned with.

>>> +	if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) {
>>
>> This is *just* an AVX512 state, right?  That isn't reflected in any
>> comments or naming.  Also, why just this state, and not any of the other
>> AVX512-related states?
> 
> Only this state causes significant frequency drop, other states not.

OK.  This is an extremely crucial bit of information to capture,
including the background.

>>> +/*
>>>   * Highest level per task FPU state data structure that
>>>   * contains the FPU register state plus various FPU
>>>   * state fields:
>>> @@ -303,6 +312,14 @@ struct fpu {
>>>  	unsigned char			initialized;
>>>  
>>>  	/*
>>> +	 * @avx_state:
>>> +	 *
>>> +	 * This data structure indicates whether this context
>>> +	 * contains AVX states
>>> +	 */
>>
>> Yeah, that's precisely what fpu->state.xsave.xfeatures does. :)
>> I see, will refine in the next version

One other thought about the new 'avx_state':

fxregs_state (which is a part of the XSAVE state) has some padding and
'sw_reserved' areas.  You *might* be able to steal some space there.
Not that this is a huge space eater, but why waste the space if we don't
have to?


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ