Date:   Thu, 21 Sep 2023 16:24:57 +0800
From:   Yi Sun <yi.sun@...el.com>
To:     Ingo Molnar <mingo@...nel.org>
CC:     <dave.hansen@...el.com>, <linux-kernel@...r.kernel.org>,
        <x86@...nel.org>, <sohil.mehta@...el.com>, <ak@...ux.intel.com>,
        <ilpo.jarvinen@...ux.intel.com>, <heng.su@...el.com>,
        <tony.luck@...el.com>, <yi.sun@...ux.intel.com>,
        <yu.c.chen@...el.com>
Subject: Re: [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR

On 21.09.2023 09:03, Ingo Molnar wrote:
>
>* Yi Sun <yi.sun@...el.com> wrote:
>
>> -#define XSTATE_XRESTORE(st, lmask, hmask)				\
>> +#define __XSTATE_XRESTORE(st, lmask, hmask)				\
>>  	asm volatile(ALTERNATIVE(XRSTOR,				\
>>  				 XRSTORS, X86_FEATURE_XSAVES)		\
>>  		     "\n"						\
>> @@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
>>  		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
>>  		     : "memory")
>>
>> +#if defined(CONFIG_X86_DEBUG_FPU)
>> +#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
>> +	do {								\
>> +		struct fpstate *f = fps;				\
>> +		u64 tc = -1;						\
>> +		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
>> +			tc = trace_clock();				\
>> +		__XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err);	\
>> +		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
>> +			trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
>> +	} while (0)
>> +
>> +#define XSTATE_XRESTORE(fps, lmask, hmask)				\
>> +	do {								\
>> +		struct fpstate *f = fps;				\
>> +		u64 tc = -1;						\
>> +		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
>> +			tc = trace_clock();				\
>> +		__XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask);	\
>> +		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
>> +			trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
>
>This v7 version does not adequately address the review feedback I gave for
>v6: it adds tracing overhead to potential hot paths, and putting it behind
>CONFIG_X86_DEBUG_FPU is not a solution either: it's default-y, so de-facto
>enabled on all major distros...
Hi Ingo,

Appreciate your comments!

What would you think about adding a new config option here instead of
CONFIG_X86_DEBUG_FPU, with a default of n?
That way, all of the overhead would be removed from the hot paths:
```
#else
#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
		__XSTATE_XSAVE(&(fps)->regs.xsave, lmask, hmask, err)
#define XSTATE_XRESTORE(fps, lmask, hmask)				\
		__XSTATE_XRESTORE(&(fps)->regs.xsave, lmask, hmask)
#endif
#endif
```
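A minimal Kconfig entry for such an option might look like the sketch below (the name X86_FPU_TRACE_LATENCY is hypothetical, not taken from the patch):

```
config X86_FPU_TRACE_LATENCY
	bool "Trace latency of XSAVE/XRSTOR"
	depends on X86_64
	default n
	help
	  Emit trace events that measure the latency of the XSAVE and
	  XRSTOR instructions. This adds overhead to FPU context
	  switching, so keep it disabled unless profiling XSTATE costs.
```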
>
>It seems unnecessarily complex: why does it have to measure latency
>directly? Tracepoints *by default* come with event timestamps. A latency
>measurement tool should be able to subtract two timestamps to extract the
>latency between two tracepoints...
>
>In fact, function tracing is enabled on all major Linux distros:
>
>  kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
>  CONFIG_HAVE_FUNCTION_TRACER=y
>  CONFIG_FUNCTION_TRACER=y
>
>Why not just enable function tracing for the affected FPU context switching
>functions?
>
>The relevant functions are already standalone in a typical kernel:
>
>   xsave:                # ffffffff8103cfe0 T save_fpregs_to_fpstate
>   xrstor:               # ffffffff8103d160 T restore_fpregs_from_fpstate
>   xrstor_supervisor:    # ffffffff8103dc50 T fpu__clear_user_states
>
>... and FPU context switching overhead dominates the cost of these
>functions.
>
I am aware of those tracepoints, but we add new ones for two reasons:
1. As described in the patch comments, it is more accurate to calculate
the latency within a single trace event, which does NOT include the
latency of the trace functions themselves. In our experiments, the cost
of the trace functions is much higher than the latency of XSAVES itself.

Calculating the latency by subtracting two timestamps necessarily
includes the latency of the tracepoints themselves, which would drown
out the actual latency data.

2. We want to measure at a finer granularity. The newly added
tracepoints dump XINUSE and RFBM each time they are called. This lets us
collect the latency and group the data by the FPU feature components
indicated in the XCR0 bits.

Thanks
    --Sun, Yi
