linux-kernel - Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

Message-ID: <87bkws6hmc.ffs@tglx>
Date:   Fri, 22 Apr 2022 21:30:19 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Tom Lendacky <thomas.lendacky@....com>,
        Dave Hansen <dave.hansen@...el.com>,
        LKML <linux-kernel@...r.kernel.org>
Cc:     x86@...nel.org, Andrew Cooper <andrew.cooper3@...rix.com>,
        "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is
 supported

On Wed, Apr 20 2022 at 13:15, Tom Lendacky wrote:
> On 4/19/22 16:22, Thomas Gleixner wrote:
>>> That was bare metal and I just checked that this was a production config
>>> and not some weird debug muck which breaks large pages. I'll look deeper
>>> into that.
>> 
>> I can't find any reasonable explanation. The pages are definitely large
>> pages, so yes the dTLB miss count does not make sense, but it's
>> consistently faster and it's always the dTLB miss count which makes the
>> big difference according to perf.
>> 
>> For enhanced fun, I ran the lot on an AMD Zen3 machine, and with the same
>> test case (hackbench -l 10000) repeated 10 times by perf stat it is
>> consistently slower than the non-optimized variant. There is at least an
>> explanation for that: a tight loop of 1 million xgetbv(1) invocations takes
>> 9 million cycles on a SKL-X and 50 million cycles on an AMD Zen3.
>
> I'll take a look into this and see what I find. Might be interesting to 
> see if the actual XSAVES is slower or quicker, too, based on the input mask.
>
> If the performance slowdown shows up in real world benchmarks, we might 
> want to consider not using the xgetbv() call on AMD.

As things stand, I'm not going to pursue this further at the moment.

The effect on SKL-X is not explainable; in particular, the dTLB miss count
decrease does not make any sense. Aside from that, I just figured out that
it is very sensitive to kernel configurations, and I have no idea yet
which screw to turn to make the effect come and go.

So I'll just go and add the XSAVEC support alone, as that's actually
something which _is_ beneficial for guests.

Thanks,

        tglx
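
For reference, the "tight loop of xgetbv(1) invocations" mentioned in the
quoted text could be approximated from user space roughly as follows. This is
only a sketch under assumptions, not the measurement code used in the thread:
the helper names (xgetbv1_supported, xgetbv, time_xgetbv1) are made up for
illustration, it counts TSC ticks rather than core clock cycles, and it does
not pin the thread or serialize around the timestamps.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <x86intrin.h>

/*
 * CPUID.1:ECX bit 27 (OSXSAVE) says the OS has enabled XGETBV.
 * CPUID.(EAX=0xD,ECX=1):EAX bit 2 advertises XGETBV with ECX=1.
 */
static int xgetbv1_supported(void)
{
	unsigned int a, b, c, d;

	if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
		return 0;
	if (!__get_cpuid_count(0xd, 1, &a, &b, &c, &d))
		return 0;
	return (a >> 2) & 1;
}

/* Hypothetical user-space wrapper; the kernel has its own xgetbv() helper. */
static inline uint64_t xgetbv(uint32_t index)
{
	uint32_t lo, hi;

	__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(index));
	return ((uint64_t)hi << 32) | lo;
}

/* Returns elapsed TSC ticks for 'iters' calls, or 0 if XGETBV1 is missing. */
static uint64_t time_xgetbv1(unsigned long iters)
{
	volatile uint64_t sink = 0;	/* keeps the results observable */
	uint64_t start, end;
	unsigned long i;

	if (!xgetbv1_supported())
		return 0;

	start = __rdtsc();
	for (i = 0; i < iters; i++)
		sink += xgetbv(1);
	end = __rdtsc();

	return end - start;
}
#else
/* Non-x86 builds: report the feature as unavailable. */
static int xgetbv1_supported(void) { return 0; }
static uint64_t time_xgetbv1(unsigned long iters) { (void)iters; return 0; }
#endif
```

Calling time_xgetbv1(1000000) from a small main() and dividing the result by
the iteration count gives a rough per-call cost. Absolute numbers depend on
the TSC frequency and on system noise, so they are only useful for comparing
machines in the way the quoted text does.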