linux-kernel - Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

Message-ID: <60e5a4d1-df7c-d3bd-2730-e528cd75c351@amd.com>
Date:   Wed, 20 Apr 2022 13:15:14 -0500
From:   Tom Lendacky <thomas.lendacky@....com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        Dave Hansen <dave.hansen@...el.com>,
        LKML <linux-kernel@...r.kernel.org>
Cc:     x86@...nel.org, Andrew Cooper <andrew.cooper3@...rix.com>,
        "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is
 supported

On 4/19/22 16:22, Thomas Gleixner wrote:
> On Tue, Apr 19 2022 at 15:43, Thomas Gleixner wrote:
>> On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
>>> On 4/4/22 05:11, Thomas Gleixner wrote:
>>>> which is suboptimal. Prefetch works better when the access is linear. But
>>>> what's worse is that PKRU can be located in a different page which
>>>> obviously affects dTLB.
>>>
>>> The numbers don't lie, but I'm still surprised by this.  Was this in a
>>> VM that isn't backed with large pages?  task_struct.thread.fpu is
>>> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
>>> pages almost all the time.
>>
>> Hmm. Indeed, that's weird.
>>
>> That was bare metal and I just checked that this was a production config
>> and not some weird debug muck which breaks large pages. I'll look deeper
>> into that.
> 
> I can't find any reasonable explanation. The pages are definitely large
> pages, so yes the dTLB miss count does not make sense, but it's
> consistently faster and it's always the dTLB miss count which makes the
> big difference according to perf.
> 
> For enhanced fun, I ran the lot on an AMD Zen3 machine and with the same
> test case (hackbench -l 10000) repeated 10 times by perf stat this is
> consistently slower than the non-optimized variant. There is at least an
> explanation for that. A tight loop of 1 million xgetbv(1) invocations
> takes 9 million cycles on a SKL-X and 50 million cycles on an AMD Zen3.

I'll take a look into this and see what I find. It might be interesting to
see whether the actual XSAVES is slower or quicker, too, based on the input
mask.

If the performance slowdown shows up in real world benchmarks, we might 
want to consider not using the xgetbv() call on AMD.

Thanks,
Tom

> 
> XSAVE is wonderful, isn't it?
> 
> Thanks,
> 
>          tglx
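
For reference, the tight-loop measurement quoted above (1 million
xgetbv(1) invocations, i.e. roughly 9 cycles per call on SKL-X versus
roughly 50 on Zen3) can be approximated from user space. The following is
only a minimal sketch, not the code used in the thread: the function names
xgetbv1_supported(), xgetbv() and time_xgetbv1() are made up for
illustration, it assumes GCC or Clang on x86, and it counts TSC ticks,
which are reference cycles and not necessarily core clock cycles.

```c
#include <cpuid.h>      /* __get_cpuid, __get_cpuid_count (GCC/Clang) */
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc */

/*
 * XGETBV with ECX=1 is only valid when the OS has enabled XSAVE
 * (CPUID.1:ECX bit 27, OSXSAVE) and the CPU reports support in
 * CPUID.(EAX=0xD,ECX=1):EAX bit 2. Otherwise the instruction faults.
 */
int xgetbv1_supported(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return 0;
	if (!__get_cpuid_count(0xd, 1, &eax, &ebx, &ecx, &edx))
		return 0;
	return (eax >> 2) & 1;
}

/* Read the extended control register selected by index (ECX). */
static inline uint64_t xgetbv(uint32_t index)
{
	uint32_t lo, hi;

	__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(index));
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Hypothetical helper: run xgetbv(1) in a tight loop and return the
 * elapsed TSC ticks, or 0 if XGETBV with ECX=1 is not usable.
 */
uint64_t time_xgetbv1(uint64_t iterations)
{
	uint64_t start, end, i;

	if (!xgetbv1_supported())
		return 0;

	start = __rdtsc();
	for (i = 0; i < iterations; i++)
		(void)xgetbv(1);	/* asm volatile keeps the call in the loop */
	end = __rdtsc();

	return end - start;
}
```

Calling time_xgetbv1(1000000) and dividing the result by the iteration
count gives an approximate per-call cost. The figure includes loop
overhead and depends on how the TSC relates to the core clock on the
machine in question, so treat it as a rough comparison only.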