[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <39f5588fe3394c88bcf139c820af2033@AcuMS.aculab.com>
Date: Thu, 11 Apr 2019 09:06:37 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Paolo Bonzini' <pbonzini@...hat.com>,
Sean Christopherson <sean.j.christopherson@...el.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: RE: [PATCH] KVM: x86: optimize check for valid PAT value
From: Paolo Bonzini
> Sent: 10 April 2019 18:09
> On 10/04/19 16:57, Sean Christopherson wrote:
> > On Wed, Apr 10, 2019 at 12:55:53PM +0000, David Laight wrote:
> >> From: Paolo Bonzini
> >>> Sent: 10 April 2019 10:55
> >>>
> >>> This check will soon be done on every nested vmentry and vmexit,
> >>> "parallelize" it using bitwise operations.
...
> >> How about:
> >> /*
> >> * Each byte must be 0, 1, 4, 5, 6 or 7.
> >> * Convert 001x to 011x then 100x so 2 and 3 fail the test.
> >> */
> >> data |= (data ^ 0x0404040404040404ULL)) + 0x0202020202020202ULL;
> >> if (data & 0xF8F8F8F8F8F8F8F8ULL)
> >> return false;
...
> > Really fancy:
> > 0x0000000000048447 <+247>: movabs $0x404040404040404,%rcx
> > 0x0000000000048451 <+257>: movabs $0x202020202020202,%rax
> > 0x000000000004845b <+267>: xor %rdx,%rcx
> > 0x000000000004845e <+270>: add %rax,%rcx
> > 0x0000000000048461 <+273>: movabs $0xf8f8f8f8f8f8f8f8,%rax
> > 0x000000000004846b <+283>: or %rcx,%rdx
> > 0x000000000004846e <+286>: test %rax,%rdx
> > 0x0000000000048471 <+289>: sete %al
> > 0x0000000000048474 <+292>: retq
>
> Yeah, the three constants are expensive. Too bad the really fancy
> version sums twos and xors fours; if it were the opposite, it could have
> used lea and then I would have chosen that one just for the coolness factor.
The constants are just big, if the code is inlined they are probably
otherwise free (they won't change the length of the dependency chain).
The twos can be generated from the fours without increasing the length
of the dependency chain and saving some bytes.
It may be possible to generate shorter code that executes just as
fast by generating a single constant and deriving the others from it.
- generate 4s - needed first
- shift right 2 to get 1s (in parallel with the xor)
- use lea to get 6s (in parallel with an lea to do the add)
- invert the 1s to get FEs (also in parallel with the add)
- xor the FEs with the 6s to get F8s (in parallel with the or)
- and/test for the result
I'm not going to try to persuade gcc to go that.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Powered by blists - more mailing lists