Message-ID: <VI1PR08MB3662267AF55E8D0D9DE57858EEDA0@VI1PR08MB3662.eurprd08.prod.outlook.com>
Date: Wed, 21 Nov 2018 17:04:57 +0000
From: Matt Sealey <Matt.Sealey@....com>
To: "Markus F.X.J. Oberhumer" <markus@...rhumer.com>,
Dave Rodgman <dave.rodgman@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: nd <nd@....com>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>,
"nitingupta910@...il.com" <nitingupta910@...il.com>,
"rpurdie@...nedhand.com" <rpurdie@...nedhand.com>,
"minchan@...nel.org" <minchan@...nel.org>,
"sergey.senozhatsky.work@...il.com"
<sergey.senozhatsky.work@...il.com>,
Sonny Rao <sonnyrao@...gle.com>
Subject: RE: [PATCH 0/6] lib/lzo: performance improvements
Hi Markus,
> Hi Dave,
>
> thanks for your patch set. Just some initial comments:
>
>
> I think the three patches
>
> [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
> [PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
> [PATCH 4/6] lib/lzo: fast 8-byte copy on arm64
>
> should be applied in any case - could you please make an extra
> pull request out of these and try to get them merged as fast
> as possible. Thanks.
>
>
> The first patch
>
> [PATCH 1/6] lib/lzo: clean-up by introducing COPY16
>
> does not really affect the resulting code at the moment, but please
> note that in one case the actual copy unit is not allowed to
> be greater 8 bytes (which might be implied by the name "COPY16").
> So this needs more work like an extra COPY16_BY_8() macro.
I agree in principle but not in practice ;)
The original intent of those patches was to avail ourselves of
known architectural features, but also to give the compiler a
helping hand in places where LZO isn't supposed to know much about
the underlying processor implementation. The end result, I think,
is that giving an architecture a discrete 16-byte movement is more
beneficial than dictating how the copy happens underneath it.
What came out of that is a difference in codegen between compress
and decompress, literally because of a difference in coding style:
decompress:

    COPY8(op, ip);
    op += 8;
    ip += 8;
    COPY8(op, ip);
    op += 8;
    ip += 8;

compress:

    COPY8(op, ip);
    COPY8(op+8, ip+8);
    ip += 16;
    op += 16;
Compilers do very well with the latter, and terribly with the former.
Once I aligned them to the same sequence, I noticed the only time
LZO ever does 8-byte copies is in 16-byte chunks, and this aligned
with my expectation that on arm64 an LDP/STP with writeback gets it
done in two instructions. I want to go to there!
COPY16 being defined as COPY8,COPY8 is low-hanging fruit that
improves codegen on arm64 slightly (because it cleans up the
decompression code) and performance even more slightly as a
result.
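For reference, here is a userspace sketch of COPY16-as-two-COPY8s.
The kernel macros go through the unaligned accessors rather than
memcpy; plain memcpy is just the portable stand-in here, and the
pointer-cast plumbing is mine, not lifted from the patch:

```c
#include <string.h>

/* Sketch: COPY8 as an 8-byte unaligned-safe copy, COPY16 as the
 * back-to-back pair described above.  Compilers are free to merge
 * the two 8-byte copies into one 16-byte move where legal. */
#define COPY8(dst, src)  memcpy((dst), (src), 8)
#define COPY16(dst, src) do {                                   \
        COPY8((dst), (src));                                    \
        COPY8((unsigned char *)(dst) + 8,                       \
              (const unsigned char *)(src) + 8);                \
    } while (0)
```

With both call sites written this way, compress and decompress feed
the compiler the same pattern, which is what fixed the codegen gap.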
Just for elucidation, it goes from (decompress):

    add x0, x0, 16   // op, op,
    ldr w2, [x1]     //, MEM[base: ii_17, offset: 0B]
    add x1, x1, 16   // ii, ii,
    cmp x0, x3       // op, _490
    str w2, [x0, -16] // _89, MEM[base: op_8, offset: 0B]
    ldr w2, [x1, -12] //, MEM[base: ii_17, offset: 4B]
    str w2, [x0, -12] // _105, MEM[base: op_8, offset: 4B]
    ldr w2, [x1, -8]  //, MEM[base: ii_17, offset: 8B]
    str w2, [x0, -8]  // _145, MEM[base: op_8, offset: 8B]
    ldr w2, [x1, -4]  //, MEM[base: ii_17, offset: 12B]
    str w2, [x0, -4]  // _156, MEM[base: op_8, offset: 12B]
To this (same code as compress):
    add x0, x0, 16   // op, op,
    ldr x2, [x1]     // _153, MEM[base: ii_17, offset: 0B]
    add x1, x1, 16   // ii, ii,
    cmp x0, x3       // op, _418
    str x2, [x0, -16] // _153, MEM[base: op_8, offset: 0B]
    ldr x2, [x1, -8]  // _181, MEM[base: ii_17, offset: 8B]
    str x2, [x0, -8]  // _181, MEM[base: op_8, offset: 8B]
But it's going to want to be this way:
    ldp x2, x3, [x1], 16 // _182, MEM[base: ii_17, offset: 0B]
    stp x2, x3, [x0], 16 // _182, MEM[base: op_8, offset: 0B]
Which is a COPY16_BY_8, but only because of the architecture's rules
on element sizes. COPY16_BY_8 doesn't fit the idiom we're trying
to create.
For arm64 the easiest way to get LDP/STP with writeback just
happens to be allowing compilers to lift memcpy(o,i,16), which
is always alignment-safe; but this doesn't hold true on arm32 or
elsewhere, and we've not done enough testing on other arches for
that concept to bear fruit, yet.
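To illustrate the shape of that, here is a hypothetical copy16()
plus a literal-run loop in the decompressor's style. The names
copy16() and copy_literal_run() are mine for illustration, and the
overcopy (round-up-to-16) behaviour assumes slack past the end of
the output buffer, as LZO's fast paths do:

```c
#include <string.h>

/* A constant-size 16-byte memcpy is alignment-safe everywhere, and
 * on arm64 compilers typically lower it to a single ldp/stp pair. */
static inline void copy16(unsigned char *op, const unsigned char *ip)
{
    memcpy(op, ip, 16);
}

/* Copy 'len' literal bytes in 16-byte chunks, rounding up.
 * Hypothetical helper: assumes the caller guarantees at least
 * len rounded up to 16 bytes of writable space at 'op'. */
static void copy_literal_run(unsigned char *op,
                             const unsigned char *ip, size_t len)
{
    const unsigned char *end = op + len;
    do {
        copy16(op, ip);
        op += 16;
        ip += 16;
    } while (op < end);
}
```

On arm32 or other targets the same source may lower to word-sized
loads and stores instead, which is why the memcpy idiom needs
per-arch benchmarking before it replaces the explicit COPY8 pairs.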
I'm confused by the 'one case the actual copy unit is not allowed to
be greater 8 bytes' statement: it is not evidenced by the loop
structure. COPY8 is only ever used in pairs in the kernel -
can you point out where this might not be the case in the future
(and why that case couldn't still use COPY8)?
Ta!
Matt Sealey <matt.sealey@....com>
Arm Partner Enablement Group