Message-ID: <992dd863-0143-38c9-6f6d-7cb1bb6fd15d@arm.com>
Date: Wed, 21 Nov 2018 16:55:01 +0000
From: Dave Rodgman <dave.rodgman@....com>
To: "Markus F.X.J. Oberhumer" <markus@...rhumer.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: nd <nd@....com>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>,
Matt Sealey <Matt.Sealey@....com>,
"nitingupta910@...il.com" <nitingupta910@...il.com>,
"rpurdie@...nedhand.com" <rpurdie@...nedhand.com>,
"minchan@...nel.org" <minchan@...nel.org>,
"sergey.senozhatsky.work@...il.com"
<sergey.senozhatsky.work@...il.com>,
Sonny Rao <sonnyrao@...gle.com>
Subject: Re: [PATCH 0/6] lib/lzo: performance improvements
On 21/11/2018 1:44 pm, Markus F.X.J. Oberhumer wrote:
> I think the three patches
>
> [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
> [PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
> [PATCH 4/6] lib/lzo: fast 8-byte copy on arm64
>
> should be applied in any case - could you please make an extra
> pull request out of these and try to get them merged as fast
> as possible. Thanks.
The three patches you mention give around 10-25% performance uplift
(mostly on compression). I'll look at generating a pull request for these.
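For anyone following along, the CTZ patches help because a 64-bit count-trailing-zeros instruction lets the match finder locate the first differing byte within an 8-byte word in one step, instead of a byte-at-a-time loop. A minimal sketch of the general technique (the helper name is mine, not the kernel's, and this is illustrative rather than the actual patch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Find the length of the common prefix of a and b.  Compare 8 bytes
 * at a time and, on a mismatch, use the count-trailing-zeros builtin
 * to locate the first differing byte (valid on little-endian targets).
 * Illustrative sketch only -- not the kernel's actual code. */
static size_t match_len_ctz(const uint8_t *a, const uint8_t *b, size_t n)
{
	size_t i = 0;

	while (i + 8 <= n) {
		uint64_t va, vb;

		memcpy(&va, a + i, 8);
		memcpy(&vb, b + i, 8);
		if (va != vb)
			/* bit index of first difference / 8 = byte index */
			return i + (__builtin_ctzll(va ^ vb) >> 3);
		i += 8;
	}
	while (i < n && a[i] == b[i])
		i++;
	return i;
}
```

On targets without a cheap CTZ instruction, the byte-at-a-time tail loop is essentially what the whole comparison falls back to.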
> [PATCH 1/6] lib/lzo: clean-up by introducing COPY16
>
> does not really affect the resulting code at the moment, but please
> note that in one case the actual copy unit is not allowed to
> be greater than 8 bytes (which might be implied by the name "COPY16").
> So this needs more work, like an extra COPY16_BY_8() macro.
I'll leave Matt to comment on this one, as it's his patch.
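For reference, my reading of the COPY16_BY_8() suggestion is a 16-byte copy performed as two separate 8-byte units, so no single copy exceeds 8 bytes. The macro name is from your mail; the body below is my guess at the intent, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy 16 bytes as two 8-byte copies, so the unit of any single copy
 * never exceeds 8 bytes.  Sketch only -- the macro name comes from
 * Markus's mail, the body is a guess at the intent. */
#define COPY16_BY_8(dst, src) do {                                    \
		memcpy((dst), (src), 8);                              \
		memcpy((uint8_t *)(dst) + 8,                          \
		       (const uint8_t *)(src) + 8, 8);                \
	} while (0)
```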
> As for your "lzo-rle" improvements I'll have a look.
>
> Please note that the first byte value 17 is actually valid when using
> external dictionaries ("lzo1x_decompress_dict_safe()" in the LZO source
> code). While this functionality is not present in the Linux kernel at
> the moment it might be worrisome wrt future enhancements.
I wasn't aware of the external dictionary concern. Do you have any
suggestions for an alternative instruction that we could use instead
that would not be used by the existing lzo algorithm at the start of the
stream? If there isn't anything suitable, then we'd have to choose
between backwards compatibility (not a huge issue, if lzo-rle were to be
kept as a separate algorithm from lzo, but certainly nice to have) vs.
allowing for the possibility of introducing external dictionaries in future.
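To make the constraint concrete: a naive dispatch on the first byte would look something like the toy sketch below, and as you point out, it breaks down once external dictionaries are allowed, because 17 then becomes a legal first byte for plain lzo streams too (the constant name is invented for this sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy sketch of the format-dispatch problem: lzo-rle could tag its
 * streams with a distinctive first byte (17), but with external
 * dictionaries a plain LZO stream may also legally start with 17, so
 * this check alone cannot tell the two formats apart.  Constant name
 * invented here for illustration. */
#define LZO_RLE_MARKER 17

static int looks_like_rle_stream(const uint8_t *in, size_t len)
{
	return len > 0 && in[0] == LZO_RLE_MARKER;
}
```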
> Finally I'm wondering if your chart comparison just covers the "lzo-rle"
> patch or also includes the ARM64 improvements - I cannot understand where a
> 20% speedup should come from if you have 0% zeros.
The chart does indeed include the other improvements, so this is where
the performance uplift on the left hand side of the chart (i.e., random
data) comes from.
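To illustrate how the weighting works: one way to express a round-trip throughput with compression weighted 2.25x (per the ratio mentioned in the cover letter) is as a weighted harmonic mean. This sketch is an illustration of the idea, not the exact script used to produce the chart:

```c
#include <assert.h>

/* Weighted harmonic mean of compression and decompression throughput,
 * with compression weighted 2.25x to match the quoted ratio of
 * compression to decompression work under zram.  The exact formula
 * behind the chart isn't given in the thread; treat this as an
 * assumption for illustration. */
static double weighted_round_trip(double comp_tput, double decomp_tput)
{
	const double wc = 2.25, wd = 1.0;

	return (wc + wd) / (wc / comp_tput + wd / decomp_tput);
}
```

With equal compression and decompression throughput the result reduces to that common value; otherwise it lies between the two, pulled towards the compression figure.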
Thanks for taking a look at this.
Dave
>
> Cheers,
> Markus
>
>
>
> On 2018-11-21 13:06, Dave Rodgman wrote:
>> This patch series introduces performance improvements for lzo.
>>
>> The improvements fall into two categories: general Arm-specific optimisations
>> (e.g., more efficient memory access); and the introduction of a special case
>> for handling runs of zeros (which is a common case for zram) using run-length
>> encoding.
>>
>> The introduction of RLE modifies the bitstream such that it can't be decoded
>> by old versions of lzo (the new lzo-rle can correctly decode old bitstreams).
>> To avoid possible issues where data is persisted on disk (e.g., squashfs), the
>> final patch in this series separates lzo-rle into a separate algorithm
>> alongside lzo, so that the new lzo-rle is (by default) only used for zram and
>> must be explicitly selected for other use-cases. This final patch could be
>> omitted if the consensus is that we'd rather avoid proliferation of lzo
>> variants.
>>
>> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
>> with many zero runs shows higher improvement). Under real-world testing with
>> zram, time spent in (de)compression during swapping is reduced by around 27%.
>> The graph below shows the weighted round-trip throughput of lzo, lz4 and
>> lzo-rle, for randomly generated 4k chunks of data with varying levels of
>> entropy. (To calculate weighted round-trip throughput, compression performance
>> is emphasised to reflect the fact that zram does around 2.25x more compression
>> than decompression. Results and overall trends are fairly similar for the
>> unweighted case.)
>>
>> https://drive.google.com/file/d/18GU4pgRVCLNN7wXxynz-8R2ygrY2IdyE/view
>>
>> Contributors:
>> Dave Rodgman <dave.rodgman@....com>
>> Matt Sealey <matt.sealey@....com>
>>
>