linux-kernel - Re: [PATCH 1/1] LZ4: Port LZ4 1.9.x FAST_DEC

Message-ID: <0eefc0b8-c749-2a2a-2935-bd703389943e@sony.com>
Date:   Thu, 16 May 2019 14:52:41 +0800
From:   "Mao, Chenxi" <Chenxi.Mao@...y.com>
To:     Gao Xiang <gaoxiang25@...wei.com>
CC:     Cyan <yann.collet.73@...il.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Feng, Roy" <Roy.Feng@...y.com>,
        "Xu, Yuanli 2" <Yuanli.Xu@...y.com>,
        "Alm, Robert 2" <Robert.Alm@...y.com>,
        "Takahashi, Masaya A (Sony Mobile)" <Masaya.A.Takahashi@...y.com>,
        Miao Xie <miaoxie@...wei.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: [PATCH 1/1] LZ4: Port LZ4 1.9.x FAST_DEC_LOOP and enable it on
 x86 and ARM64

Hi Xiang:

I checked my patch: the fixes for these issues were already included in my first e-mail.
I am working on PS2 of the patch and will send it later.

Chenxi

On 5/16/19 10:10 AM, Gao Xiang wrote:
> 
> 
> On 2019/5/16 10:07, Mao, Chenxi wrote:
>> Hi Xiang:
>>
>> I checked the commit history.
>> Only the following changes are related to lz4.c:
>> Pull request: 616
>> 4e3accc Fix Dict Size Test in `LZ4_compress_fast_continue()`
>> 535636f Don't Attach Very Small Dictionaries
>> These 2 changes seem to be ONLY dictionary bug fixes based on 1.8.3.
> 
> Have you taken a glance at
> 
> https://github.com/lz4/lz4/commit/5a6d72447ae998b387794df1135023689aa89995   at least?
> 
> and issues,
> 
> https://github.com/lz4/lz4/issues/676
> https://github.com/lz4/lz4/issues/681
> 
> That is what Yann said before.
> 
> Thanks,
> Gao Xiang
> 
>>
>> Based on the current investigation, I think picking only the decompression patches is safe for v1.8.3.
>>
>> If there are any misunderstandings or mistakes, please feel free to let me know.
>>
>> Chenxi
>>
>> On 5/16/19 9:35 AM, Gao Xiang wrote:
>>>
>>>
>>> On 2019/5/16 7:48, Mao, Chenxi wrote:
>>>> Hi Yann and Xiang:
>>>> For this FAST_DEC_LOOP change, I only picked the decompression-related patches onto the current kernel LZ4 implementation (based on 1.8.3).
>>>> Here is my cherry-pick list:
>>>> 2589c44 created LZ4_FAST_DEC_LOOP build macro
>>>> 605d811 enable LZ4_FAST_DEC_LOOP build macro on aarch64/GCC by default
>>>> 5d7d116 decompress_generic: Limit fastpath to x86
>>>> 75fb878 decompress_generic: Add fastpath for small offsets
>>>> faac110 decompress_generic: Unroll loops a bit more
>>>> 1fbaf84 decompress_generic: remove msan write
>>>> 28b8249 decompress_generic: re-add fastpath
>>>> 232f1e2 decompress_generic: drop partial copy check in fast loop
>>>> 59332a3 decompress_generic: Optimize literal copies
>>>> 5dfa7d4 decompress_generic: optimize match copy
>>>> 28356e0 decompress_generic: Add a loop fastpath
>>>> 4da3360 decompress_generic: Refactor variable length fields
>>>>
>>>> Cherry-picking only these changes onto LZ4 1.8.3 should not introduce risks.
>>>
>>> The fact is that ... LZ4_FAST_DEC_LOOP affects almost all decompression APIs.
>>> You should keep LZ4_decompress_generic free of all known issues.
>>>
>>>>
>>>> @Xiang:
>>>> Do you prefer to upgrade kernel lz4 to 1.9.1 directly or only pick some optimization patches?
>>>
>>> I think you should pick all patches which are related to LZ4_decompress_generic()
>>> and LZ4_FAST_DEC_LOOP since known issues of LZ4_FAST_DEC_LOOP should be fixed of course.
>>>
>>> Thanks,
>>> Gao Xiang
>>>
>>>>
>>>> Chenxi
>>>>
>>>>
>>>> On 5/16/19 1:47 AM, Gao Xiang wrote:
>>>>> Hi Yann,
>>>>>
>>>>> On 2019/5/16 1:03, Cyan wrote:
>>>>>> Re-posted,
>>>>>> it seems the previous message was rejected by the linux-kernel server
>>>>>> due to some kind of format limitation (no html).
>>>>>>
>>>>>>
>>>>>> On Wed, May 15, 2019 at 09:56, Cyan <yann.collet.73@...il.com> wrote:
>>>>>>>
>>>>>>> The v1.9.0 version has a bug which makes it read a few bytes out of bound in certain cases.
>>>>>>> This was fixed in v1.9.1.
>>>>>>> Therefore, if you plan upgrading version, skip directly to v1.9.1.
>>>>>>>
>>>>>>> Note that, in v1.9.1, LZ4_decompress_fast() is now deprecated.
>>>>>>> LZ4_decompress_fast() is a security liability, since it's unable to cope with malicious inputs.
>>>>>>> While this is not a problem when the producer is under direct control (local production and consumption, no intermediate storage),
>>>>>>> this entry point seems misused in multiple scenarios where input cannot be decently trusted.
>>>>>>> The deprecation warning is meant to "scare away" such usages.
>>>>>>> It's still possible to intentionally remove the warning, for users who know what they are doing.
>>>>>>>
>>>>>>> Note that, on top of that, the _fast() variant is now slower than LZ4_decompress_safe(), removing its most important positive differentiator.
>>>>>>> That's because LZ4_decompress_fast() is "blind", it doesn't know the input size.
>>>>>>> It just guesses it, from the requested output size, and by relying on end-of-block conditions.
>>>>>>> This effectively limits its capability to copy up to 8 bytes at a time,
>>>>>>> which proves slower than the 32-bytes at a time used in LZ4_decompress_safe().
>>>>>>>
>>>>>>> LZ4_decompress_fast() still has some possible usage :
>>>>>>> it doesn't need the compressed size.
>>>>>>> Therefore, in a scenario where decompressed size is known, but compressed size isn't,
>>>>>>> it cannot be replaced by `LZ4_decompress_safe()`.
>>>>>>> This is pretty rare, but can happen in some cases, erofs for example.
>>>>>>> `LZ4_decompress_safe_partial()` *might* be able to replace it for such scenario,
>>>>>>> but I haven't thoroughly tested it, so can't yet vouch for it.
>>>>>
>>>>> Yes, that is my real concern...
>>>>>
>>>>> In fact, we have shipped erofs with LZ4_decompress_safe_partial() of lz4-1.8.3
>>>>> for our HUAWEI P30/P30 Pro and other HUAWEI EMUI 9.1 phones on the market...
>>>>>
>>>>> It seems lz4-1.8.3 is rather stable, since no stability issues have been reported by our internal
>>>>> beta test or real consumers...
>>>>>
>>>>> I haven't tested lz4-1.9.0 yet since I am working on the erofs paper [1]...
>>>>>
>>>>> But anyway, the patch looks good to me since it has obvious performance gain...
>>>>>
>>>>>>>
>>>>>>> If it doesn't, then it will be necessary to create a new entry point.
>>>>>>> The main difference is that this new entry point will need a bound for the input size, in order to guarantee it never reads beyond input buffer.
>>>>>>> But as said, for the time being, such dedicated entry point doesn't exist.
>>>>>
>>>>> I agree, but current erofs still has to choose LZ4_decompress_fast...
>>>>> I will test these 2 apis later and will report if something is wrong as well.
>>>>>
>>>>> [1] https://www.usenix.org/conference/atc19/presentation/gao
>>>>>     https://www.usenix.org/conference/atc19/technical-sessions
>>>>>
>>>>> Thanks,
>>>>> Gao Xiang
>>>>>
>>>>>
>>>>>>>
>>>>>>> The new FAST_DEC_LOOP setting works well on x86 and x64.
>>>>>>> Recently, it has been extended to gcc+arm64 (`dev` branch).
>>>>>>> The combination of clang+arm64 is less successful though (impacts vary, depending on version and exact hardware), so it's still disabled by default.
>>>>>>> But anyone can test it by setting FAST_DEC_LOOP definition to 1, thus enabling the new decoder loop.
>>>>>>>
>>>>>>>
>>>>>>> Y.
>>>>>>>
>>>>>>> On Tue, May 14, 2019 at 20:28, Mao, Chenxi <Chenxi.Mao@...y.com> wrote:
>>>>>>>>
>>>>>>>> Hi Xiang:
>>>>>>>>
>>>>>>>> Thanks for your reply. I will run a stress test on my device later.
>>>>>>>> I haven't had a chance to test LZ4 with a clang build because of device limitations. I think I can do it later.
>>>>>>>> I guess the clang performance regression is caused by some compiler optimization options.
>>>>>>>> I will double-check it later once I get a device for which I can build the kernel with clang.
>>>>>>>>
>>>>>>>> BTW, FAST_DEC_LOOP uses LZ4_wildCopy8 instead of LZ4_wildCopy;
>>>>>>>> however, based on my tests, the original LZ4_wildCopy performs better.
>>>>>>>> The original LZ4_wildCopy API is still invoked by the FAST_DEC_LOOP APIs.
>>>>>>>> That might be a problem for x86 devices with the current patch.
>>>>>>>>
>>>>>>>> Chenxi
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Gao Xiang [mailto:gaoxiang25@...wei.com]
>>>>>>>> Sent: Wednesday, May 15, 2019 10:21 AM
>>>>>>>> To: Mao, Chenxi <Chenxi.Mao@...y.com>
>>>>>>>> Cc: akpm@...ux-foundation.org; linux-kernel@...r.kernel.org; Feng, Roy <Roy.Feng@...y.com>; Xu, Yuanli 2 <Yuanli.Xu@...y.com>; Alm, Robert 2 <Robert.Alm@...y.com>; Takahashi, Masaya A (Sony Mobile) <Masaya.A.Takahashi@...y.com>; Miao Xie <miaoxie@...wei.com>
>>>>>>>> Subject: Re: [PATCH 1/1] LZ4: Port LZ4 1.9.x FAST_DEC_LOOP and enable it on x86 and ARM64
>>>>>>>>
>>>>>>>> Hi Chenxi,
>>>>>>>>
>>>>>>>> On 2019/5/15 8:43, Chenxi Mao wrote:
>>>>>>>>> FAST_DEC_LOOP was introduced in LZ4 1.9.
>>>>>>>>> This change improves decompression by about 10%, according
>>>>>>>>> to LZ4 benchmark results on x86 devices.
>>>>>>>>> LZ4 with FAST_DEC_LOOP generally gains performance; however,
>>>>>>>>> builds with the clang compiler regress when FAST_DEC_LOOP is enabled.
>>>>>>>>
>>>>>>>> I noticed this optimization, and lz4-1.9.0 [1] was released just a month ago with this big change, so I'm a little worried about its current stability (especially for some rarely used decompression APIs...)
>>>>>>>>
>>>>>>>> Could you Cc more people (e.g. lz4 original author Yann Collet) in order to get some more ideas about this if possible?
>>>>>>>>
>>>>>>>> Anyway, I like this optimization as well since FAST_DEC_LOOP improves decompression speed a lot :)
>>>>>>>>
>>>>>>>> [1] https://github.com/lz4/lz4/releases/tag/v1.9.0
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gao Xiang
>>>>>>>>
>>>>>>>>>
>>>>>>>>> So FAST_DEC_LOOP is only enabled on x86/x86-64, or on ARM64 with a GCC build.
>>>>>>>>>
>>>>>>>>> Here is the test result on ARM64(cortex-A53)
>>>>>>>>>
>>>>>>>>> Benchmark via ZRAM:
>>>>>>>>>
>>>>>>>>> Test case:
>>>>>>>>> fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers \
>>>>>>>>> --buffer_compress_percentage=75 \
>>>>>>>>> --scramble_buffers=1 --direct=1 --loops=100 --numjobs=8 \
>>>>>>>>> --filename=/dev/block/zram0 --name=seq-write --rw=write --stonewall \
>>>>>>>>> --name=seq-read --rw=read --stonewall --name=seq-readwrite --rw=rw \
>>>>>>>>> --stonewall --name=rand-readwrite --rw=randrw --stonewall
>>>>>>>>>
>>>>>>>>> Patched:
>>>>>>>>>    READ: bw=7077MiB/s (7421MB/s)
>>>>>>>>> Vanilla:
>>>>>>>>>    READ: bw=5134MiB/s (5384MB/s)
>>>>>>>>>
>>>>>>>>> Reference:
>>>>>>>>> 1. https://github.com/lz4/lz4/pull/645
>>>>>>>>> 2. https://github.com/lz4/lz4/pull/707
>>>>>>>>>
>>>>>>>>> Signed-off-by: chenxi.mao <chenxi.mao@...y.com>
>>>>>>>>> ---
>>>>>>>>>  lib/lz4/lz4_decompress.c | 425 +++++++++++++++++++++++++++++++++------
>>>>>>>>>  1 file changed, 361 insertions(+), 64 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
>>>>>>>>> index 0c9d3ad17e0f..a4c87e32b3c0 100644
>>>>>>>>> --- a/lib/lz4/lz4_decompress.c
>>>>>>>>> +++ b/lib/lz4/lz4_decompress.c
>>>>>>>>> @@ -50,6 +50,131 @@
>>>>>>>>>  #define assert(condition) ((void)0)
>>>>>>>>>  #endif
>>>>>>>>>
>>>>>>>>> +#ifndef LZ4_FAST_DEC_LOOP
>>>>>>>>> +#if defined(__i386__) || defined(__x86_64__)
>>>>>>>>> +#define LZ4_FAST_DEC_LOOP 1
>>>>>>>>> +#elif defined(__aarch64__) && !defined(__clang__)
>>>>>>>>> +     /* On aarch64, we disable this optimization for clang because on certain
>>>>>>>>> +      * mobile chipsets and clang, it reduces performance. For more information
>>>>>>>>> +      * refer to https://github.com/lz4/lz4/pull/707. */
>>>>>>>>> +#define LZ4_FAST_DEC_LOOP 1
>>>>>>>>> +#else
>>>>>>>>> +#define LZ4_FAST_DEC_LOOP 0
>>>>>>>>> +#endif
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> +static const unsigned inc32table[8] = { 0, 1, 2, 1, 0, 4, 4, 4 };
>>>>>>>>> +static const int dec64table[8] = { 0, 0, 0, -1, -4, 1, 2, 3 };
>>>>>>>>> +
>>>>>>>>> +#if LZ4_FAST_DEC_LOOP
>>>>>>>>> +#define FASTLOOP_SAFE_DISTANCE 64
>>>>>>>>> +FORCE_INLINE void
>>>>>>>>> +LZ4_memcpy_using_offset_base(BYTE * dstPtr, const BYTE * srcPtr, BYTE * dstEnd,
>>>>>>>>> +                          const size_t offset)
>>>>>>>>> +{
>>>>>>>>> +     if (offset < 8) {
>>>>>>>>> +             dstPtr[0] = srcPtr[0];
>>>>>>>>> +
>>>>>>>>> +             dstPtr[1] = srcPtr[1];
>>>>>>>>> +             dstPtr[2] = srcPtr[2];
>>>>>>>>> +             dstPtr[3] = srcPtr[3];
>>>>>>>>> +             srcPtr += inc32table[offset];
>>>>>>>>> +             memcpy(dstPtr + 4, srcPtr, 4);
>>>>>>>>> +             srcPtr -= dec64table[offset];
>>>>>>>>> +             dstPtr += 8;
>>>>>>>>> +     } else {
>>>>>>>>> +             memcpy(dstPtr, srcPtr, 8);
>>>>>>>>> +             dstPtr += 8;
>>>>>>>>> +             srcPtr += 8;
>>>>>>>>> +     }
>>>>>>>>> +
>>>>>>>>> +     LZ4_wildCopy(dstPtr, srcPtr, dstEnd);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/* customized variant of memcpy, which can overwrite up to 32 bytes beyond dstEnd
>>>>>>>>> + * this version copies two times 16 bytes (instead of one time 32 bytes)
>>>>>>>>> + * because it must be compatible with offsets >= 16. */
>>>>>>>>> +FORCE_INLINE void LZ4_wildCopy32(void *dstPtr, const void *srcPtr, void *dstEnd)
>>>>>>>>> +{
>>>>>>>>> +     BYTE *d = (BYTE *) dstPtr;
>>>>>>>>> +     const BYTE *s = (const BYTE *)srcPtr;
>>>>>>>>> +     BYTE *const e = (BYTE *) dstEnd;
>>>>>>>>> +
>>>>>>>>> +     do {
>>>>>>>>> +             memcpy(d, s, 16);
>>>>>>>>> +             memcpy(d + 16, s + 16, 16);
>>>>>>>>> +             d += 32;
>>>>>>>>> +             s += 32;
>>>>>>>>> +     } while (d < e);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +FORCE_INLINE void
>>>>>>>>> +LZ4_memcpy_using_offset(BYTE *dstPtr, const BYTE *srcPtr, BYTE *dstEnd,
>>>>>>>>> +                     const size_t offset)
>>>>>>>>> +{
>>>>>>>>> +     BYTE v[8];
>>>>>>>>> +     switch (offset) {
>>>>>>>>> +
>>>>>>>>> +     case 1:
>>>>>>>>> +             memset(v, *srcPtr, 8);
>>>>>>>>> +             goto copy_loop;
>>>>>>>>> +     case 2:
>>>>>>>>> +             memcpy(v, srcPtr, 2);
>>>>>>>>> +             memcpy(&v[2], srcPtr, 2);
>>>>>>>>> +             memcpy(&v[4], &v[0], 4);
>>>>>>>>> +             goto copy_loop;
>>>>>>>>> +     case 4:
>>>>>>>>> +             memcpy(v, srcPtr, 4);
>>>>>>>>> +             memcpy(&v[4], srcPtr, 4);
>>>>>>>>> +             goto copy_loop;
>>>>>>>>> +     default:
>>>>>>>>> +             LZ4_memcpy_using_offset_base(dstPtr, srcPtr, dstEnd, offset);
>>>>>>>>> +             return;
>>>>>>>>> +     }
>>>>>>>>> +
>>>>>>>>> +      copy_loop:
>>>>>>>>> +     memcpy(dstPtr, v, 8);
>>>>>>>>> +     dstPtr += 8;
>>>>>>>>> +     while (dstPtr < dstEnd) {
>>>>>>>>> +             memcpy(dstPtr, v, 8);
>>>>>>>>> +             dstPtr += 8;
>>>>>>>>> +     }
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/* Read the variable-length literal or match length.
>>>>>>>>> + *
>>>>>>>>> + * ip - pointer to use as input.
>>>>>>>>> + * lencheck - end ip.  Return an error if ip advances >= lencheck.
>>>>>>>>> + * loop_check - check ip >= lencheck in body of loop.  Returns loop_error if so.
>>>>>>>>> + * initial_check - check ip >= lencheck before start of loop.  Returns initial_error if so.
>>>>>>>>> + * error (output) - error code.  Should be set to 0 before call.
>>>>>>>>> + */
>>>>>>>>> +typedef enum { loop_error = -2, initial_error = -1, ok = 0 } variable_length_error;
>>>>>>>>> +FORCE_INLINE unsigned read_variable_length(const BYTE **ip,
>>>>>>>>> +                                        const BYTE *lencheck,
>>>>>>>>> +                                        int loop_check, int initial_check,
>>>>>>>>> +                                        variable_length_error *error)
>>>>>>>>> +{
>>>>>>>>> +     unsigned length = 0;
>>>>>>>>> +     unsigned s;
>>>>>>>>> +     if (initial_check && unlikely((*ip) >= lencheck)) {     /* overflow detection */
>>>>>>>>> +             *error = initial_error;
>>>>>>>>> +             return length;
>>>>>>>>> +     }
>>>>>>>>> +     do {
>>>>>>>>> +             s = **ip;
>>>>>>>>> +             (*ip)++;
>>>>>>>>> +             length += s;
>>>>>>>>> +             if (loop_check && unlikely((*ip) >= lencheck)) {        /* overflow detection */
>>>>>>>>> +                     *error = loop_error;
>>>>>>>>> +                     return length;
>>>>>>>>> +             }
>>>>>>>>> +     } while (s == 255);
>>>>>>>>> +
>>>>>>>>> +     return length;
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>>  /*
>>>>>>>>>   * LZ4_decompress_generic() :
>>>>>>>>>   * This generic decompression function covers all use cases.
>>>>>>>>> @@ -80,25 +205,28 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>        const size_t dictSize
>>>>>>>>>        )
>>>>>>>>>  {
>>>>>>>>> -     const BYTE *ip = (const BYTE *) src;
>>>>>>>>> -     const BYTE * const iend = ip + srcSize;
>>>>>>>>> +     const BYTE *ip = (const BYTE *)src;
>>>>>>>>> +     const BYTE *const iend = ip + srcSize;
>>>>>>>>>
>>>>>>>>>       BYTE *op = (BYTE *) dst;
>>>>>>>>> -     BYTE * const oend = op + outputSize;
>>>>>>>>> +     BYTE *const oend = op + outputSize;
>>>>>>>>>       BYTE *cpy;
>>>>>>>>>
>>>>>>>>> -     const BYTE * const dictEnd = (const BYTE *)dictStart + dictSize;
>>>>>>>>> -     static const unsigned int inc32table[8] = {0, 1, 2, 1, 0, 4, 4, 4};
>>>>>>>>> -     static const int dec64table[8] = {0, 0, 0, -1, -4, 1, 2, 3};
>>>>>>>>> +     const BYTE *const dictEnd = (const BYTE *)dictStart + dictSize;
>>>>>>>>>
>>>>>>>>>       const int safeDecode = (endOnInput == endOnInputSize);
>>>>>>>>>       const int checkOffset = ((safeDecode) && (dictSize < (int)(64 * KB)));
>>>>>>>>>
>>>>>>>>>       /* Set up the "end" pointers for the shortcut. */
>>>>>>>>>       const BYTE *const shortiend = iend -
>>>>>>>>> -             (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/;
>>>>>>>>> +         (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/;
>>>>>>>>>       const BYTE *const shortoend = oend -
>>>>>>>>> -             (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/;
>>>>>>>>> +         (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/;
>>>>>>>>> +
>>>>>>>>> +     const BYTE *match;
>>>>>>>>> +     size_t offset;
>>>>>>>>> +     unsigned int token;
>>>>>>>>> +     size_t length;
>>>>>>>>>
>>>>>>>>>       DEBUGLOG(5, "%s (srcSize:%i, dstSize:%i)", __func__,
>>>>>>>>>                srcSize, outputSize);
>>>>>>>>> @@ -117,15 +245,194 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>       if ((endOnInput) && unlikely(srcSize == 0))
>>>>>>>>>               return -1;
>>>>>>>>>
>>>>>>>>> -     /* Main Loop : decode sequences */
>>>>>>>>> +#if LZ4_FAST_DEC_LOOP
>>>>>>>>> +     if ((oend - op) < FASTLOOP_SAFE_DISTANCE) {
>>>>>>>>> +             DEBUGLOG(6, "skip fast decode loop");
>>>>>>>>> +             goto safe_decode;
>>>>>>>>> +     }
>>>>>>>>> +
>>>>>>>>> +     /* Fast loop : decode sequences as long as output < iend-FASTLOOP_SAFE_DISTANCE */
>>>>>>>>>       while (1) {
>>>>>>>>> -             size_t length;
>>>>>>>>> -             const BYTE *match;
>>>>>>>>> -             size_t offset;
>>>>>>>>> +             /* Main fastloop assertion: We can always wildcopy FASTLOOP_SAFE_DISTANCE */
>>>>>>>>> +             assert(oend - op >= FASTLOOP_SAFE_DISTANCE);
>>>>>>>>> +             if (endOnInput) {
>>>>>>>>> +                     assert(ip < iend);
>>>>>>>>> +             }
>>>>>>>>> +             token = *ip++;
>>>>>>>>> +             length = token >> ML_BITS;      /* literal length */
>>>>>>>>> +
>>>>>>>>> +             assert(!endOnInput || ip <= iend);      /* ip < iend before the increment */
>>>>>>>>> +
>>>>>>>>> +             /* decode literal length */
>>>>>>>>> +             if (length == RUN_MASK) {
>>>>>>>>> +                     variable_length_error error = ok;
>>>>>>>>> +                     length +=
>>>>>>>>> +                         read_variable_length(&ip, iend - RUN_MASK,
>>>>>>>>> +                                              endOnInput, endOnInput,
>>>>>>>>> +                                              &error);
>>>>>>>>> +                     if (error == initial_error) {
>>>>>>>>> +                             goto _output_error;
>>>>>>>>> +                     }
>>>>>>>>> +                     if ((safeDecode)
>>>>>>>>> +                         && unlikely((uptrval) (op) + length <
>>>>>>>>> +                                     (uptrval) (op))) {
>>>>>>>>> +                             goto _output_error;
>>>>>>>>> +                     }       /* overflow detection */
>>>>>>>>> +                     if ((safeDecode)
>>>>>>>>> +                         && unlikely((uptrval) (ip) + length <
>>>>>>>>> +                                     (uptrval) (ip))) {
>>>>>>>>> +                             goto _output_error;
>>>>>>>>> +                     }
>>>>>>>>> +
>>>>>>>>> +                     /* overflow detection */
>>>>>>>>> +                     /* copy literals */
>>>>>>>>> +                     cpy = op + length;
>>>>>>>>> +                     LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH);
>>>>>>>>> +                     if (endOnInput) {       /* LZ4_decompress_safe() */
>>>>>>>>> +                             if ((cpy > oend - 32)
>>>>>>>>> +                                 || (ip + length > iend - 32)) {
>>>>>>>>> +                                     goto safe_literal_copy;
>>>>>>>>> +                             }
>>>>>>>>> +                             LZ4_wildCopy32(op, ip, cpy);
>>>>>>>>> +                     } else {        /* LZ4_decompress_fast() */
>>>>>>>>> +                             if (cpy > oend - 8) {
>>>>>>>>> +                                     goto safe_literal_copy;
>>>>>>>>> +                             }
>>>>>>>>> +                             LZ4_wildCopy(op, ip, cpy);
>>>>>>>>> +                             /* LZ4_decompress_fast() cannot copy more than 8 bytes at a time */
>>>>>>>>> +                             /* it doesn't know input length, and only relies on end-of-block */
>>>>>>>>> +                             /* properties */
>>>>>>>>> +                     }
>>>>>>>>> +                     ip += length;
>>>>>>>>> +                     op = cpy;
>>>>>>>>> +             } else {
>>>>>>>>> +                     cpy = op + length;
>>>>>>>>> +                     if (endOnInput) {       /* LZ4_decompress_safe() */
>>>>>>>>> +                             DEBUGLOG(7,
>>>>>>>>> +                                      "copy %u bytes in a 16-bytes stripe",
>>>>>>>>> +                                      (unsigned)length);
>>>>>>>>> +                             /* We don't need to check oend */
>>>>>>>>> +                             /* since we check it once for each loop below */
>>>>>>>>> +                             if (ip > iend - (16 + 1)) {     /*max lit + offset + nextToken */
>>>>>>>>> +                                     goto safe_literal_copy;
>>>>>>>>> +                             }
>>>>>>>>> +                             /* Literals can only be 14, but hope compilers optimize */
>>>>>>>>> +                             /*if we copy by a register size */
>>>>>>>>> +                             memcpy(op, ip, 16);
>>>>>>>>> +                     } else {
>>>>>>>>> +                             /* LZ4_decompress_fast() cannot copy more than 8 bytes at a time */
>>>>>>>>> +                             /* it doesn't know input length, and relies on end-of-block */
>>>>>>>>> +                             /* properties */
>>>>>>>>> +                             memcpy(op, ip, 8);
>>>>>>>>> +                             if (length > 8) {
>>>>>>>>> +                                     memcpy(op + 8, ip + 8, 8);
>>>>>>>>> +                             }
>>>>>>>>> +                     }
>>>>>>>>> +                     ip += length;
>>>>>>>>> +                     op = cpy;
>>>>>>>>> +             }
>>>>>>>>> +
>>>>>>>>> +             /* get offset */
>>>>>>>>> +             offset = LZ4_readLE16(ip);
>>>>>>>>> +             ip += 2;        /* end-of-block condition violated */
>>>>>>>>> +             match = op - offset;
>>>>>>>>> +
>>>>>>>>> +             /* get matchlength */
>>>>>>>>> +             length = token & ML_MASK;
>>>>>>>>>
>>>>>>>>> -             /* get literal length */
>>>>>>>>> -             unsigned int const token = *ip++;
>>>>>>>>> -             length = token>>ML_BITS;
>>>>>>>>> +             if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) {
>>>>>>>>> +                     goto _output_error;
>>>>>>>>> +             }
>>>>>>>>> +             /* Error : offset outside buffers */
>>>>>>>>> +             if (length == ML_MASK) {
>>>>>>>>> +                     variable_length_error error = ok;
>>>>>>>>> +                     length +=
>>>>>>>>> +                         read_variable_length(&ip, iend - LASTLITERALS + 1,
>>>>>>>>> +                                              endOnInput, 0, &error);
>>>>>>>>> +                     if (error != ok) {
>>>>>>>>> +                             goto _output_error;
>>>>>>>>> +                     }
>>>>>>>>> +                     if ((safeDecode)
>>>>>>>>> +                         && unlikely((uptrval) (op) + length < (uptrval) op)) {
>>>>>>>>> +                             goto _output_error;
>>>>>>>>> +                     }       /* overflow detection */
>>>>>>>>> +                     length += MINMATCH;
>>>>>>>>> +                     if (op + length >= oend - FASTLOOP_SAFE_DISTANCE) {
>>>>>>>>> +                             goto safe_match_copy;
>>>>>>>>> +                     }
>>>>>>>>> +             } else {
>>>>>>>>> +                     length += MINMATCH;
>>>>>>>>> +                     if (op + length >= oend - FASTLOOP_SAFE_DISTANCE) {
>>>>>>>>> +                             goto safe_match_copy;
>>>>>>>>> +                     }
>>>>>>>>> +
>>>>>>>>> +                     /* Fastpath check: Avoids a branch in LZ4_wildCopy32 if true */
>>>>>>>>> +                     if (!(dict == usingExtDict) || (match >= lowPrefix)) {
>>>>>>>>> +                             if (offset >= 8) {
>>>>>>>>> +                                     memcpy(op, match, 8);
>>>>>>>>> +                                     memcpy(op + 8, match + 8, 8);
>>>>>>>>> +                                     memcpy(op + 16, match + 16, 2);
>>>>>>>>> +                                     op += length;
>>>>>>>>> +                                     continue;
>>>>>>>>> +                             }
>>>>>>>>> +                     }
>>>>>>>>> +             }
>>>>>>>>> +
>>>>>>>>> +             /* match starting within external dictionary */
>>>>>>>>> +             if ((dict == usingExtDict) && (match < lowPrefix)) {
>>>>>>>>> +                     if (unlikely(op + length > oend - LASTLITERALS)) {
>>>>>>>>> +                             if (partialDecoding) {
>>>>>>>>> +                                     /* reach end of buffer */
>>>>>>>>> +                                     length =
>>>>>>>>> +                                         min(length, (size_t) (oend - op));
>>>>>>>>> +                             } else {
>>>>>>>>> +                                     /* end-of-block condition violated */
>>>>>>>>> +                                     goto _output_error;
>>>>>>>>> +                             }
>>>>>>>>> +                     }
>>>>>>>>> +
>>>>>>>>> +                     if (length <= (size_t) (lowPrefix - match)) {
>>>>>>>>> +                             /* match fits entirely within external dictionary : just copy */
>>>>>>>>> +                             memmove(op, dictEnd - (lowPrefix - match), length);
>>>>>>>>> +                             op += length;
>>>>>>>>> +                     } else {
>>>>>>>>> +                             /* match stretches into both external dict and current block */
>>>>>>>>> +                             size_t const copySize =
>>>>>>>>> +                                 (size_t) (lowPrefix - match);
>>>>>>>>> +                             size_t const restSize = length - copySize;
>>>>>>>>> +                             memcpy(op, dictEnd - copySize, copySize);
>>>>>>>>> +                             op += copySize;
>>>>>>>>> +                             if (restSize > (size_t) (op - lowPrefix)) {     /* overlap copy */
>>>>>>>>> +                                     BYTE *const endOfMatch = op + restSize;
>>>>>>>>> +                                     const BYTE *copyFrom = lowPrefix;
>>>>>>>>> +                                     while (op < endOfMatch) {
>>>>>>>>> +                                             *op++ = *copyFrom++;
>>>>>>>>> +                                     }
>>>>>>>>> +                             } else {
>>>>>>>>> +                                     memcpy(op, lowPrefix, restSize);
>>>>>>>>> +                                     op += restSize;
>>>>>>>>> +                             }
>>>>>>>>> +                     }
>>>>>>>>> +                     continue;
>>>>>>>>> +             }
>>>>>>>>> +
>>>>>>>>> +             /* copy match within block */
>>>>>>>>> +             cpy = op + length;
>>>>>>>>> +
>>>>>>>>> +             assert((op <= oend) && (oend - op >= 32));
>>>>>>>>> +             if (unlikely(offset < 16)) {
>>>>>>>>> +                     LZ4_memcpy_using_offset(op, match, cpy, offset);
>>>>>>>>> +             } else {
>>>>>>>>> +                     LZ4_wildCopy32(op, match, cpy);
>>>>>>>>> +             }
>>>>>>>>> +
>>>>>>>>> +             op = cpy;       /* wildcopy correction */
>>>>>>>>> +     }
>>>>>>>>> +      safe_decode:
>>>>>>>>> +#endif
>>>>>>>>> +     /* Main Loop : decode sequences */
>>>>>>>>> +     while (1) {
>>>>>>>>> +             length = token >> ML_BITS;
>>>>>>>>>
>>>>>>>>>               /* ip < iend before the increment */
>>>>>>>>>               assert(!endOnInput || ip <= iend);
>>>>>>>>> @@ -143,26 +450,27 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>                * combined check for both stages).
>>>>>>>>>                */
>>>>>>>>>               if ((endOnInput ? length != RUN_MASK : length <= 8)
>>>>>>>>> -                /*
>>>>>>>>> -                 * strictly "less than" on input, to re-enter
>>>>>>>>> -                 * the loop with at least one byte
>>>>>>>>> -                 */
>>>>>>>>> -                && likely((endOnInput ? ip < shortiend : 1) &
>>>>>>>>> -                          (op <= shortoend))) {
>>>>>>>>> +                 /*
>>>>>>>>> +                  * strictly "less than" on input, to re-enter
>>>>>>>>> +                  * the loop with at least one byte
>>>>>>>>> +                  */
>>>>>>>>> +                 && likely((endOnInput ? ip < shortiend : 1) &
>>>>>>>>> +                           (op <= shortoend))) {
>>>>>>>>>                       /* Copy the literals */
>>>>>>>>>                       memcpy(op, ip, endOnInput ? 16 : 8);
>>>>>>>>> -                     op += length; ip += length;
>>>>>>>>> +                     op += length;
>>>>>>>>> +                     ip += length;
>>>>>>>>>
>>>>>>>>>                       /*
>>>>>>>>>                        * The second stage:
>>>>>>>>>                        * prepare for match copying, decode full info.
>>>>>>>>>                        * If it doesn't work out, the info won't be wasted.
>>>>>>>>>                        */
>>>>>>>>> -                     length = token & ML_MASK; /* match length */
>>>>>>>>> +                     length = token & ML_MASK;       /* match length */
>>>>>>>>>                       offset = LZ4_readLE16(ip);
>>>>>>>>>                       ip += 2;
>>>>>>>>>                       match = op - offset;
>>>>>>>>> -                     assert(match <= op); /* check overflow */
>>>>>>>>> +                     assert(match <= op);    /* check overflow */
>>>>>>>>>
>>>>>>>>>                       /* Do not deal with overlapping matches. */
>>>>>>>>>                       if ((length != ML_MASK) &&
>>>>>>>>> @@ -187,28 +495,24 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>
>>>>>>>>>               /* decode literal length */
>>>>>>>>>               if (length == RUN_MASK) {
>>>>>>>>> -                     unsigned int s;
>>>>>>>>>
>>>>>>>>> -                     if (unlikely(endOnInput ? ip >= iend - RUN_MASK : 0)) {
>>>>>>>>> -                             /* overflow detection */
>>>>>>>>> +                     variable_length_error error = ok;
>>>>>>>>> +                     length +=
>>>>>>>>> +                         read_variable_length(&ip, iend - RUN_MASK,
>>>>>>>>> +                                              endOnInput, endOnInput,
>>>>>>>>> +                                              &error);
>>>>>>>>> +                     if (error == initial_error)
>>>>>>>>>                               goto _output_error;
>>>>>>>>> -                     }
>>>>>>>>> -                     do {
>>>>>>>>> -                             s = *ip++;
>>>>>>>>> -                             length += s;
>>>>>>>>> -                     } while (likely(endOnInput
>>>>>>>>> -                             ? ip < iend - RUN_MASK
>>>>>>>>> -                             : 1) & (s == 255));
>>>>>>>>>
>>>>>>>>>                       if ((safeDecode)
>>>>>>>>> -                         && unlikely((uptrval)(op) +
>>>>>>>>> -                                     length < (uptrval)(op))) {
>>>>>>>>> +                         && unlikely((uptrval) (op) +
>>>>>>>>> +                                     length < (uptrval) (op))) {
>>>>>>>>>                               /* overflow detection */
>>>>>>>>>                               goto _output_error;
>>>>>>>>>                       }
>>>>>>>>>                       if ((safeDecode)
>>>>>>>>> -                         && unlikely((uptrval)(ip) +
>>>>>>>>> -                                     length < (uptrval)(ip))) {
>>>>>>>>> +                         && unlikely((uptrval) (ip) +
>>>>>>>>> +                                     length < (uptrval) (ip))) {
>>>>>>>>>                               /* overflow detection */
>>>>>>>>>                               goto _output_error;
>>>>>>>>>                       }
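
For readers following the hunk above: the open-coded do/while loops are replaced by calls to read_variable_length(), whose body is not visible in this part of the quote. Below is a minimal user-space sketch of what such a helper could look like, based only on how it is called here. The enum values, the parameter names (lencheck, loop_check, initial_check) and the decode_extra_length() wrapper are assumptions for illustration, not the patch's actual code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* Names taken from the quoted hunk; the numeric values are assumed. */
typedef enum { ok = 0, initial_error = 1, loop_error = 2 } variable_length_error;

/*
 * Sum length bytes starting at *ip until a byte other than 255 is read.
 * lencheck is the first address that must not be reached.
 * initial_check: test the bound once before reading anything.
 * loop_check:    test the bound after every byte consumed.
 */
static unsigned read_variable_length(const BYTE **ip, const BYTE *lencheck,
				     int loop_check, int initial_check,
				     variable_length_error *error)
{
	unsigned length = 0;
	unsigned s;

	assert(ip && *ip && error);

	if (initial_check && *ip >= lencheck) {
		*error = initial_error;	/* nothing readable at all */
		return length;
	}
	do {
		s = **ip;
		(*ip)++;
		length += s;
		if (loop_check && *ip >= lencheck) {
			*error = loop_error;	/* ran into the bound mid-run */
			return length;
		}
	} while (s == 255);

	return length;
}

/* Hypothetical wrapper: decode one length run from buf with both checks on. */
static unsigned decode_extra_length(const BYTE *buf, size_t size,
				    variable_length_error *error)
{
	const BYTE *ip = buf;

	*error = ok;
	return read_variable_length(&ip, buf + size, 1, 1, error);
}
```

Under this sketch the literal-length call site only tests for initial_error, so a loop_error there would have to be caught by the later ip + length bounds check, while the match-length call site rejects any value other than ok.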
>>>>>>>>> @@ -216,11 +520,15 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>
>>>>>>>>>               /* copy literals */
>>>>>>>>>               cpy = op + length;
>>>>>>>>> +#if LZ4_FAST_DEC_LOOP
>>>>>>>>> +           safe_literal_copy:
>>>>>>>>> +#endif
>>>>>>>>>               LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH);
>>>>>>>>>
>>>>>>>>>               if (((endOnInput) && ((cpy > oend - MFLIMIT)
>>>>>>>>> -                     || (ip + length > iend - (2 + 1 + LASTLITERALS))))
>>>>>>>>> -                     || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
>>>>>>>>> +                                   || (ip + length >
>>>>>>>>> +                                       iend - (2 + 1 + LASTLITERALS))))
>>>>>>>>> +                 || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
>>>>>>>>>                       if (partialDecoding) {
>>>>>>>>>                               if (cpy > oend) {
>>>>>>>>>                                       /*
>>>>>>>>> @@ -231,7 +539,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>                                       length = oend - op;
>>>>>>>>>                               }
>>>>>>>>>                               if ((endOnInput)
>>>>>>>>> -                                     && (ip + length > iend)) {
>>>>>>>>> +                                 && (ip + length > iend)) {
>>>>>>>>>                                       /*
>>>>>>>>>                                        * Error :
>>>>>>>>>                                        * read attempt beyond
>>>>>>>>> @@ -241,7 +549,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>                               }
>>>>>>>>>                       } else {
>>>>>>>>>                               if ((!endOnInput)
>>>>>>>>> -                                     && (cpy != oend)) {
>>>>>>>>> +                                 && (cpy != oend)) {
>>>>>>>>>                                       /*
>>>>>>>>>                                        * Error :
>>>>>>>>>                                        * block decoding must
>>>>>>>>> @@ -250,7 +558,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>                                       goto _output_error;
>>>>>>>>>                               }
>>>>>>>>>                               if ((endOnInput)
>>>>>>>>> -                                     && ((ip + length != iend)
>>>>>>>>> +                                 && ((ip + length != iend)
>>>>>>>>>                                       || (cpy > oend))) {
>>>>>>>>>                                       /*
>>>>>>>>>                                        * Error :
>>>>>>>>> @@ -288,29 +596,14 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>                       goto _output_error;
>>>>>>>>>               }
>>>>>>>>>
>>>>>>>>> -             /* costs ~1%; silence an msan warning when offset == 0 */
>>>>>>>>> -             /*
>>>>>>>>> -              * note : when partialDecoding, there is no guarantee that
>>>>>>>>> -              * at least 4 bytes remain available in output buffer
>>>>>>>>> -              */
>>>>>>>>> -             if (!partialDecoding) {
>>>>>>>>> -                     assert(oend > op);
>>>>>>>>> -                     assert(oend - op >= 4);
>>>>>>>>> -
>>>>>>>>> -                     LZ4_write32(op, (U32)offset);
>>>>>>>>> -             }
>>>>>>>>> -
>>>>>>>>>               if (length == ML_MASK) {
>>>>>>>>> -                     unsigned int s;
>>>>>>>>> -
>>>>>>>>> -                     do {
>>>>>>>>> -                             s = *ip++;
>>>>>>>>> -
>>>>>>>>> -                             if ((endOnInput) && (ip > iend - LASTLITERALS))
>>>>>>>>> -                                     goto _output_error;
>>>>>>>>>
>>>>>>>>> -                             length += s;
>>>>>>>>> -                     } while (s == 255);
>>>>>>>>> +                     variable_length_error error = ok;
>>>>>>>>> +                     length +=
>>>>>>>>> +                         read_variable_length(&ip, iend - LASTLITERALS + 1,
>>>>>>>>> +                                              endOnInput, 0, &error);
>>>>>>>>> +                     if (error != ok)
>>>>>>>>> +                             goto _output_error;
>>>>>>>>>
>>>>>>>>>                       if ((safeDecode)
>>>>>>>>>                               && unlikely(
>>>>>>>>> @@ -322,6 +615,10 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>>>>>>>>
>>>>>>>>>               length += MINMATCH;
>>>>>>>>>
>>>>>>>>> +#if LZ4_FAST_DEC_LOOP
>>>>>>>>> +safe_match_copy:
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>>               /* match starting within external dictionary */
>>>>>>>>>               if ((dict == usingExtDict) && (match < lowPrefix)) {
>>>>>>>>>                       if (unlikely(op + length > oend - LASTLITERALS)) {
>>>>>>>>>