Message-ID: <CACRpkdbK0dZ87beU8qPSHmRMxTWog-8WbiDQvM-ec06_hAjkoQ@mail.gmail.com>
Date:   Thu, 19 Mar 2020 11:20:28 +0100
From:   Linus Walleij <linus.walleij@...aro.org>
To:     Pablo Neira Ayuso <pablo@...filter.org>
Cc:     netfilter-devel@...r.kernel.org,
        "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Ard Biesheuvel <ardb@...nel.org>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [PATCH 19/29] nft_set_pipapo: Introduce AVX2-based lookup implementation

Hi Pablo,

First: I really like this type of optimization. It's really cool to
see this hardware being put to good use. So for the record,
I'm impressed with your work here.

On Wed, Mar 18, 2020 at 1:40 AM Pablo Neira Ayuso <pablo@...filter.org> wrote:

> +ifdef CONFIG_X86_64
> +ifneq (,$(findstring -DCONFIG_AS_AVX2=1,$(KBUILD_CFLAGS)))
> +nf_tables-objs += nft_set_pipapo_avx2.o
> +endif
> +endif

So this is the first time I've seen x86-specific asm optimizations
in the middle of nftables. That's pretty significant, so I think it
should be pointed out in the commit message.

I have a question around this:

> +#define NFT_PIPAPO_LONGS_PER_M256      (XSAVE_YMM_SIZE / BITS_PER_LONG)
> +
> +/* Load from memory into YMM register with non-temporal hint ("stream load"),
> + * that is, don't fetch lines from memory into the cache. This avoids pushing
> + * precious packet data out of the cache hierarchy, and is appropriate when:
> + *
> + * - loading buckets from lookup tables, as they are not going to be used
> + *   again before packets are entirely classified
> + *
> + * - loading the result bitmap from the previous field, as it's never used
> + *   again
> + */
> +#define NFT_PIPAPO_AVX2_LOAD(reg, loc)                                 \
> +       asm volatile("vmovntdqa %0, %%ymm" #reg : : "m" (loc))

(...)

> +/* Bitwise AND: the staple operation of this algorithm */
> +#define NFT_PIPAPO_AVX2_AND(dst, a, b)                                 \
> +       asm volatile("vpand %ymm" #a ", %ymm" #b ", %ymm" #dst)
> +
> +/* Jump to label if @reg is zero */
> +#define NFT_PIPAPO_AVX2_NOMATCH_GOTO(reg, label)                       \
> +       asm_volatile_goto("vptest %%ymm" #reg ", %%ymm" #reg ";"        \
> +                         "je %l[" #label "]" : : : : label)
> +
> +/* Store 256 bits from YMM register into memory. Contrary to bucket load
> + * operation, we don't bypass the cache here, as stored matching results
> + * are always used shortly after.
> + */
> +#define NFT_PIPAPO_AVX2_STORE(loc, reg)                                        \
> +       asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))
> +
> +/* Zero out a complete YMM register, @reg */
> +#define NFT_PIPAPO_AVX2_ZERO(reg)                                      \
> +       asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)

The usual practice for this kind of asm optimization is to keep it
in the arch.

See for example
arch/x86/include/asm/bitops.h
arch/arm64/include/asm/bitrev.h
which optimize a few bit operations with inline assembly.
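
From memory, the arm64 bitrev helper boils down to something like
this (a paraphrased sketch, not a verbatim copy, so the file in the
tree is authoritative):

/* The generic C bit reversal is replaced by a single ARMv8 rbit
 * instruction when the arch provides one.
 */
static __always_inline __attribute_const__ u32 __arch_bitrev32(u32 x)
{
        __asm__ ("rbit %w0, %w1" : "=r" (x) : "r" (x));
        return x;
}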

The upside is that bitwise operations can be optimized per-arch
depending on available arch instructions.

If other archs have instructions similar to AVX2 which can
slot in and optimize the same code, it would make sense to
move the assembly to the arch and define some new
bitops for loading, storing, zeroing and bitwise AND, possibly even
if restricted to 256-bit bitmaps.
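
Roughly what I have in mind is something like this (just a sketch,
the names are made up and none of these helpers exist today):

#include <linux/bitmap.h>
#include <linux/string.h>

#define BITMAP256_LONGS (256 / BITS_PER_LONG)

/* Portable fallback; an arch could override these with e.g. AVX2
 * behind the same interface.
 */
static inline void bitmap256_and(unsigned long *dst,
                                 const unsigned long *a,
                                 const unsigned long *b)
{
        unsigned int i;

        for (i = 0; i < BITMAP256_LONGS; i++)
                dst[i] = a[i] & b[i];
}

static inline void bitmap256_zero(unsigned long *dst)
{
        memset(dst, 0, BITMAP256_LONGS * sizeof(unsigned long));
}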

We have lib/bitmap.c, and I can see that this library contains
things such as:

int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
                                const unsigned long *bitmap2, unsigned int bits)

This intuitively seems like something that could use
these optimizations. It should be fine to augment the kernel
to handle arch-specific optimizations of bitmap operations,
just like we do for setting bits or finding the first set bit
in a bitmap etc. Today only bitops.h contains arch optimizations,
but if needed surely we can expand on that?

So I would like to see an explanation of why we cannot take
the extra step and make this code entirely abstracted from x86,
so that it will optimize any arch that can do 256-bit bitwise
acceleration such as this.

Yours,
Linus Walleij
