Message-ID: <CAHmME9oejs+USiGJX2P0TgvnR6XRzV1HYZPrq=c-CjGzq59=NQ@mail.gmail.com>
Date: Mon, 7 Nov 2016 19:08:22 +0100
From: "Jason A. Donenfeld" <Jason@...c4.com>
To: Eric Biggers <ebiggers@...gle.com>
Cc: David Miller <davem@...emloft.net>,
Herbert Xu <herbert@...dor.apana.org.au>,
linux-crypto@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Martin Willi <martin@...ongswan.org>,
WireGuard mailing list <wireguard@...ts.zx2c4.com>,
René van Dorst <opensource@...rst.com>
Subject: Re: [PATCH] poly1305: generic C can be faster on chips with slow
unaligned access

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@...gle.com> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks(). In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed. Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call. Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.
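
If I'm reading that right, the fix boils down to routing every word load
through the unaligned helper. A minimal sketch of what I have in mind --
the function and variable names here are mine for illustration, not the
actual crypto/poly1305_generic.c code:

#include <linux/types.h>
#include <asm/unaligned.h>	/* get_unaligned_le32() */

/* Load one 16-byte Poly1305 block from a source pointer of unknown
 * alignment. get_unaligned_le32() is safe for any alignment and turns
 * into a plain load on arches where unaligned access is cheap. */
static void poly1305_load_block(u32 out[4], const u8 *src)
{
	out[0] = get_unaligned_le32(src + 0);
	out[1] = get_unaligned_le32(src + 4);
	out[2] = get_unaligned_le32(src + 8);
	out[3] = get_unaligned_le32(src + 12);
}
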
Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();

skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);

struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);

blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);

/* MAC as many whole blocks as each mapped chunk provides. */
while (walk.nbytes >= BLOCK_SIZE) {
	size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
	poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
	blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}

/* Then whatever sub-block tail is left at the end. */
if (walk.nbytes) {
	poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
	blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that, in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?
If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking your suggestions into account.

By the way -- off-list benchmarks sent to me concluded that using the
unaligned load helpers, as David suggested, is just as fast as the
hand-rolled bit magic in v1.
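
For anyone who wants to see the comparison concretely, the "bit magic"
is the usual byte-at-a-time assembly of a little-endian word. The sketch
below shows the general technique only -- it is not the literal v1 code:

#include <linux/types.h>
#include <asm/unaligned.h>

/* Byte-at-a-time little-endian load: safe for any alignment on every
 * arch, at the cost of four loads plus shifts and ORs in the source. */
static inline u32 le32_bytewise(const u8 *p)
{
	return (u32)p[0] | ((u32)p[1] << 8) |
	       ((u32)p[2] << 16) | ((u32)p[3] << 24);
}

/* vs. the helper David suggested, which lets the arch pick the best
 * strategy: a plain 32-bit load where unaligned access is cheap, and a
 * bytewise fallback elsewhere. */
static inline u32 le32_helper(const u8 *p)
{
	return get_unaligned_le32(p);
}

Per those benchmarks, the two perform the same, so there's no reason to
keep the open-coded version around.
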
Regards,
Jason