Message-ID: <CALx6S368Hxc_VpL5=Rq5Ybpe-JzQr-3X9hUOoNU6yvY_1ao7PA@mail.gmail.com>
Date: Thu, 21 Jan 2016 08:38:38 -0800
From: Tom Herbert <tom@...bertland.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
Or Gerlitz <gerlitz.or@...il.com>,
David Miller <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Linux Netdev List <netdev@...r.kernel.org>,
Alexander Duyck <alexander.duyck@...il.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Daniel Borkmann <borkmann@...earbox.net>,
Marek Majkowski <marek@...udflare.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Florian Westphal <fw@...len.de>,
Paolo Abeni <pabeni@...hat.com>,
John Fastabend <john.r.fastabend@...el.com>,
Amir Vadai <amirva@...il.com>
Subject: Re: Optimizing instruction-cache, more packets at each stage
On Thu, Jan 21, 2016 at 4:23 AM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
> On Wed, 20 Jan 2016 15:27:38 -0800
> Tom Herbert <tom@...bertland.com> wrote:
>
>> weaknesses of Toeplitz we talked about recently and the fact that
>> Jenkins is really fast to compute, I am starting to think maybe we
>> should always do a software hash and not rely on HW for it...
>
> Please don't enforce a software hash. You are proposing a hash
> computation per packet, which costs on the order of 50-100 nanosec (?).
> And it runs on data which is cache cold (even with DDIO, you still
> take the L3 cache hit).
>
I clock the Jenkins hash computation itself at ~6 nsecs (not counting
the cache miss), but your point is taken.
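
(For the curious, here is a standalone sketch of the lookup3-style
final mix that the kernel's jhash performs over a flow's addresses and
ports; the seeding below is simplified relative to
include/linux/jhash.h, and the flow values are made up:)

#include <stdint.h>
#include <stdio.h>

static inline uint32_t rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

#define JHASH_INITVAL 0xdeadbeef

/*
 * Final mixing of three words (saddr, daddr, ports), following the
 * __jhash_final() step of Jenkins' lookup3 as used by the kernel.
 */
static uint32_t flow_hash_3words(uint32_t a, uint32_t b, uint32_t c)
{
	a += JHASH_INITVAL;
	b += JHASH_INITVAL;
	c += JHASH_INITVAL;

	c ^= b; c -= rol32(b, 14);
	a ^= c; a -= rol32(c, 11);
	b ^= a; b -= rol32(a, 25);
	c ^= b; c -= rol32(b, 16);
	a ^= c; a -= rol32(c, 4);
	b ^= a; b -= rol32(a, 14);
	c ^= b; c -= rol32(b, 24);

	return c;
}

int main(void)
{
	/* hypothetical IPv4 flow: 10.0.0.1:4660 -> 10.0.0.2:80 */
	uint32_t saddr = 0x0a000001, daddr = 0x0a000002;
	uint32_t ports = (0x1234u << 16) | 80;

	printf("flow hash: 0x%08x\n", flow_hash_3words(saddr, daddr, ports));
	return 0;
}
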
> Consider the increase in network hardware speeds.
>
> Worst-case (pkt size 64 bytes) time between packets:
> * 10 Gbit/s -> 67.2 nanosec
> * 40 Gbit/s -> 16.8 nanosec
> * 100 Gbit/s -> 6.7 nanosec
>
> Adding such a per-packet cost is not going to fly.
>
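(Checking those numbers: a minimum-size Ethernet frame occupies 64
bytes plus 8 bytes of preamble/SFD plus a 12-byte inter-frame gap on
the wire, 84 bytes in all, which is where the figures above come from.
A quick standalone sketch:)

#include <stdio.h>

int main(void)
{
	/* 64B frame + 8B preamble/SFD + 12B inter-frame gap = 84B on wire */
	const double wire_bits = 84 * 8;	/* 672 bits per min-size frame */
	const double rates_gbps[] = { 10, 40, 100 };

	/* dividing bits by Gbit/s yields nanoseconds directly */
	for (int i = 0; i < 3; i++)
		printf("%3.0f Gbit/s -> %.1f nanosec between packets\n",
		       rates_gbps[i], wire_bits / rates_gbps[i]);
	return 0;
}
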
Sure, but the receive path is parallelized. Improving parallelism has
consistently been shown to have much more impact than attempting to
optimize for cache misses. The primary goal is not to drive 100Gbps
with 64-byte packets from a single CPU. It is one benchmark among many
we should look at to measure the efficiency of the data path, but I've
yet to see any real workload that requires it...
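
(The parallelism comes from that same hash: RSS-style steering indexes
an indirection table with the low bits of the flow hash, fanning flows
out across RX queues and CPUs. A toy sketch; the table size and queue
count are made up:)

#include <stdint.h>
#include <stdio.h>

#define INDIR_SZ 128	/* toy indirection table */

int main(void)
{
	uint8_t indir[INDIR_SZ];
	int nqueues = 8;

	/* spread queues round-robin across the table (ethtool -X default) */
	for (int i = 0; i < INDIR_SZ; i++)
		indir[i] = i % nqueues;

	uint32_t hash = 0xdeadbeef;	/* flow hash from the NIC or software */
	printf("hash 0x%08x -> rx queue %u\n",
	       hash, indir[hash & (INDIR_SZ - 1)]);
	return 0;
}
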
Regardless of anything, we need to load packet headers into the CPU
cache to do protocol processing. I'm not sure I see how deferring that
as long as possible helps, except in cases where the packet crosses
CPU cache boundaries and we can eliminate cache misses completely (not
just move them around from one function to another).
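
(To illustrate "moving them around": a minimal sketch of the kind of
header prefetch a driver's poll loop can issue, so the compulsory miss
overlaps with useful work instead of stalling header parsing. The
descriptor layout and names are made up, not any real driver's:)

#include <stdio.h>

/* illustrative receive descriptor; real drivers have their own layouts */
struct rx_desc {
	unsigned char *pkt_data;
};

/* stand-in for protocol processing: just touch the header bytes */
static unsigned int process_packet(const unsigned char *hdr)
{
	unsigned int sum = 0;

	for (int i = 0; i < 64; i++)
		sum += hdr[i];
	return sum;
}

/* poll loop: prefetch packet i+1's headers while packet i is processed */
static unsigned int rx_poll(struct rx_desc *ring, int budget)
{
	unsigned int acc = 0;

	for (int i = 0; i < budget; i++) {
		if (i + 1 < budget)
			__builtin_prefetch(ring[i + 1].pkt_data);
		acc += process_packet(ring[i].pkt_data);
	}
	return acc;
}

int main(void)
{
	static unsigned char bufs[8][64];
	struct rx_desc ring[8];

	for (int i = 0; i < 8; i++)
		ring[i].pkt_data = bufs[i];

	printf("checksum: %u\n", rx_poll(ring, 8));
	return 0;
}
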
Tom
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer