Message-ID: <CALx6S36_FWeWg5+ryGEpjVtMu37P=doNzmdz5np6dxzOVNBWWQ@mail.gmail.com>
Date: Thu, 28 Jan 2016 08:43:06 -0800
From: Tom Herbert <tom@...bertland.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
John Fastabend <john.fastabend@...il.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
David Miller <davem@...emloft.net>,
Eric Dumazet <eric.dumazet@...il.com>,
Or Gerlitz <gerlitz.or@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Alexander Duyck <alexander.duyck@...il.com>,
Daniel Borkmann <borkmann@...earbox.net>,
Marek Majkowski <marek@...udflare.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Florian Westphal <fw@...len.de>,
Paolo Abeni <pabeni@...hat.com>,
John Fastabend <john.r.fastabend@...el.com>,
Amir Vadai <amirva@...il.com>,
Daniel Borkmann <daniel@...earbox.net>,
Vladislav Yasevich <vyasevich@...il.com>
Subject: Re: Bypass at packet-page level (Was: Optimizing instruction-cache,
more packets at each stage)
On Thu, Jan 28, 2016 at 1:52 AM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
> On Wed, 27 Jan 2016 13:56:03 -0800
> Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
>
>> On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote:
>> > Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns
>> >
>> > To get around the cache-miss in eth_type_trans(), I created an
>> > "icache-loop" in mlx5e_poll_rx_cq() that pulls all RX-ring packets
>> > "out" before calling eth_type_trans(), reducing its cost to 2.45%.
>> >
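To make the batching idea concrete, here is a minimal sketch (not the
actual mlx5 patch; sketch_get_completed_skb(), the rx_skbs[] staging
array, and its size are invented for illustration):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* invented helper: pops the next completed skb off the ring, or NULL */
static struct sk_buff *sketch_get_completed_skb(struct mlx5e_cq *cq);

static int sketch_poll_rx_cq(struct mlx5e_cq *cq, struct net_device *dev,
                             struct napi_struct *napi, int budget)
{
        struct sk_buff *rx_skbs[64];    /* staging array, size arbitrary */
        int i, n = 0;

        /* Pass 1: pull completed packets off the RX ring and start
         * prefetching their headers; do not touch pkt-data yet.
         */
        while (n < budget && n < 64) {
                struct sk_buff *skb = sketch_get_completed_skb(cq);

                if (!skb)
                        break;
                prefetch(skb->data);
                rx_skbs[n++] = skb;
        }

        /* Pass 2: by now the early prefetches have landed, so the
         * header-touching code runs as one tight, icache-hot loop.
         */
        for (i = 0; i < n; i++) {
                struct sk_buff *skb = rx_skbs[i];

                skb->protocol = eth_type_trans(skb, dev);
                napi_gro_receive(napi, skb);
        }
        return n;
}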
>> > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API,
>> > and also tuned SLUB (with slub_nomerge slub_min_objects=128) to get
>> > bigger slab pages and thus bigger bulk opportunities.
>> >
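The slab bulk API referred to here is the mainline
kmem_cache_alloc_bulk()/kmem_cache_free_bulk() pair. A rough sketch of
a driver-local SKB-head refill using it (SKB_BULK and the wrapper are
illustrative, not the actual patch):

#include <linux/skbuff.h>
#include <linux/slab.h>

#define SKB_BULK 32

/* Refill a driver-local array of sk_buff heads in one bulk call,
 * amortizing the SLUB slowpath over the whole batch. The bulk alloc
 * is all-or-nothing: it returns the full count on success, 0 on
 * failure.
 */
static int sketch_refill_skb_heads(void *skbs[SKB_BULK])
{
        return kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC,
                                     SKB_BULK, skbs);
}

The slub_min_objects=128 boot parameter forces larger (higher-order)
slab pages, so each bulk call can grab more objects from a single page,
and slub_nomerge keeps the cache from being merged with others.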
>> > This helped a lot; I can now drop 12 Mpps (12,088,767 pps => 82.7 ns).
>>
>> great stuff. I think such a batching loop will reduce the cost of
>> eth_type_trans() for all use cases.
>> It's only unfortunate that it would need to be implemented in every
>> driver, but there are only a handful that people care about in
>> high-performance setups, so I think it's worth getting this patch in
>> for mlx5; the other drivers will catch up.
>
> I'm still undecided on how long we should delay the first touch of
> pkt-data, which happens when calling eth_type_trans(). Should it stay
> in the driver or not?
>
> In the extreme case, to optimize for RPS sending packets to remote
> CPUs, we would delay calling eth_type_trans() as long as possible
> (a rough sketch follows the list):
>
> 1. In the driver, only start prefetching pkt-data into the L2/L3 cache
> 2. The stack calls get_rps_cpu(), assuming skb_get_hash() can use the HW hash
> 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
> 4. On the remote CPU, in process_backlog, call eth_type_trans() on sd->input_pkt_queue
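A sketch of steps 1 and 4 above (helper names invented; step 2 assumes
the driver has already stored the HW hash via skb_set_hash()):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/prefetch.h>

/* Step 1: driver side; no pkt-data access beyond starting a prefetch,
 * so the shared L3 is warm by the time the remote CPU reads the data.
 */
static void sketch_rx_no_touch(struct sk_buff *skb)
{
        prefetch(skb->data);
        netif_rx(skb);          /* get_rps_cpu() steers via the HW hash */
}

/* Step 4: remote CPU, draining sd->input_pkt_queue in
 * process_backlog(); the first real touch of pkt-data happens here.
 */
static void sketch_backlog_first_touch(struct sk_buff *skb,
                                       struct net_device *dev)
{
        skb->protocol = eth_type_trans(skb, dev);
}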
>
There is also GRO to consider, which might still be better to do
before packet steering? One thing that could be exploited is that we
probably don't need to look at packet data for GRO until we get a
second packet that matches the hash.
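To make that concrete, a hypothetical sketch (gro_hash_seen() and
gro_hash_record() are invented; real GRO matching keys off much more
than the hash):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* invented helpers over the per-NAPI gro_list, keyed by skb->hash */
static bool gro_hash_seen(struct napi_struct *napi, u32 hash);
static void gro_hash_record(struct napi_struct *napi, struct sk_buff *skb);

static void sketch_gro_lazy(struct napi_struct *napi, struct sk_buff *skb)
{
        if (!gro_hash_seen(napi, skb->hash)) {
                /* First packet with this hash in the batch: no need
                 * to inspect headers yet, just remember it.
                 */
                gro_hash_record(napi, skb);
                return;
        }
        /* A second packet matched the hash: now it is worth touching
         * the headers and attempting a real GRO merge.
         */
        napi_gro_receive(napi, skb);
}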
>
> On the other hand, if the HW descriptor can provide skb->protocol, and
> we can lazily evaluate skb->pkt_type, then it is okay to keep that
> responsibility in the driver (as the call to eth_type_trans() basically
> disappears).
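A sketch of what that could look like (struct sketch_cqe and its
ethertype field are invented; real completion descriptors differ):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* invented completion-descriptor layout */
struct sketch_cqe {
        __be16 ethertype;       /* parsed by HW, network byte order */
        /* ... */
};

/* Fill in what eth_type_trans() would have computed, without ever
 * touching pkt-data. pkt_type is left as PACKET_HOST; a lazy-eval
 * scheme would only compare the dst MAC if something actually reads
 * skb->pkt_type.
 */
static void sketch_rx_set_proto(struct sk_buff *skb,
                                struct sketch_cqe *cqe,
                                struct net_device *dev)
{
        skb->dev = dev;
        skb->protocol = cqe->ethertype;
}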
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer