Message-ID: <20180326134256.2cc7cf13@redhat.com>
Date: Mon, 26 Mar 2018 13:42:56 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: Netdev <netdev@...r.kernel.org>,
Björn Töpel <bjorn.topel@...el.com>,
"Karlsson, Magnus" <magnus.karlsson@...el.com>,
Eugenia Emantayev <eugenia@...lanox.com>,
Jason Wang <jasowang@...hat.com>,
John Fastabend <john.fastabend@...il.com>,
Eran Ben Elisha <eranbe@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>,
Gal Pressman <galp@...lanox.com>,
Daniel Borkmann <borkmann@...earbox.net>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Tariq Toukan <tariqt@...lanox.com>, brouer@...hat.com
Subject: Re: [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame
for return API
On Fri, 23 Mar 2018 10:29:29 -0700 Alexander Duyck <alexander.duyck@...il.com> wrote:
> On Fri, Mar 23, 2018 at 5:19 AM, Jesper Dangaard Brouer
> <brouer@...hat.com> wrote:
> > Changing the xdp_return_frame() API to take a struct xdp_frame as
> > argument seems like a natural choice. But there are some subtle
> > performance details here that need extra care; handling them this way
> > is a deliberate choice.
> >
> > De-referencing the xdp_frame on a remote CPU during DMA-TX completion
> > results in the cache-line changing to "Shared" state. Later, when the
> > page is reused for RX, this xdp_frame cache-line is written, which
> > changes the state to "Modified".
> >
> > This situation already happens (naturally) for virtio_net, tun and
> > cpumap, as the xdp_frame pointer is the queued object. In tun and
> > cpumap, the ptr_ring is used for efficiently transferring cache-lines
> > (with pointers) between CPUs. Thus, the only option is to
> > de-reference the xdp_frame.
> >
> > Only the ixgbe driver had an optimization that could avoid
> > de-referencing the xdp_frame. The driver already has a TX-ring queue,
> > which (in case of remote DMA-TX completion) has to be transferred
> > between CPUs anyhow. In this data area, we stored a struct
> > xdp_mem_info and a data pointer, which allowed us to avoid
> > de-referencing the xdp_frame.
> >
> > To compensate for this, a prefetchw is used to tell the cache
> > coherency protocol about our access pattern. My benchmarks show that
> > this prefetchw is enough to compensate for the change in the ixgbe
> > driver.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@...hat.com>
>
> I'm really not a fan of this patch. It seems like it is adding a ton
> of overhead for one case that is going to penalize all of the other
> use cases for XDP. Just the extra prefetch is going to have a
> significant impact on things like the XDP_DROP case.
>
[...]
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index ff069597fccf..e6e9b28ecfba 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
> >
> > /* free the skb */
> > if (ring_is_xdp(tx_ring))
> > - xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
> > + xdp_return_frame(tx_buffer->xdpf);
> > else
> > napi_consume_skb(tx_buffer->skb, napi_budget);
> >
> > @@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
> > xdp.data_hard_start = xdp.data -
> > ixgbe_rx_offset(rx_ring);
> > xdp.data_end = xdp.data + size;
> > + prefetchw(xdp.data_hard_start); /* xdp_frame write */
> >
> > skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
> > }
>
> Do we really need to be prefetching yet another cache line? This is
> going to hurt general performance since for most cases you are now
> prefetching a cache line that will never be used.
My intuition was also that this would hurt performance. But I've done
many tests, and they all show no performance regression from this change!
XDP_DROP testing with xdp1 on ixgbe:
Baseline: 14,703,344 pkt/s
Patched: 14,711,574 pkt/s
For people reproducing, notice that it requires tuning on the generator
side (with pktgen) to reach these extreme speeds.
As you might have noticed, the prefetchw is only active when an XDP
program is loaded. Thus, I created a benchmark[1] that loads an XDP
program which always returns XDP_PASS (selected via --action).
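For reference, a stripped-down sketch of such a pass-through XDP program
could look like the following. This is not the exact code from [1]; the
section name, header, and program name are just what a typical
libbpf-style build would use:

    /* Minimal sketch of an XDP program that returns XDP_PASS, so packets
     * continue into the normal network stack (not the code from [1]).
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_pass_prog(struct xdp_md *ctx)
    {
            void *data     = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;

            /* bounds check required by the verifier before touching data */
            if (data + 1 > data_end)
                    return XDP_ABORTED;

            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";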
Then, I drop packets in the iptables raw table:
Baseline: 5,783,509 pps
Patched: 5,789,195 pps
Then I unload the netfilter modules and let packets reach the UDP
"UdpNoPorts" stage (no socket listening on the port):
Baseline: 3,832,956 pps
Patched: 3,855,470 pps
Then, I add a udp_sink (--recvmsg) to allow packets to travel deeper into
the network stack (and force udp_sink to run on another CPU):
Baseline: 2,278,855 pps
Patched: 2,270,373 pps
These measurements should make it clear that this patchset does not add
"a ton of overhead" and does not "penalize all of the other use cases
for XDP".
For the XDP_REDIRECT use-case, I actually find the prefetchw() quite
beautiful: the xdp_buff (which is a stack variable, and known to be hot)
is used by the bpf_prog, so the prefetch has time to fetch the memory
area that the conversion of xdp_buff to xdp_frame writes into.
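To make that write pattern concrete, here is a simplified sketch of the
conversion (the real helper in the patchset, convert_to_xdp_frame() IIRC,
differs in details; the struct layouts below are abbreviated stand-ins,
not the kernel definitions). The point is that the frame metadata is
written into the top of the packet headroom, i.e. the cache-line at
data_hard_start that the prefetchw() targets:

    #include <stddef.h>

    /* Simplified stand-ins for the kernel structs (not the full layouts) */
    struct xdp_buff {
            void *data;
            void *data_end;
            void *data_meta;
            void *data_hard_start;
    };

    struct xdp_frame {
            void *data;
            unsigned int len;
            unsigned int headroom;
            unsigned int metasize;
    };

    /* Sketch of the xdp_buff -> xdp_frame conversion: the xdp_frame is
     * placed at the start of the packet headroom, so this write hits the
     * cache-line that prefetchw(xdp.data_hard_start) has already requested
     * in exclusive/"Modified" state.
     */
    static struct xdp_frame *sketch_convert_to_xdp_frame(struct xdp_buff *xdp)
    {
            struct xdp_frame *xdpf;
            int headroom = (char *)xdp->data - (char *)xdp->data_hard_start;

            if (headroom < (int)sizeof(*xdpf))      /* need room for metadata */
                    return NULL;

            xdpf = xdp->data_hard_start;            /* write into prefetched line */
            xdpf->data     = xdp->data;
            xdpf->len      = (char *)xdp->data_end - (char *)xdp->data;
            xdpf->headroom = headroom - sizeof(*xdpf);
            xdpf->metasize = 0;

            return xdpf;
    }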
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_bench01_mem_access_cost_kern.c