Message-ID: <20210127160029.73f22659@carbon>
Date: Wed, 27 Jan 2021 16:00:29 +0100
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Toke Høiland-Jørgensen <toke@...hat.com>,
John Fastabend <john.fastabend@...il.com>,
Hangbin Liu <liuhangbin@...il.com>, bpf@...r.kernel.org,
netdev@...r.kernel.org, Jiri Benc <jbenc@...hat.com>,
Eelco Chaudron <echaudro@...hat.com>, ast@...nel.org,
Daniel Borkmann <daniel@...earbox.net>,
Lorenzo Bianconi <lorenzo.bianconi@...hat.com>,
David Ahern <dsahern@...il.com>,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
brouer@...hat.com
Subject: Re: [PATCHv17 bpf-next 1/6] bpf: run devmap xdp_prog on flush
instead of bulk enqueue
On Wed, 27 Jan 2021 13:20:50 +0100
Maciej Fijalkowski <maciej.fijalkowski@...el.com> wrote:
> On Wed, Jan 27, 2021 at 10:41:44AM +0100, Toke Høiland-Jørgensen wrote:
> > John Fastabend <john.fastabend@...il.com> writes:
> >
> > > Hangbin Liu wrote:
> > >> From: Jesper Dangaard Brouer <brouer@...hat.com>
> > >>
> > >> This changes the devmap XDP program support to run the program when the
> > >> bulk queue is flushed instead of before the frame is enqueued. This has
> > >> a couple of benefits:
> > >>
> > >> - It "sorts" the packets by destination devmap entry, and then runs the
> > >> same BPF program on all the packets in sequence. This ensures that we
> > >> keep the XDP program and destination device properties hot in I-cache.
> > >>
> > >> - It makes the multicast implementation simpler because it can just
> > >> enqueue packets using bq_enqueue() without having to deal with the
> > >> devmap program at all.
> > >>
> > >> The drawback is that if the devmap program drops the packet, the enqueue
> > >> step is redundant. However, arguably this is mostly visible in a
> > >> micro-benchmark, and with more mixed traffic the I-cache benefit should
> > >> win out. The performance impact of just this patch is shown in the
> > >> table below.
> > >>
> > >> bq_xmit_all()'s logic is also refactored and the error label is removed.
> > >> When bq_xmit_all() is called from bq_enqueue(), another packet will
> > >> always be enqueued immediately after, so clearing dev_rx, xdp_prog and
> > >> flush_node in bq_xmit_all() is redundant. Let's move the clearing to
> > >> __dev_flush(), and only check them once in bq_enqueue(), since they are
> > >> all modified together.
> > >>
> > >> Tested by using xdp_redirect_map in samples/bpf and sending pkts via the pktgen cmd:
> > >> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
> > >>
> > >> There is about +/- 0.1M deviation in the native tests; performance
> > >> improved for the base case, but dropped back somewhat with an xdp devmap
> > >> prog attached.
> > >>
> > >> Version          | Test                        | Generic | Native | Native + 2nd xdp_prog
> > >> 5.10 rc6         | xdp_redirect_map i40e->i40e | 2.0M    | 9.1M   | 8.0M
> > >> 5.10 rc6         | xdp_redirect_map i40e->veth | 1.7M    | 11.0M  | 9.7M
> > >> 5.10 rc6 + patch | xdp_redirect_map i40e->i40e | 2.0M    | 9.5M   | 7.5M
> > >> 5.10 rc6 + patch | xdp_redirect_map i40e->veth | 1.7M    | 11.6M  | 9.1M
> > >>
> > >
> > > [...]
> > >
> > >> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
> > >> + struct xdp_frame **frames, int n,
> > >> + struct net_device *dev)
> > >> +{
> > >> + struct xdp_txq_info txq = { .dev = dev };
> > >> + struct xdp_buff xdp;
> > >> + int i, nframes = 0;
> > >> +
> > >> + for (i = 0; i < n; i++) {
> > >> + struct xdp_frame *xdpf = frames[i];
> > >> + u32 act;
> > >> + int err;
> > >> +
> > >> + xdp_convert_frame_to_buff(xdpf, &xdp);
> > >> + xdp.txq = &txq;
> > >> +
> > >> + act = bpf_prog_run_xdp(xdp_prog, &xdp);
> > >> + switch (act) {
> > >> + case XDP_PASS:
> > >> + err = xdp_update_frame_from_buff(&xdp, xdpf);
> > >> + if (unlikely(err < 0))
> > >> + xdp_return_frame_rx_napi(xdpf);
> > >> + else
> > >> + frames[nframes++] = xdpf;
> > >> + break;
> > >> + default:
> > >> + bpf_warn_invalid_xdp_action(act);
> > >> + fallthrough;
> > >> + case XDP_ABORTED:
> > >> + trace_xdp_exception(dev, xdp_prog, act);
> > >> + fallthrough;
> > >> + case XDP_DROP:
> > >> + xdp_return_frame_rx_napi(xdpf);
> > >> + break;
> > >> + }
> > >> + }
> > >> + return nframes; /* sent frames count */
> > >> +}
> > >> +
> > >> static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > >> {
> > >> struct net_device *dev = bq->dev;
> > >> - int sent = 0, drops = 0, err = 0;
> > >> + unsigned int cnt = bq->count;
> > >> + int drops = 0, err = 0;
> > >> + int to_send = cnt;
> > >> + int sent = cnt;
> > >> int i;
> > >>
> > >> - if (unlikely(!bq->count))
> > >> + if (unlikely(!cnt))
> > >> return;
> > >>
> > >> - for (i = 0; i < bq->count; i++) {
> > >> + for (i = 0; i < cnt; i++) {
> > >> struct xdp_frame *xdpf = bq->q[i];
> > >>
> > >> prefetch(xdpf);
> > >> }
> > >>
> > >> - sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > >> + if (bq->xdp_prog) {
> > >> + to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > >> + if (!to_send) {
> > >> + sent = 0;
> > >> + goto out;
> > >> + }
> > >> + drops = cnt - to_send;
> > >> + }
> > >
> > > I might be missing something about how *bq works here. What happens when
> > > dev_map_bpf_prog_run returns to_send < cnt?
> > >
> > > So I read this as it will send [0, to_send] and [to_send, cnt] will be
> > > dropped? How do we know the bpf prog would have dropped the set,
> > > [to_send+1, cnt]?
>
> You know that via the recalculation of the 'drops' value after returning from
> dev_map_bpf_prog_run(), which is later passed on to trace_xdp_devmap_xmit.
>
> >
> > Because dev_map_bpf_prog_run() compacts the array:
> >
> > + case XDP_PASS:
> > + err = xdp_update_frame_from_buff(&xdp, xdpf);
> > + if (unlikely(err < 0))
> > + xdp_return_frame_rx_napi(xdpf);
> > + else
> > + frames[nframes++] = xdpf;
> > + break;
>
> To expand on this a little: the 'frames' array is reused, and 'nframes'
> above is the return value, which we store in the 'to_send' variable.
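
To make the compaction concrete, here is a small worked example
(hypothetical frame names, cnt = 4, and the devmap prog dropping the
2nd frame):

  in:   bq->q = {A, B, C, D}, cnt = 4
  prog: B gets XDP_DROP -> xdp_return_frame_rx_napi(B)
  out:  bq->q = {A, C, D, D}, nframes = 3 (stored in 'to_send')

bq_xmit_all() then only hands bq->q[0..2] to ndo_xdp_xmit(); bq->q[3]
still holds a stale pointer, but it is never looked at. The difference
drops = cnt - to_send = 1 is what gets reported to
trace_xdp_devmap_xmit.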
>
> >
> > [...]
> >
> > >> int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
> > >> @@ -489,12 +516,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
> > >> {
> > >> struct net_device *dev = dst->dev;
> > >>
> > >> - if (dst->xdp_prog) {
> > >> - xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);
> > >> - if (!xdp)
> > >> - return 0;
> > >
> > > So here it looks like dev_map_run_prog will not drop extra
> > > packets, but will see every single packet.
> > >
> > > Are we subtly changing the semantics here? This looks like
> > > a problem to me. We should not drop packets in the new case
> > > unless the bpf program tells us to.
> >
> > It's not a change in semantics (see above), but I'll grant you that it's
> > subtle :)
>
> The devmap xdp prog still sees all of the frames.
>
> Maybe you were referring to the fact that for the XDP_PASS action the
> xdp->xdpf conversion might fail?
>
> I'm wondering if we could actually do a further optimization and avoid
> xdpf/xdp juggling.
>
> What if xdp_dev_bulk_queue stored xdp_buffs instead of
> xdp_frames?
Not possible. Remember that struct xdp_buff is "allocated" on the call
stack. Thus, you cannot store pointers to the xdp_buffs in
xdp_dev_bulk_queue.
The xdp_frame also avoids allocation, by using memory placed at the top
of the data frame. Thus, you can store a pointer to the xdp_frame, as it
is actually backed by real memory.

See [1], slide 11 ("Fundamental structs").
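
To make that concrete, here is a slightly simplified sketch of how
xdp_convert_frame_to_buff() rebuilds the on-stack xdp_buff from the
xdp_frame living in the packet's own headroom (see include/net/xdp.h
for the authoritative version):

/* The xdp_frame sits at the top of the packet headroom, so the
 * "conversion" is just pointer arithmetic on the packet memory;
 * no allocation happens in either direction.
 */
static inline
void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
{
	xdp->data_hard_start = frame->data - frame->headroom - sizeof(*frame);
	xdp->data = frame->data;
	xdp->data_end = frame->data + frame->len;
	xdp->data_meta = frame->data - frame->metasize;
	xdp->frame_sz = frame->frame_sz;
}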
> Then you hit bq_xmit_all() and, if a prog is present, it doesn't have to
> do the dance we have right now. After that you walk through the xdp_buff
> array and do the conversion so that an xdp_frame array can be passed to
> ndo_xdp_xmit().
If you want to performance-optimize this, I suggest instead that we
detect whether we need to call xdp_update_frame_from_buff(&xdp, xdpf)
after the 2nd XDP prog has run. In many cases the BPF prog doesn't move
head/tail/metadata, so that call becomes unnecessary. A rough sketch of
the idea is below.
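
An untested sketch of that idea, on top of dev_map_bpf_prog_run() from
this patch (the orig_* locals are only there for illustration):

	for (i = 0; i < n; i++) {
		struct xdp_frame *xdpf = frames[i];
		void *orig_data, *orig_data_end, *orig_data_meta;
		u32 act;
		int err;

		xdp_convert_frame_to_buff(xdpf, &xdp);
		xdp.txq = &txq;
		orig_data      = xdp.data;
		orig_data_end  = xdp.data_end;
		orig_data_meta = xdp.data_meta;

		act = bpf_prog_run_xdp(xdp_prog, &xdp);
		switch (act) {
		case XDP_PASS:
			if (xdp.data == orig_data &&
			    xdp.data_end == orig_data_end &&
			    xdp.data_meta == orig_data_meta) {
				/* Prog didn't move head/tail/meta;
				 * the frame is still valid as-is.
				 */
				frames[nframes++] = xdpf;
				break;
			}
			err = xdp_update_frame_from_buff(&xdp, xdpf);
			if (unlikely(err < 0))
				xdp_return_frame_rx_napi(xdpf);
			else
				frames[nframes++] = xdpf;
			break;
		default:
			bpf_warn_invalid_xdp_action(act);
			fallthrough;
		case XDP_ABORTED:
			trace_xdp_exception(dev, xdp_prog, act);
			fallthrough;
		case XDP_DROP:
			xdp_return_frame_rx_napi(xdpf);
			break;
		}
	}

Whether the three extra compares actually beat the unconditional
xdp_update_frame_from_buff() call would of course need benchmarking.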
> I slept badly, so maybe I'm talking nonsense over here; I'll take
> another look in the evening though :)
:)
[1] https://people.netfilter.org/hawk/presentations/KernelRecipes2019/xdp-netstack-concert.pdf
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer