lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJ+HfNhcD22CtH+xF3AUdtwpRW558riWzg3tVPfgEEhWz3gdcA@mail.gmail.com>
Date:   Tue, 23 Apr 2019 08:14:28 +0200
From:   Björn Töpel <bjorn.topel@...il.com>
To:     Jonathan Lemon <jonathan.lemon@...il.com>
Cc:     Netdev <netdev@...r.kernel.org>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>, kernel-team@...com,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        maciej.fijalkowski@...el.com
Subject: Re: [PATCH RFC] xdp: Support zero-copy XDP_TX from AF_XDP sockets.

On Thu, 18 Apr 2019 at 19:55, Jonathan Lemon <jonathan.lemon@...il.com> wrote:
>
> On 18 Apr 2019, at 3:10, Björn Töpel wrote:
>
> > On Wed, 17 Apr 2019 at 21:58, Jonathan Lemon
> > <jonathan.lemon@...il.com> wrote:
> >>
> >> When the XDP program attached to a zero-copy AF_XDP socket returns
> >> XDP_TX,
> >> queue the umem frame on the XDP TX ring, and arrange for it to be
> >> released
> >> via the ZCA free routine, which should place it back onto the reuseq.
> >>
> >
> > There are a bunch of compiler errors, but I'll leave them out from the
> > comments!
>
> Yes, sigh - generated from the wrong tree. I did compile it, honest!
>
>
> >> +static int i40e_xmit_rcvd_zc(struct i40e_ring *rx_ring, struct
> >> xdp_buff *xdp)
> >> +{
> >> +       struct i40e_ring *xdp_ring;
> >> +       struct i40e_tx_desc *tx_desc;
> >> +       struct i40e_tx_buffer *tx_bi;
> >> +       struct xdp_frame *xdpf;
> >> +       dma_addr_t dma;
> >> +
> >> +       xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> >> +
> >> +       if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
> >> +               xdp_ring->tx_stats.tx_busy++;
> >> +               return I40E_XDP_CONSUMED;
> >> +       }
> >> +       xpdf = convert_to_xdp_frame_keep_zc(xdp);
> >> +       if (unlikely(!xpdf)
> >> +               return I40E_XDP_CONSUMED;
> >> +       xpdf->handle = xdp->handle;
> >> +
> >> +       dma = xdp_umem_get_dma(rx_ring->xsk_umem, xdp->handle);
> >> +       tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
> >> +       tx_bi->bytecount = xpdf->len;
> >> +       tx_bi->gso_segs = 1;
> >> +       tx_bi->xdpf = xdpf;
> >> +       tx_bi->tx_flags = I40E_TX_FLAGS_ZC_FRAME;
> >> +
> >> +       tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> >> +       tx_desc->buffer_addr = cpu_to_le64(dma);
> >> +       tx_desc->cmd_type_offset_bsz =
> >> build_ctob(I40E_TX_DESC_CMD_ICRC |
> >> +
> >> I40E_TX_DESC_CMD_EOP,
> >> +                                                 0, xpdf->len, 0);
> >> +       smp_wmb();
> >> +
> >> +       xdp_ring->next_to_use++;
> >> +       if (xdp_ring->next_to_use == xdp_ring->count)
> >> +               xdp_ring->next_to_use = 0;
> >> +
> >> +       tx_bi->next_to_watch = tx_desc;
> >> +
> >> +       return I40E_XDP_TX;
> >> +}
> >
> > What you're basically doing here is a AF_XDP Tx, but triggered from
> > the Rx path, and instead of completion (after the packet is sent) to
> > the completion ring, it's recycled to the Rx HW ring (via zca_free). I
> > like the idea but we need more plumbing first. Let me expand;
> > Unfortunately, the current recycle mechanism requires that at the
> > point of recycling, there has to be space in Rx ring. In the XDP_TX
> > case, there's no completion ring, and we cannot guarantee that there's
> > space on Rx ring (since Rx and Tx are asynchronous).
>
> Yes - this is why it's an RFC. Your current design keeps the RX reuse
> context in the driver itself, instead of using the reuse queue provided
> by umem. I believe the driver only uses it when the ring is torn down,
> so this needs to be addressed.
>

Yes, correct, it's *only* used not to leak descriptors on teardown.

> Since we're transmitting something that was just received, on TX
> completion
> the buffer should logically be placed back into the fill queue, but with
> the
> current SPSC mechanism, this isn't possible. Hence the ReuseQ. I think
> this
> should be transparently be made part of the FQ mechanism, so the RQ is
> always
> checked before using the FQ. I believe xsk_umem_peek_addr_rq() was added
> for this purpose, but should really be the main API.
>

...and here is why I'd like to use the pool instead of using the
(slower) _rq() functions instead. More below. Further, the current _rq
functions are also bounded in size, so we could end-up in the same
scenario here as well; The _rq queue is full *and* the Rx ring is
full, and hence leak. A more robust mechanism is needed (page pool!
:-)).

>
> > IOW, Rx recycling
> > can currently *only* be done in an Rx context.
>
> I was assuming that the Tx was part of the same VSI, so the Rx ring
> would
> be accessible from the napi_poll() routine.
>

Yeah, that's right. Context was not good wording. The problem is that
the napi_poll is broken up into the "Rx part" and the "Tx part", and
the current design doesn't guarantee that there's space for recycling
an Rx descriptor from the "Tx part" (even though they are executed in
the same napi).

>
> > What I would like to do, is moving i40e-xsk to Jesper's page-pool,
> > instead of the existing recycle mechanism. Then we could return the
> > descriptor to the pool, if the Rx ring doesn't have space for the
> > completed/sent buffer.
>
> Hm, not sure how this would help? The xdp_return() goes to the ZCA free
> routine for zero-copy buffers, bypassing the page pool, doesn't it?
>

When the page pool usage would be added, this would be changed to use
the page pool free mechanism instead, and leave per-core
caching/free'ing synchronization and such to the page pool. Again,
this is a bigger task to implement...

>
> > TL;DR version: Passing zc-frames in XDP_TX cannot be done properly
> > until the Rx recycle mechanism is more robust. :-(
>
> Agree. So i40e_zca_free() would look more like:
>
>    if (space in rx_bi)
>      /* locally cache buffer */
>      i40e_reuse_rx_buffer_zc(rx, buf);
>    else
>      xsk_umem_fq_reuse(rx->umem, buf)
>
> and the the two alloc_rx_buffers_*_zc() would need to be adjusted?

...and if the _rq queue is full... :-(



Björn

> --
> Jonathan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ