netdev - Re: [net-next, PATCH 1/2, v3] net: socionext: different approach on DMA

Message-ID: <20181001154845.4cd1d5dc@redhat.com>
Date:   Mon, 1 Oct 2018 15:48:45 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Ilias Apalodimas <ilias.apalodimas@...aro.org>
Cc:     netdev@...r.kernel.org, jaswinder.singh@...aro.org,
        ard.biesheuvel@...aro.org, masami.hiramatsu@...aro.org,
        arnd@...db.de, bjorn.topel@...el.com, magnus.karlsson@...el.com,
        daniel@...earbox.net, ast@...nel.org,
        jesus.sanchez-palencia@...el.com, vinicius.gomes@...el.com,
        makita.toshiaki@....ntt.co.jp, Tariq Toukan <tariqt@...lanox.com>,
        Tariq Toukan <ttoukan.linux@...il.com>, brouer@...hat.com
Subject: Re: [net-next, PATCH 1/2, v3] net: socionext: different approach on
 DMA

On Mon, 1 Oct 2018 14:20:21 +0300
Ilias Apalodimas <ilias.apalodimas@...aro.org> wrote:

> On Mon, Oct 01, 2018 at 01:03:13PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 1 Oct 2018 12:56:58 +0300
> > Ilias Apalodimas <ilias.apalodimas@...aro.org> wrote:
> >   
> > > > > #2: You have allocations on the XDP fast-path.
> > > > > 
> > > > > The REAL secret behind the XDP performance is to avoid allocations on
> > > > > the fast-path.  While I just told you to use the page-allocator and
> > > > > order-0 pages, this will actually kill performance.  Thus, to make this
> > > > > fast, you need a driver local recycle scheme that avoids going through
> > > > > the page allocator, which makes XDP_DROP and XDP_TX extremely fast.
> > > > > For the XDP_REDIRECT action (which you seem to be interested in, as
> > > > > this is needed for AF_XDP), there is a xdp_return_frame() API that can
> > > > > make this fast.    
> > > >
> > > > I had an initial implementation that did exactly that (that's why the
> > > > dma_sync_single_for_cpu() -> dma_unmap_single_attrs() is there). In the case 
> > > > of AF_XDP isn't that introducing a 'bottleneck' though? I mean you'll feed fresh
> > > > buffers back to the hardware only when your packets have been processed from
> > > > your userspace application   
> > >
> > > Just a clarification here. This is the case if ZC is implemented. In my case
> > > the buffers will be 'ok' to be passed back to the hardware once the
> > > userspace payload has been copied by xdp_do_redirect()  
> > 
> > Thanks for clarifying.  But no, this is not introducing a 'bottleneck'
> > for AF_XDP.
> > 
> > For (1) the copy-mode-AF_XDP the frame (as you noticed) is "freed" or
> > "returned" very quickly after it is copied.  The code is a bit hard to
> > follow, but in __xsk_rcv() it calls xdp_return_buff() after the memcpy.
> > Thus, the frame can be kept DMA mapped and reused in RX-ring quickly.
>  
> Ok makes sense. I'll send a v4 with page re-usage, while using your
> API for page allocation

Sounds good, BUT do notice that using the bare page_pool will/should
give you increased XDP performance, but might slow down normal network
stack delivery, because the netstack will not call xdp_return_frame()
and instead falls back to returning the pages through the page-allocator.
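
To make the trade-off concrete, here is a minimal userspace sketch, not
kernel code and not the page_pool API. Every toy_* name is hypothetical,
and malloc()/free() merely stand in for the page allocator. It counts how
often the slow allocator is hit when buffers are returned to a
driver-local cache (the XDP-style return) versus freed directly (the
netstack-style fallback).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_CACHE_SIZE 4      /* hypothetical size of the driver-local cache */
#define TOY_PAGE_BYTES 4096   /* stand-in for an order-0 page */

struct toy_pool {
	void *cache[TOY_CACHE_SIZE];
	size_t count;
	unsigned long slow_allocs; /* trips through the "page allocator" */
};

/* Fast path: reuse a cached page; slow path: ask the allocator. */
static void *toy_pool_get(struct toy_pool *p)
{
	if (p->count > 0)
		return p->cache[--p->count];
	p->slow_allocs++;
	return malloc(TOY_PAGE_BYTES);
}

/* XDP-style return: the page goes back into the local cache if it fits. */
static void toy_pool_recycle(struct toy_pool *p, void *page)
{
	if (p->count < TOY_CACHE_SIZE)
		p->cache[p->count++] = page;
	else
		free(page);
}

/* Netstack-style return: the pool is bypassed and the page is freed. */
static void toy_netstack_free(void *page)
{
	free(page);
}

/* Process 'packets' buffers; return how many slow allocations were needed. */
static unsigned long toy_run(int packets, int via_netstack)
{
	struct toy_pool p = { .count = 0, .slow_allocs = 0 };
	int i;

	for (i = 0; i < packets; i++) {
		void *page = toy_pool_get(&p);

		if (!page)
			break;
		if (via_netstack)
			toy_netstack_free(page);
		else
			toy_pool_recycle(&p, page);
	}
	while (p.count > 0)   /* drain the cache so nothing leaks */
		free(p.cache[--p.count]);
	return p.slow_allocs;
}
```

In this toy model toy_run(1000, 0) needs a single slow allocation, while
toy_run(1000, 1) needs one per packet, which is the slowdown described
above.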

I'm very interested in knowing what performance increase you see with
XDP_DROP, with just a "bare" page_pool implementation.

The mlx5 driver does not see this netstack slowdown, because it has a
hybrid approach of maintaining a recycle ring for frames going into the
netstack, by bumping the refcnt.  I think Tariq is cleaning this up.
The mlx5 code is hard to follow... in mlx5e_xdp_handle()[1] the
refcnt==1 and a bit is set. And in [2] the refcnt is bumped with
page_ref_inc(), and the bit is caught in [3].  (This really needs to be
cleaned up and generalized).
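
The refcnt idea can be illustrated with a small userspace model. This is
only a sketch of the general technique under stated assumptions, not the
mlx5 implementation: all toy_* names are hypothetical, and the "page" is
just a struct with a counter. The driver keeps one reference, takes an
extra one when a frame goes to the netstack, parks the page in a recycle
ring, and reuses it only once the count has dropped back to 1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u   /* power of two, so index masking works */

struct toy_page {
	int refcnt;            /* 1 == only the driver holds the page */
};

struct toy_ring {
	struct toy_page *slot[TOY_RING_SIZE];
	unsigned int head, tail;
};

/* Park a page in the recycle ring; returns 0 if the ring is full. */
static int toy_ring_put(struct toy_ring *r, struct toy_page *pg)
{
	if (r->tail - r->head == TOY_RING_SIZE)
		return 0;
	r->slot[r->tail++ & (TOY_RING_SIZE - 1)] = pg;
	return 1;
}

/* Reuse the oldest page, but only if nobody else still references it. */
static struct toy_page *toy_ring_get(struct toy_ring *r)
{
	struct toy_page *pg;

	if (r->tail == r->head)
		return NULL;               /* ring empty */
	pg = r->slot[r->head & (TOY_RING_SIZE - 1)];
	assert(pg->refcnt >= 1);
	if (pg->refcnt != 1)
		return NULL;               /* netstack still owns a reference */
	r->head++;
	return pg;
}

/* Returns 1 if the page could be recycled, 0 if it was still busy. */
static int toy_hybrid_demo(int netstack_done)
{
	struct toy_page pg = { .refcnt = 1 };
	struct toy_ring ring = { .head = 0, .tail = 0 };

	pg.refcnt++;                   /* extra reference for the netstack */
	if (!toy_ring_put(&ring, &pg))
		return -1;
	if (netstack_done)
		pg.refcnt--;               /* netstack released its reference */
	return toy_ring_get(&ring) == &pg;
}
```

Here toy_hybrid_demo(0) reports that the page is still busy, and
toy_hybrid_demo(1) reports that it can be recycled without a trip through
the allocator.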



[1] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L83-L88
    https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L952-L959

[2] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1015-L1025

[3] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1094-L1098

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer