Date:   Mon, 1 Oct 2018 17:58:02 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Ilias Apalodimas <ilias.apalodimas@...aro.org>
Cc:     netdev@...r.kernel.org, jaswinder.singh@...aro.org,
        ard.biesheuvel@...aro.org, masami.hiramatsu@...aro.org,
        arnd@...db.de, bjorn.topel@...el.com, magnus.karlsson@...el.com,
        daniel@...earbox.net, ast@...nel.org,
        jesus.sanchez-palencia@...el.com, vinicius.gomes@...el.com,
        makita.toshiaki@....ntt.co.jp, Tariq Toukan <tariqt@...lanox.com>,
        Tariq Toukan <ttoukan.linux@...il.com>, brouer@...hat.com
Subject: Re: [net-next, PATCH 1/2, v3] net: socionext: different approach on
 DMA

On Mon, 1 Oct 2018 17:37:06 +0300
Ilias Apalodimas <ilias.apalodimas@...aro.org> wrote:

> On Mon, Oct 01, 2018 at 03:48:45PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 1 Oct 2018 14:20:21 +0300
> > Ilias Apalodimas <ilias.apalodimas@...aro.org> wrote:
> >   
> > > On Mon, Oct 01, 2018 at 01:03:13PM +0200, Jesper Dangaard Brouer wrote:  
> > > > On Mon, 1 Oct 2018 12:56:58 +0300
> > > > Ilias Apalodimas <ilias.apalodimas@...aro.org> wrote:
> > > >     
> > > > > > > #2: You have allocations on the XDP fast-path.
> > > > > > > 
> > > > > > > The REAL secret behind the XDP performance is to avoid allocations on
> > > > > > > the fast-path.  While I just told you to use the page-allocator and
> > > > > > > order-0 pages, this will actually kill performance.  Thus, to make this
> > > > > > > fast, you need a driver local recycle scheme that avoids going through
> > > > > > > the page allocator, which makes XDP_DROP and XDP_TX extremely fast.
> > > > > > > For the XDP_REDIRECT action (which you seem to be interested in, as
> > > > > > > this is needed for AF_XDP), there is an xdp_return_frame() API that can
> > > > > > > make this fast.      
> > > > > >
> > > > > > I had an initial implementation that did exactly that (that's why the
> > > > > > dma_sync_single_for_cpu() -> dma_unmap_single_attrs() is there). In the case
> > > > > > of AF_XDP isn't that introducing a 'bottleneck' though? I mean you'll feed fresh
> > > > > > buffers back to the hardware only when your packets have been processed by
> > > > > > your userspace application
> > > > >
> > > > > Just a clarification here. This is the case if ZC is implemented. In my case
> > > > > the buffers will be 'ok' to be passed back to the hardware once the
> > > > > userspace payload has been copied by xdp_do_redirect()
> > > > 
> > > > Thanks for clarifying.  But no, this is not introducing a 'bottleneck'
> > > > for AF_XDP.
> > > > 
> > > > For (1) the copy-mode-AF_XDP the frame (as you noticed) is "freed" or
> > > > "returned" very quickly after it is copied.  The code is a bit hard to
> > > > follow, but in __xsk_rcv() it calls xdp_return_buff() after the memcpy.
> > > > Thus, the frame can be kept DMA mapped and reused in RX-ring quickly.  
> > >  
> > > Ok, makes sense. I'll send a v4 with page reuse, while using your
> > > API for page allocation
> > 
> > Sounds good, BUT do notice that using the bare page_pool will/should
> > give you increased XDP performance, but might slow down normal network
> > stack delivery, because the netstack will not call xdp_return_frame() and
> > instead falls back to returning the pages through the page allocator.
> > 
> > I'm very interested in knowing what performance increase you see with
> > XDP_DROP, with just a "bare" page_pool implementation.  
>
> When I was just syncing the page fragments instead of unmap -> alloc -> map I
> was getting ~340kpps (with XDP_REDIRECT). I ended up with 320kpps on this patch.

I explicitly asked for the XDP_DROP performance... because it will tell
me/us if your hardware is actually the bottleneck.  For your specific
hardware, you might be limited by the cost of DMA-sync.  It might be
faster to use the DMA-map/unmap calls(?).

I'm hinting you should take one step at a time, and measure.  Knowing
and identifying the limits is essential; else you are doing blind
optimizations. If you don't know the HW limit, then you don't know what
the gap is to optimum (and then you don't know when to stop optimizing).
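To make the request concrete: the XDP_DROP test only needs a trivial
program like the one below (a sketch; the section name and header paths
follow the usual libbpf conventions, adjust to whatever loader you use).
Attaching it and blasting the NIC with a packet generator shows the raw
RX+drop rate, i.e. the HW/driver ceiling with no allocator in the path:

```c
/* xdp_drop.c - minimal XDP program that drops every packet.
 * Useful purely as a baseline: it measures the driver RX path
 * without any allocation, copy, or redirect overhead.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_prog(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
```

Compile with clang targeting BPF (e.g. clang -O2 -target bpf -c xdp_drop.c)
and attach with your tool of choice (ip link set dev ... xdp obj ...).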


> I did a couple more changes though (like the DMA mapping when allocating
> the buffers) so I am not 100% sure what caused that difference.
> I'll let you know once I finish up the code using the API for page allocation
>
> Regarding the change and the 'bottleneck' discussion we had. XDP_REDIRECT is
> straightforward (non-ZC mode). I agree with you that since the payload is
> pretty much immediately copied before being flushed to userspace, it's
> unlikely you'll end up delaying the hardware (starving it of buffers).
> Do you think that's the same for XDP_TX? The DMA buffer will need to be synced
> for the CPU, then you ring a doorbell with X packets. After that you'll have to
> wait for the Tx completion and resync the buffers to the device. So you actually
> make your Rx descriptors dependent on your Tx completion (and keep in mind this
> NIC only has 1 queue per direction)

The page_pool will cache pages (it has a pool of pages) that should be
large enough to handle some pages being outstanding in the TX completion
queue.
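For context, a driver typically sizes the pool once at setup time, along
these lines (a sketch against the page_pool API in net-next at the time;
field values are illustrative, and the helper name is mine):

```c
#include <net/page_pool.h>

/* Sketch: create a page_pool sized to cover the RX ring plus a margin
 * for pages still outstanding in the TX completion queue, so the
 * fast-path recycle never has to fall back to the page allocator.
 */
static struct page_pool *setup_rx_page_pool(struct device *dev,
					    unsigned int rx_ring_size)
{
	struct page_pool_params pp_params = {
		.order		= 0,		/* order-0 pages only */
		.flags		= 0,
		.pool_size	= rx_ring_size,	/* plus TX-outstanding margin */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
	};

	/* Returns ERR_PTR() on failure, so check with IS_ERR() */
	return page_pool_create(&pp_params);
}
```

Pages then come from page_pool_alloc_pages() on refill and go back via
the xdp_return_frame() path instead of the page allocator.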

>
> Now for the measurements part, I'll have to check with the vendor if the
> interface can do more than 340kpps and we are missing something
> performance-wise.

Yes, please. The XDP_DROP test I requested above is exactly an attempt
to determine what the NIC HW limits are... else you are working blindly.


> Have you done any tests with IOMMU enabled/disabled? In theory the DMA recycling
> will shine against map/unmap when the IOMMU is on (and the IOMMU is stressed, i.e.
> with a different NIC doing a traffic test)

Nope, I could not get the IOMMU working on my testlab, last time I
tried to activate it.  Hence, why I have not implemented/optimized the DMA
map/unmap stuff too much (e.g. mlx5 currently does a DMA unmap for
XDP_REDIRECT, which should be fixed).
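For the record, the recycle idea is to map the buffer once at allocation
time and then only flip ownership with sync calls on the fast-path,
roughly like this (variable names are illustrative, not from the
socionext driver; the buffer stays DMA mapped for its whole lifetime):

```c
/* RX completion: hand the buffer over to the CPU without unmapping */
dma_sync_single_for_cpu(dev, buf_dma, buf_len, DMA_FROM_DEVICE);

/* ... run the XDP program / netstack on the buffer ... */

/* Refill: hand the (still mapped) buffer back to the device */
dma_sync_single_for_device(dev, buf_dma, buf_len, DMA_FROM_DEVICE);
```

With an IOMMU enabled, each map/unmap also costs an IOTLB update, which
is exactly why the sync-only recycle scheme should win in that setup.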
 
 
> > The mlx5 driver does not see this netstack slowdown, because it
> > has a hybrid approach of maintaining a recycle ring for frames
> > going into the netstack, by bumping the refcnt.  I think Tariq is
> > cleaning this up. The mlx5 code is hard to follow... in
> > mlx5e_xdp_handle()[1] the refcnt==1 and a bit is set. And in [2]
> > the refcnt is bumped via page_ref_inc(), and the bit is caught in [3].  (This
> > really needs to be cleaned up and generalized.)
>
> I've read most of the XDP related code on Intel/Mellanox
> before starting my patch series. I'll have a closer look now, thanks!
> > 
> > 
> > 
> > [1]
> > https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L83-L88
> > https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L952-L959
> > 
> > [2]
> > https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1015-L1025
> > 
> > [3]
> > https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1094-L1098



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
