Message-ID: <20181001094450.GA24329@apalos>
Date: Mon, 1 Oct 2018 12:44:50 +0300
From: Ilias Apalodimas <ilias.apalodimas@...aro.org>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: netdev@...r.kernel.org, jaswinder.singh@...aro.org,
ard.biesheuvel@...aro.org, masami.hiramatsu@...aro.org,
arnd@...db.de, bjorn.topel@...el.com, magnus.karlsson@...el.com,
daniel@...earbox.net, ast@...nel.org,
jesus.sanchez-palencia@...el.com, vinicius.gomes@...el.com,
makita.toshiaki@....ntt.co.jp, Tariq Toukan <tariqt@...lanox.com>,
Tariq Toukan <ttoukan.linux@...il.com>
Subject: Re: [net-next, PATCH 1/2, v3] net: socionext: different approach on
DMA
On Mon, Oct 01, 2018 at 11:26:31AM +0200, Jesper Dangaard Brouer wrote:
>
> On Sat, 29 Sep 2018 14:28:01 +0300 Ilias Apalodimas <ilias.apalodimas@...aro.org> wrote:
>
> > +static void *netsec_alloc_rx_data(struct netsec_priv *priv,
> > + dma_addr_t *dma_handle, u16 *desc_len)
> > +{
> > + size_t len = priv->ndev->mtu + ETH_HLEN + 2 * VLAN_HLEN + NET_SKB_PAD +
> > + NET_IP_ALIGN;
> > + dma_addr_t mapping;
> > + void *buf;
> > +
> > + len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > + len = SKB_DATA_ALIGN(len);
> > +
> > + buf = napi_alloc_frag(len);
>
> Using napi_alloc_frag here ^^^^
>
> > + if (!buf)
> > + return NULL;
> > +
> > + mapping = dma_map_single(priv->dev, buf, len, DMA_FROM_DEVICE);
> > + if (unlikely(dma_mapping_error(priv->dev, mapping)))
> > + goto err_out;
> > +
> > + *dma_handle = mapping;
> > + *desc_len = len;
> > +
> > + return buf;
> > +
> > +err_out:
> > + skb_free_frag(buf);
> > + return NULL;
> > +}
>
> Hmmm, you are using napi_alloc_frag() in above code, which behind
> your-back allocates order-3 pages (32 Kbytes memory in 8 order-0 pages).
>
> This violates at least two XDP principles:
>
> #1: You are NOT using order-0 page based allocations for XDP.
>
> Notice, I'm not saying 1-page per packet, as ixgbe + i40e violated
> this, and it is now "per-practical-code-example" acceptable to split up
> the order-0 page, and store two RX frames per order-0 page (4096 bytes).
> (To make this fit you have to reduce XDP_HEADROOM to 192 bytes, which
> killed the idea of placing the SKB in this area).
Yes, I saw the Intel implementation. I just thought it wasn't worth the hassle
for a 1Gbit interface (but I wasn't aware it violates an XDP principle).
I also noticed that Netronome (and others) allocate 1 page per packet
when using XDP.
>
> #2: You have allocations on the XDP fast-path.
>
> The REAL secret behind the XDP performance is to avoid allocations on
> the fast-path. While I just told you to use the page-allocator and
> order-0 pages, this will actually kill performance. Thus, to make this
> fast, you need a driver local recycle scheme that avoids going through
> the page allocator, which makes XDP_DROP and XDP_TX extremely fast.
> For the XDP_REDIRECT action (which you seem to be interested in, as
> this is needed for AF_XDP), there is a xdp_return_frame() API that can
> make this fast.
I had an initial implementation that did exactly that (that's why the
dma_sync_single_for_cpu() -> dma_unmap_single_attrs() is there). In the case
of AF_XDP isn't that introducing a 'bottleneck' though? I mean you'll feed fresh
buffers back to the hardware only when your packets have been processed by
your userspace application.
>
> To avoid every driver inventing its own driver-local page-recycle
> cache (which many do today), we/I have created the page pool API.
> See include/net/page_pool.h, and look at how mlx5 driver uses it
> in v4.18, links [1][2][3]. Do notice that mlx5 ALSO has a driver
> recycle scheme on top, which Tariq is working on removing or
> generalizing. AND also that mlx5 does not yet use the DMA mapping
> feature that page_pool also provides. (Contact me if you want to use
> page_pool for handling DMA mapping; we might need to export
> __page_pool_clean_page and call it before the XDP_PASS action).
Ok, I'll have a look at that and let you know.
P.S.: A few months back we reported that Chelsio is using a different
'memory scheme' for incoming packets. Essentially they just feed the hardware
with unstructured pages and it decides where to place
the packet. Maybe that's worth considering for the page pool API?
>
>
> [1] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L226
> [2] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L255
> [3] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_main.c#L598-L618
Thanks
/Ilias