Message-ID: <CAMArcTVY+8rVtnYronP4Ud6T0S1eSgQX3N0TK_BFYjiBxDaSyA@mail.gmail.com>
Date: Sat, 2 Nov 2024 03:24:22 +0900
From: Taehee Yoo <ap420073@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
edumazet@...gle.com, donald.hunter@...il.com, corbet@....net,
michael.chan@...adcom.com, andrew+netdev@...n.ch, hawk@...nel.org,
ilias.apalodimas@...aro.org, ast@...nel.org, daniel@...earbox.net,
john.fastabend@...il.com, dw@...idwei.uk, sdf@...ichev.me,
asml.silence@...il.com, brett.creeley@....com, linux-doc@...r.kernel.org,
netdev@...r.kernel.org, kory.maincent@...tlin.com,
maxime.chevallier@...tlin.com, danieller@...dia.com, hengqi@...ux.alibaba.com,
ecree.xilinx@...il.com, przemyslaw.kitszel@...el.com, hkallweit1@...il.com,
ahmed.zaki@...el.com, rrameshbabu@...dia.com, idosch@...dia.com,
jiri@...nulli.us, bigeasy@...utronix.de, lorenzo@...nel.org,
jdamato@...tly.com, aleksander.lobakin@...el.com, kaiyuanz@...gle.com,
willemb@...gle.com, daniel.zahka@...il.com
Subject: Re: [PATCH net-next v4 8/8] bnxt_en: add support for device memory tcp
On Fri, Nov 1, 2024 at 11:53 PM Mina Almasry <almasrymina@...gle.com> wrote:
>
> On Tue, Oct 22, 2024 at 9:25 AM Taehee Yoo <ap420073@...il.com> wrote:
> >
> > Currently, the bnxt_en driver satisfies the requirements of Device
> > memory TCP, namely tcp-data-split.
> > So, this patch implements Device memory TCP for the bnxt_en driver.
> >
> > From now on, the aggregation ring handles netmem_ref instead of page,
> > regardless of whether netmem is enabled.
> > So, for the aggregation ring, memory is handled with the netmem
> > page_pool API instead of the generic page_pool API.
> >
> > If devmem is enabled, the netmem_ref is used as-is; if devmem is not
> > enabled, the netmem_ref is converted to a page, and that page is used.
> >
> > The driver recognizes whether devmem is set or unset based on whether
> > mp_params.mp_priv is NULL.
> > Only if devmem is set does it pass PP_FLAG_ALLOW_UNREADABLE_NETMEM.
>
> Looks like in the latest version, you pass
> PP_FLAG_ALLOW_UNREADABLE_NETMEM unconditionally, so this line is
> obsolete.
Okay, I will remove this line.
>
> However, I think you should only pass PP_FLAG_ALLOW_UNREADABLE_NETMEM
> if hds_thresh==0 and tcp-data-split==1, because otherwise the driver
> is not configured well enough to handle unreadable netmem, right? I
> know that we added checks in the devmem binding to detect hds_thresh
> and tcp-data-split, but we should keep another layer of protection in
> the driver. The driver should not set PP_FLAG_ALLOW_UNREADABLE_NETMEM
> unless it's configured to be able to handle unreadable netmem.
Okay, I agree. I will pass PP_FLAG_ALLOW_UNREADABLE_NETMEM
only when hds_thresh == 0 and tcp-data-split == 1.
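Something like the following in bnxt_alloc_rx_page_pool() (a rough
sketch only; the tcp_data_split and hds_thresh variables stand in for
however the driver ends up tracking the ethtool state, they are not
the final field names):

	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;

	/* Only advertise unreadable netmem support when header/data
	 * split is forced on and the HDS threshold is 0, i.e. every
	 * payload byte lands in the agg ring and the CPU never has to
	 * touch it.
	 */
	if (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
	    hds_thresh == 0)
		pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;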
>
> >
> > Tested-by: Stanislav Fomichev <sdf@...ichev.me>
> > Signed-off-by: Taehee Yoo <ap420073@...il.com>
> > ---
> >
> > v4:
> > - Do not select NET_DEVMEM in Kconfig.
> > - Pass PP_FLAG_ALLOW_UNREADABLE_NETMEM flag unconditionally.
> > - Add __bnxt_rx_agg_pages_xdp().
> > - Use gfp flag in __bnxt_alloc_rx_netmem().
> > - Do not add *offset in the __bnxt_alloc_rx_netmem().
> > - Do not pass queue_idx to bnxt_alloc_rx_page_pool().
> > - Add Test tag from Stanislav.
> > - Add page_pool_recycle_direct_netmem() helper.
> >
> > v3:
> > - Patch added.
> >
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 182 ++++++++++++++++------
> > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 +-
> > include/net/page_pool/helpers.h | 6 +
> > 3 files changed, 142 insertions(+), 48 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index 7d9da483b867..7924b1da0413 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -55,6 +55,7 @@
> > #include <net/page_pool/helpers.h>
> > #include <linux/align.h>
> > #include <net/netdev_queues.h>
> > +#include <net/netdev_rx_queue.h>
> >
> > #include "bnxt_hsi.h"
> > #include "bnxt.h"
> > @@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
> > bnapi->events &= ~BNXT_TX_CMP_EVENT;
> > }
> >
> > +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> > + struct bnxt_rx_ring_info *rxr,
> > + unsigned int *offset,
> > + gfp_t gfp)
> > +{
> > + netmem_ref netmem;
> > +
> > + netmem = page_pool_alloc_netmem(rxr->page_pool, gfp);
> > + if (!netmem)
> > + return 0;
> > + *offset = 0;
> > +
> > + *mapping = page_pool_get_dma_addr_netmem(netmem);
> > + return netmem;
> > +}
> > +
> > static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> > struct bnxt_rx_ring_info *rxr,
> > unsigned int *offset,
> > @@ -972,21 +989,21 @@ static inline u16 bnxt_find_next_agg_idx(struct bnxt_rx_ring_info *rxr, u16 idx)
> > return next;
> > }
> >
> > -static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> > - struct bnxt_rx_ring_info *rxr,
> > - u16 prod, gfp_t gfp)
> > +static inline int bnxt_alloc_rx_netmem(struct bnxt *bp,
> > + struct bnxt_rx_ring_info *rxr,
> > + u16 prod, gfp_t gfp)
> > {
> > struct rx_bd *rxbd =
> > &rxr->rx_agg_desc_ring[RX_AGG_RING(bp, prod)][RX_IDX(prod)];
> > struct bnxt_sw_rx_agg_bd *rx_agg_buf;
> > - struct page *page;
> > - dma_addr_t mapping;
> > u16 sw_prod = rxr->rx_sw_agg_prod;
> > unsigned int offset = 0;
> > + dma_addr_t mapping;
> > + netmem_ref netmem;
> >
> > - page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp);
> > + netmem = __bnxt_alloc_rx_netmem(bp, &mapping, rxr, &offset, gfp);
> >
> > - if (!page)
> > + if (!netmem)
> > return -ENOMEM;
> >
> > if (unlikely(test_bit(sw_prod, rxr->rx_agg_bmap)))
> > @@ -996,7 +1013,7 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> > rx_agg_buf = &rxr->rx_agg_ring[sw_prod];
> > rxr->rx_sw_agg_prod = RING_RX_AGG(bp, NEXT_RX_AGG(sw_prod));
> >
> > - rx_agg_buf->page = page;
> > + rx_agg_buf->netmem = netmem;
> > rx_agg_buf->offset = offset;
> > rx_agg_buf->mapping = mapping;
> > rxbd->rx_bd_haddr = cpu_to_le64(mapping);
> > @@ -1044,7 +1061,7 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
> > struct rx_agg_cmp *agg;
> > struct bnxt_sw_rx_agg_bd *cons_rx_buf, *prod_rx_buf;
> > struct rx_bd *prod_bd;
> > - struct page *page;
> > + netmem_ref netmem;
> >
> > if (p5_tpa)
> > agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, start + i);
> > @@ -1061,11 +1078,11 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
> > cons_rx_buf = &rxr->rx_agg_ring[cons];
> >
> > /* It is possible for sw_prod to be equal to cons, so
> > - * set cons_rx_buf->page to NULL first.
> > + * set cons_rx_buf->netmem to 0 first.
> > */
> > - page = cons_rx_buf->page;
> > - cons_rx_buf->page = NULL;
> > - prod_rx_buf->page = page;
> > + netmem = cons_rx_buf->netmem;
> > + cons_rx_buf->netmem = 0;
> > + prod_rx_buf->netmem = netmem;
> > prod_rx_buf->offset = cons_rx_buf->offset;
> >
> > prod_rx_buf->mapping = cons_rx_buf->mapping;
> > @@ -1190,29 +1207,104 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
> > return skb;
> > }
> >
> > -static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > - struct bnxt_cp_ring_info *cpr,
> > - struct skb_shared_info *shinfo,
> > - u16 idx, u32 agg_bufs, bool tpa,
> > - struct xdp_buff *xdp)
> > +static bool __bnxt_rx_agg_pages_skb(struct bnxt *bp,
> > + struct bnxt_cp_ring_info *cpr,
> > + struct sk_buff *skb,
> > + u16 idx, u32 agg_bufs, bool tpa)
> > {
>
> To be honest I could not immediately understand why
> __bnxt_rx_agg_pages needed to be split into __bnxt_rx_agg_pages_skb
> and __bnxt_rx_agg_pages_xdp.
>
> Fundamentally speaking, we wanted the netmem transition to be as smooth
> and low-churn as possible for drivers. The only big change in this
> patch is the split between skb and xdp. That maybe points to a problem
> in the design of netmem.
>
> For xdp, core makes sure that if xdp is enabled on the device, then
> the netmem is always pages (never unreadable). So I think netmem
> should be able to handle xdp as well as skb. Can you give more details
> on why the split?
During review of the v3 patch, there was a suggestion to refactor by
separating the skb path and the xdp path, so I made that change.
As you point out, though, separating the skb path and the xdp path is
not directly related to the purpose of this patch.
I agree with separating them, but it does not need to be included in
this patchset.
I will revert it.
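That is, back to something like the original single helper (signature
taken from the code this patch removes), with the xdp path passing its
xdp_buff and the skb path passing NULL:

	static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
				       struct bnxt_cp_ring_info *cpr,
				       struct skb_shared_info *shinfo,
				       u16 idx, u32 agg_bufs, bool tpa,
				       struct xdp_buff *xdp);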
>
> > struct bnxt_napi *bnapi = cpr->bnapi;
> > struct pci_dev *pdev = bp->pdev;
> > - struct bnxt_rx_ring_info *rxr = bnapi->rx_ring;
> > - u16 prod = rxr->rx_agg_prod;
> > + struct bnxt_rx_ring_info *rxr;
> > u32 i, total_frag_len = 0;
> > bool p5_tpa = false;
> > + u16 prod;
> > +
> > + rxr = bnapi->rx_ring;
> > + prod = rxr->rx_agg_prod;
> >
> > if ((bp->flags & BNXT_FLAG_CHIP_P5_PLUS) && tpa)
> > p5_tpa = true;
> >
> > for (i = 0; i < agg_bufs; i++) {
> > - skb_frag_t *frag = &shinfo->frags[i];
> > - u16 cons, frag_len;
> > + struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> > struct rx_agg_cmp *agg;
> > + u16 cons, frag_len;
> > + dma_addr_t mapping;
> > + netmem_ref netmem;
> > +
> > + if (p5_tpa)
> > + agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, i);
> > + else
> > + agg = bnxt_get_agg(bp, cpr, idx, i);
> > + cons = agg->rx_agg_cmp_opaque;
> > + frag_len = (le32_to_cpu(agg->rx_agg_cmp_len_flags_type) &
> > + RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
> > +
> > + cons_rx_buf = &rxr->rx_agg_ring[cons];
> > + skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> > + cons_rx_buf->offset, frag_len,
> > + BNXT_RX_PAGE_SIZE);
> > + __clear_bit(cons, rxr->rx_agg_bmap);
> > +
> > + /* It is possible for bnxt_alloc_rx_netmem() to allocate
> > + * a sw_prod index that equals the cons index, so we
> > + * need to clear the cons entry now.
> > + */
> > + mapping = cons_rx_buf->mapping;
> > + netmem = cons_rx_buf->netmem;
> > + cons_rx_buf->netmem = 0;
> > +
> > + if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_ATOMIC) != 0) {
> > + skb->len -= frag_len;
> > + skb->data_len -= frag_len;
> > + skb->truesize -= BNXT_RX_PAGE_SIZE;
> > + --skb_shinfo(skb)->nr_frags;
> > + cons_rx_buf->netmem = netmem;
> > +
> > + /* Update prod since possibly some pages have been
> > + * allocated already.
> > + */
> > + rxr->rx_agg_prod = prod;
> > + bnxt_reuse_rx_agg_bufs(cpr, idx, i, agg_bufs - i, tpa);
> > + return 0;
> > + }
> > +
> > + dma_sync_single_for_cpu(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
> > + bp->rx_dir);
> > +
>
> You should probably use page_pool_dma_sync_for_cpu. I'm merging a
> change to make that function skip dma-syncing for net_iov:
>
> https://lore.kernel.org/netdev/20241029204541.1301203-5-almasrymina@google.com/
>
> Which is necessary following Jason Gunthorpe's guidance.
Okay, no problem.
I will wait for that change to be merged, then use the new helper and
send a v5 patch.
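For reference, the call in the agg loop would then become something
like this (a sketch assuming your series lands with a netmem-aware
variant of the helper that skips the dma-sync for unreadable net_iov
memory, as the linked patch does):

-		dma_sync_single_for_cpu(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-					bp->rx_dir);
+		page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem,
+						  cons_rx_buf->offset,
+						  BNXT_RX_PAGE_SIZE);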
>
> > + total_frag_len += frag_len;
> > + prod = NEXT_RX_AGG(prod);
> > + }
> > + rxr->rx_agg_prod = prod;
> > + return total_frag_len;
> > +}
> > +
> > +static u32 __bnxt_rx_agg_pages_xdp(struct bnxt *bp,
> > + struct bnxt_cp_ring_info *cpr,
> > + struct skb_shared_info *shinfo,
> > + u16 idx, u32 agg_bufs, bool tpa,
> > + struct xdp_buff *xdp)
> > +{
> > + struct bnxt_napi *bnapi = cpr->bnapi;
> > + struct pci_dev *pdev = bp->pdev;
> > + struct bnxt_rx_ring_info *rxr;
> > + u32 i, total_frag_len = 0;
> > + bool p5_tpa = false;
> > + u16 prod;
> > +
> > + rxr = bnapi->rx_ring;
> > + prod = rxr->rx_agg_prod;
> > +
> > + if ((bp->flags & BNXT_FLAG_CHIP_P5_PLUS) && tpa)
> > + p5_tpa = true;
> > +
> > + for (i = 0; i < agg_bufs; i++) {
> > struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> > - struct page *page;
> > + skb_frag_t *frag = &shinfo->frags[i];
> > + struct rx_agg_cmp *agg;
> > + u16 cons, frag_len;
> > dma_addr_t mapping;
> > + netmem_ref netmem;
> >
> > if (p5_tpa)
> > agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, i);
> > @@ -1223,9 +1315,10 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
> >
> > cons_rx_buf = &rxr->rx_agg_ring[cons];
> > - skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> > - cons_rx_buf->offset, frag_len);
> > + skb_frag_fill_netmem_desc(frag, cons_rx_buf->netmem,
> > + cons_rx_buf->offset, frag_len);
> > shinfo->nr_frags = i + 1;
> > +
> > __clear_bit(cons, rxr->rx_agg_bmap);
> >
> > /* It is possible for bnxt_alloc_rx_page() to allocate
> > @@ -1233,15 +1326,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > * need to clear the cons entry now.
> > */
> > mapping = cons_rx_buf->mapping;
> > - page = cons_rx_buf->page;
> > - cons_rx_buf->page = NULL;
> > + netmem = cons_rx_buf->netmem;
> > + cons_rx_buf->netmem = 0;
> >
> > - if (xdp && page_is_pfmemalloc(page))
> > + if (netmem_is_pfmemalloc(netmem))
> > xdp_buff_set_frag_pfmemalloc(xdp);
> >
> > - if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_ATOMIC) != 0) {
> > + if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_ATOMIC) != 0) {
> > --shinfo->nr_frags;
> > - cons_rx_buf->page = page;
> > + cons_rx_buf->netmem = netmem;
> >
> > /* Update prod since possibly some pages have been
> > * allocated already.
> > @@ -1266,20 +1359,12 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
> > struct sk_buff *skb, u16 idx,
> > u32 agg_bufs, bool tpa)
> > {
> > - struct skb_shared_info *shinfo = skb_shinfo(skb);
> > - u32 total_frag_len = 0;
> > -
> > - total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo, idx,
> > - agg_bufs, tpa, NULL);
> > - if (!total_frag_len) {
> > + if (!__bnxt_rx_agg_pages_skb(bp, cpr, skb, idx, agg_bufs, tpa)) {
> > skb_mark_for_recycle(skb);
> > dev_kfree_skb(skb);
> > return NULL;
> > }
> >
> > - skb->data_len += total_frag_len;
> > - skb->len += total_frag_len;
> > - skb->truesize += BNXT_RX_PAGE_SIZE * agg_bufs;
> > return skb;
> > }
> >
> > @@ -1294,8 +1379,8 @@ static u32 bnxt_rx_agg_pages_xdp(struct bnxt *bp,
> > if (!xdp_buff_has_frags(xdp))
> > shinfo->nr_frags = 0;
> >
> > - total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo,
> > - idx, agg_bufs, tpa, xdp);
> > + total_frag_len = __bnxt_rx_agg_pages_xdp(bp, cpr, shinfo,
> > + idx, agg_bufs, tpa, xdp);
> > if (total_frag_len) {
> > xdp_buff_set_frags_flag(xdp);
> > shinfo->nr_frags = agg_bufs;
> > @@ -3341,15 +3426,15 @@ static void bnxt_free_one_rx_agg_ring(struct bnxt *bp, struct bnxt_rx_ring_info
> >
> > for (i = 0; i < max_idx; i++) {
> > struct bnxt_sw_rx_agg_bd *rx_agg_buf = &rxr->rx_agg_ring[i];
> > - struct page *page = rx_agg_buf->page;
> > + netmem_ref netmem = rx_agg_buf->netmem;
> >
> > - if (!page)
> > + if (!netmem)
> > continue;
> >
> > - rx_agg_buf->page = NULL;
> > + rx_agg_buf->netmem = 0;
> > __clear_bit(i, rxr->rx_agg_bmap);
> >
> > - page_pool_recycle_direct(rxr->page_pool, page);
> > + page_pool_recycle_direct_netmem(rxr->page_pool, netmem);
> > }
> > }
> >
> > @@ -3620,7 +3705,10 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> > pp.dev = &bp->pdev->dev;
> > pp.dma_dir = bp->rx_dir;
> > pp.max_len = PAGE_SIZE;
> > - pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > + pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV |
> > + PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>
> PP_FLAG_ALLOW_UNREADABLE_NETMEM should only be set when the driver can
> handle unreadable netmem. I.e. when hds_thresh==0 and
> tcp-data-split==1.
Okay, I will add a condition for that.
Thanks a lot!
Taehee Yoo
>
>
> --
> Thanks,
> Mina