Message-ID: <CAMArcTXqC+OjO_kEhP_+N5y6N9ayyfi3AF-bE2kD98mRAySGcA@mail.gmail.com>
Date: Thu, 10 Apr 2025 16:35:38 +0900
From: Taehee Yoo <ap420073@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
edumazet@...gle.com, andrew+netdev@...n.ch, horms@...nel.org,
michael.chan@...adcom.com, pavan.chebbi@...adcom.com, hawk@...nel.org,
ilias.apalodimas@...aro.org, netdev@...r.kernel.org, dw@...idwei.uk,
kuniyu@...zon.com, sdf@...ichev.me, ahmed.zaki@...el.com,
aleksander.lobakin@...el.com
Subject: Re: [PATCH net-next] eth: bnxt: add support rx side device memory TCP
On Thu, Apr 10, 2025 at 1:49 PM Mina Almasry <almasrymina@...gle.com> wrote:
>
Hi Mina,
Thanks a lot for your review!
> On Mon, Apr 7, 2025 at 9:36 PM Taehee Yoo <ap420073@...il.com> wrote:
> >
> > The bnxt_en driver currently satisfies the requirement of device
> > memory TCP, which is header-data split (HDS).
> > So, implement rx-side device memory TCP for the bnxt_en driver.
> > It only requires converting the page API to the netmem API.
> > The `struct page` of the agg rings is changed to `netmem_ref netmem`,
> > and the corresponding functions are converted to their netmem API
> > variants.
> >
> > It also passes the PP_FLAG_ALLOW_UNREADABLE_NETMEM flag in the
> > page_pool parameters.
> > Unreadable netmem will be activated only when a user requests devmem
> > TCP.
> >
> > When netmem is activated, received data is unreadable; when netmem is
> > disabled, received data is readable.
> > But drivers don't need to handle the two cases separately, because the
> > netmem core API handles both properly.
> > So, using the proper netmem API is enough for drivers.
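To make the "proper netmem API is enough" point concrete: the rx-agg
allocation boils down to the pattern below. This is only a sketch using
the helpers this patch already relies on, with error handling trimmed;
the same path works whether the pool hands back page-backed or
dma-buf-backed (unreadable) memory:

	netmem_ref netmem;
	dma_addr_t mapping;

	netmem = page_pool_alloc_netmems(rxr->page_pool, GFP_ATOMIC);
	if (!netmem)
		return -ENOMEM;

	/* The DMA address is valid in both cases; the payload must
	 * never be dereferenced through a struct page pointer.
	 */
	mapping = page_pool_get_dma_addr_netmem(netmem);
	rxbd->rx_bd_haddr = cpu_to_le64(mapping);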
> >
> > Device memory TCP can be tested with
> > tools/testing/selftests/drivers/net/hw/ncdevmem.
> > This is tested with BCM57504-N425G and firmware version 232.0.155.8/pkg
> > 232.1.132.8.
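As a side note on reproducing this: from memory, ncdevmem is invoked
roughly as below, but the flags have changed across kernel versions, so
please check the usage comment at the top of ncdevmem.c first:

	# server side, on the device under test
	./ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201

	# client side, plain TCP traffic is enough
	echo -n "hello world" | nc <server IP> 5201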
> >
> > Signed-off-by: Taehee Yoo <ap420073@...il.com>
> > ---
> >
> > RFC -> PATCH v1:
> > - Drop the ring buffer descriptor refactoring patch.
> > - Do not convert the normal ring (non-agg ring) to the netmem API.
> > - Remove the changes of napi_{enable | disable}() to
> >   napi_{enable | disable}_locked().
> > - Relocate need_head_pool in struct bnxt_rx_ring_info to fill an
> >   alignment hole.
> > - Remove the *offset parameter of __bnxt_alloc_rx_netmem();
> >   *offset is always set to 0 in this function, so it's unnecessary.
> > - Get skb_shared_info outside of the loop in __bnxt_rx_agg_netmems().
> > - Drop the Tested-by tag due to the changes in this patch.
> >
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 201 +++++++++++++---------
> > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 3 +-
> > include/linux/netdevice.h | 1 +
> > include/net/page_pool/helpers.h | 6 +
> > net/core/dev.c | 6 +
> > 5 files changed, 137 insertions(+), 80 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index 28ee12186c37..eb36646d2f8b 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -893,9 +893,9 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
> > bnapi->events &= ~BNXT_TX_CMP_EVENT;
> > }
> >
> > -static bool bnxt_separate_head_pool(void)
> > +static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
> > {
> > - return PAGE_SIZE > BNXT_RX_PAGE_SIZE;
> > + return rxr->need_head_pool || PAGE_SIZE > BNXT_RX_PAGE_SIZE;
> > }
> >
> > static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> > @@ -919,6 +919,20 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> > return page;
> > }
> >
> > +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> > + struct bnxt_rx_ring_info *rxr,
> > + gfp_t gfp)
> > +{
> > + netmem_ref netmem;
> > +
> > + netmem = page_pool_alloc_netmems(rxr->page_pool, gfp);
> > + if (!netmem)
> > + return 0;
> > +
> > + *mapping = page_pool_get_dma_addr_netmem(netmem);
> > + return netmem;
> > +}
> > +
> > static inline u8 *__bnxt_alloc_rx_frag(struct bnxt *bp, dma_addr_t *mapping,
> > struct bnxt_rx_ring_info *rxr,
> > gfp_t gfp)
> > @@ -999,21 +1013,20 @@ static inline u16 bnxt_find_next_agg_idx(struct bnxt_rx_ring_info *rxr, u16 idx)
> > return next;
> > }
> >
> > -static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> > - struct bnxt_rx_ring_info *rxr,
> > - u16 prod, gfp_t gfp)
> > +static inline int bnxt_alloc_rx_netmem(struct bnxt *bp,
> > + struct bnxt_rx_ring_info *rxr,
> > + u16 prod, gfp_t gfp)
> > {
> > struct rx_bd *rxbd =
> > &rxr->rx_agg_desc_ring[RX_AGG_RING(bp, prod)][RX_IDX(prod)];
> > struct bnxt_sw_rx_agg_bd *rx_agg_buf;
> > - struct page *page;
> > - dma_addr_t mapping;
> > u16 sw_prod = rxr->rx_sw_agg_prod;
> > unsigned int offset = 0;
> > + dma_addr_t mapping;
> > + netmem_ref netmem;
> >
> > - page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp);
> > -
> > - if (!page)
> > + netmem = __bnxt_alloc_rx_netmem(bp, &mapping, rxr, gfp);
> > + if (!netmem)
> > return -ENOMEM;
> >
> > if (unlikely(test_bit(sw_prod, rxr->rx_agg_bmap)))
> > @@ -1023,7 +1036,7 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> > rx_agg_buf = &rxr->rx_agg_ring[sw_prod];
> > rxr->rx_sw_agg_prod = RING_RX_AGG(bp, NEXT_RX_AGG(sw_prod));
> >
> > - rx_agg_buf->page = page;
> > + rx_agg_buf->netmem = netmem;
> > rx_agg_buf->offset = offset;
> > rx_agg_buf->mapping = mapping;
> > rxbd->rx_bd_haddr = cpu_to_le64(mapping);
> > @@ -1067,11 +1080,11 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
> > p5_tpa = true;
> >
> > for (i = 0; i < agg_bufs; i++) {
> > - u16 cons;
> > - struct rx_agg_cmp *agg;
> > struct bnxt_sw_rx_agg_bd *cons_rx_buf, *prod_rx_buf;
> > + struct rx_agg_cmp *agg;
> > struct rx_bd *prod_bd;
> > - struct page *page;
> > + netmem_ref netmem;
> > + u16 cons;
> >
> > if (p5_tpa)
> > agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, start + i);
> > @@ -1088,11 +1101,11 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
> > cons_rx_buf = &rxr->rx_agg_ring[cons];
> >
> > /* It is possible for sw_prod to be equal to cons, so
> > - * set cons_rx_buf->page to NULL first.
> > + * set cons_rx_buf->netmem to 0 first.
> > */
> > - page = cons_rx_buf->page;
> > - cons_rx_buf->page = NULL;
> > - prod_rx_buf->page = page;
> > + netmem = cons_rx_buf->netmem;
> > + cons_rx_buf->netmem = 0;
> > + prod_rx_buf->netmem = netmem;
> > prod_rx_buf->offset = cons_rx_buf->offset;
> >
> > prod_rx_buf->mapping = cons_rx_buf->mapping;
> > @@ -1218,29 +1231,36 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
> > return skb;
> > }
> >
> > -static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > - struct bnxt_cp_ring_info *cpr,
> > - struct skb_shared_info *shinfo,
> > - u16 idx, u32 agg_bufs, bool tpa,
> > - struct xdp_buff *xdp)
> > +static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
> > + struct bnxt_cp_ring_info *cpr,
> > + u16 idx, u32 agg_bufs, bool tpa,
> > + struct sk_buff *skb,
> > + struct xdp_buff *xdp)
> > {
> > struct bnxt_napi *bnapi = cpr->bnapi;
> > - struct pci_dev *pdev = bp->pdev;
> > - struct bnxt_rx_ring_info *rxr = bnapi->rx_ring;
> > - u16 prod = rxr->rx_agg_prod;
> > + struct skb_shared_info *shinfo;
> > + struct bnxt_rx_ring_info *rxr;
> > u32 i, total_frag_len = 0;
> > bool p5_tpa = false;
> > + u16 prod;
> > +
> > + rxr = bnapi->rx_ring;
> > + prod = rxr->rx_agg_prod;
> >
> > if ((bp->flags & BNXT_FLAG_CHIP_P5_PLUS) && tpa)
> > p5_tpa = true;
> >
> > + if (skb)
> > + shinfo = skb_shinfo(skb);
> > + else
> > + shinfo = xdp_get_shared_info_from_buff(xdp);
> > +
> > for (i = 0; i < agg_bufs; i++) {
> > - skb_frag_t *frag = &shinfo->frags[i];
> > - u16 cons, frag_len;
> > - struct rx_agg_cmp *agg;
> > struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> > - struct page *page;
> > + struct rx_agg_cmp *agg;
> > + u16 cons, frag_len;
> > dma_addr_t mapping;
> > + netmem_ref netmem;
> >
> > if (p5_tpa)
> > agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, i);
> > @@ -1251,27 +1271,42 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
> >
> > cons_rx_buf = &rxr->rx_agg_ring[cons];
> > - skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> > - cons_rx_buf->offset, frag_len);
> > - shinfo->nr_frags = i + 1;
> > + if (skb) {
> > + skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> > + cons_rx_buf->offset,
> > + frag_len, BNXT_RX_PAGE_SIZE);
> > + } else {
> > + skb_frag_t *frag = &shinfo->frags[i];
> > +
> > + skb_frag_fill_netmem_desc(frag, cons_rx_buf->netmem,
> > + cons_rx_buf->offset,
> > + frag_len);
> > + shinfo->nr_frags = i + 1;
> > + }
> > __clear_bit(cons, rxr->rx_agg_bmap);
> >
> > - /* It is possible for bnxt_alloc_rx_page() to allocate
> > + /* It is possible for bnxt_alloc_rx_netmem() to allocate
> > * a sw_prod index that equals the cons index, so we
> > * need to clear the cons entry now.
> > */
> > mapping = cons_rx_buf->mapping;
> > - page = cons_rx_buf->page;
> > - cons_rx_buf->page = NULL;
> > + netmem = cons_rx_buf->netmem;
> > + cons_rx_buf->netmem = 0;
> >
> > - if (xdp && page_is_pfmemalloc(page))
> > + if (xdp && netmem_is_pfmemalloc(netmem))
> > xdp_buff_set_frag_pfmemalloc(xdp);
> >
> > - if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_ATOMIC) != 0) {
> > + if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_ATOMIC) != 0) {
> > + if (skb) {
> > + skb->len -= frag_len;
> > + skb->data_len -= frag_len;
> > + skb->truesize -= BNXT_RX_PAGE_SIZE;
> > + }
> > +
> > --shinfo->nr_frags;
> > - cons_rx_buf->page = page;
> > + cons_rx_buf->netmem = netmem;
> >
> > - /* Update prod since possibly some pages have been
> > + /* Update prod since possibly some netmems have been
> > * allocated already.
> > */
> > rxr->rx_agg_prod = prod;
> > @@ -1279,8 +1314,8 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > return 0;
> > }
> >
> > - dma_sync_single_for_cpu(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
> > - bp->rx_dir);
> > + page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
> > + BNXT_RX_PAGE_SIZE);
> >
> > total_frag_len += frag_len;
> > prod = NEXT_RX_AGG(prod);
> > @@ -1289,32 +1324,28 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > return total_frag_len;
> > }
> >
> > -static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
> > - struct bnxt_cp_ring_info *cpr,
> > - struct sk_buff *skb, u16 idx,
> > - u32 agg_bufs, bool tpa)
> > +static struct sk_buff *bnxt_rx_agg_netmems_skb(struct bnxt *bp,
> > + struct bnxt_cp_ring_info *cpr,
> > + struct sk_buff *skb, u16 idx,
> > + u32 agg_bufs, bool tpa)
> > {
> > - struct skb_shared_info *shinfo = skb_shinfo(skb);
> > u32 total_frag_len = 0;
> >
> > - total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo, idx,
> > - agg_bufs, tpa, NULL);
> > + total_frag_len = __bnxt_rx_agg_netmems(bp, cpr, idx, agg_bufs, tpa,
> > + skb, NULL);
> > if (!total_frag_len) {
> > skb_mark_for_recycle(skb);
> > dev_kfree_skb(skb);
> > return NULL;
> > }
> >
> > - skb->data_len += total_frag_len;
> > - skb->len += total_frag_len;
> > - skb->truesize += BNXT_RX_PAGE_SIZE * agg_bufs;
> > return skb;
> > }
> >
> > -static u32 bnxt_rx_agg_pages_xdp(struct bnxt *bp,
> > - struct bnxt_cp_ring_info *cpr,
> > - struct xdp_buff *xdp, u16 idx,
> > - u32 agg_bufs, bool tpa)
> > +static u32 bnxt_rx_agg_netmems_xdp(struct bnxt *bp,
> > + struct bnxt_cp_ring_info *cpr,
> > + struct xdp_buff *xdp, u16 idx,
> > + u32 agg_bufs, bool tpa)
> > {
> > struct skb_shared_info *shinfo = xdp_get_shared_info_from_buff(xdp);
> > u32 total_frag_len = 0;
> > @@ -1322,8 +1353,8 @@ static u32 bnxt_rx_agg_pages_xdp(struct bnxt *bp,
> > if (!xdp_buff_has_frags(xdp))
> > shinfo->nr_frags = 0;
> >
> > - total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo,
> > - idx, agg_bufs, tpa, xdp);
> > + total_frag_len = __bnxt_rx_agg_netmems(bp, cpr, idx, agg_bufs, tpa,
> > + NULL, xdp);
> > if (total_frag_len) {
> > xdp_buff_set_frags_flag(xdp);
> > shinfo->nr_frags = agg_bufs;
> > @@ -1895,7 +1926,8 @@ static inline struct sk_buff *bnxt_tpa_end(struct bnxt *bp,
> > }
> >
> > if (agg_bufs) {
> > - skb = bnxt_rx_agg_pages_skb(bp, cpr, skb, idx, agg_bufs, true);
> > + skb = bnxt_rx_agg_netmems_skb(bp, cpr, skb, idx, agg_bufs,
> > + true);
> > if (!skb) {
> > /* Page reuse already handled by bnxt_rx_pages(). */
> > cpr->sw_stats->rx.rx_oom_discards += 1;
> > @@ -2175,9 +2207,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
> > if (bnxt_xdp_attached(bp, rxr)) {
> > bnxt_xdp_buff_init(bp, rxr, cons, data_ptr, len, &xdp);
> > if (agg_bufs) {
> > - u32 frag_len = bnxt_rx_agg_pages_xdp(bp, cpr, &xdp,
> > - cp_cons, agg_bufs,
> > - false);
> > + u32 frag_len = bnxt_rx_agg_netmems_xdp(bp, cpr, &xdp,
> > + cp_cons,
> > + agg_bufs,
> > + false);
> > if (!frag_len)
> > goto oom_next_rx;
> >
> > @@ -2229,7 +2262,8 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
> >
> > if (agg_bufs) {
> > if (!xdp_active) {
> > - skb = bnxt_rx_agg_pages_skb(bp, cpr, skb, cp_cons, agg_bufs, false);
> > + skb = bnxt_rx_agg_netmems_skb(bp, cpr, skb, cp_cons,
> > + agg_bufs, false);
> > if (!skb)
> > goto oom_next_rx;
> > } else {
> > @@ -3445,15 +3479,15 @@ static void bnxt_free_one_rx_agg_ring(struct bnxt *bp, struct bnxt_rx_ring_info
> >
> > for (i = 0; i < max_idx; i++) {
> > struct bnxt_sw_rx_agg_bd *rx_agg_buf = &rxr->rx_agg_ring[i];
> > - struct page *page = rx_agg_buf->page;
> > + netmem_ref netmem = rx_agg_buf->netmem;
> >
> > - if (!page)
> > + if (!netmem)
> > continue;
> >
> > - rx_agg_buf->page = NULL;
> > + rx_agg_buf->netmem = 0;
> > __clear_bit(i, rxr->rx_agg_bmap);
> >
> > - page_pool_recycle_direct(rxr->page_pool, page);
> > + page_pool_recycle_direct_netmem(rxr->page_pool, netmem);
> > }
> > }
> >
> > @@ -3746,7 +3780,7 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> > xdp_rxq_info_unreg(&rxr->xdp_rxq);
> >
> > page_pool_destroy(rxr->page_pool);
> > - if (bnxt_separate_head_pool())
> > + if (bnxt_separate_head_pool(rxr))
> > page_pool_destroy(rxr->head_pool);
> > rxr->page_pool = rxr->head_pool = NULL;
> >
> > @@ -3777,15 +3811,20 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> > pp.dev = &bp->pdev->dev;
> > pp.dma_dir = bp->rx_dir;
> > pp.max_len = PAGE_SIZE;
> > - pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > + pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV |
> > + PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>
> FWIW I had expected drivers to only set
> PP_FLAG_ALLOW_UNREADABLE_NETMEM if the driver is capable of handling
> unreadable netmem in this configuration, i.e. header split is turned
> on and headersplit threshold is 0, and I think we're planning to do
> that for GVE.
>
> I know that there is a core check on binding for this, but in my
> experience some of these settings may get reset on driver resets? And
> core could miss a check here and there. Checking here on page_pool
> create seems like a straightforward way to prevent some bugs, although
> it could be seen as a defensive check.
So, you mean that netmem is already set up, and then something like a
driver reset changes the configuration and the driver re-enters its
initialization logic.
If so, the configuration requirement for devmem TCP would no longer be
satisfied. I'm not sure, but I think this scenario may be a bug:
the core should receive that signal from the device and handle it
properly.
As you mentioned, mp_ops->init() would be a better place for this check
if it is required. It is called by page_pool_create().
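A rough sketch of what I mean (not tested, and the field names are from
memory, so they may differ): mp_dmabuf_devmem_init() in
net/core/devmem.c already validates the pool it is being attached to,
so a hypothetical driver-capability check could sit next to the
existing ones:

	static int mp_dmabuf_devmem_init(struct page_pool *pool)
	{
		struct net_devmem_dmabuf_binding *binding = pool->mp_priv;

		if (!binding)
			return -EINVAL;

		/* Hypothetical extra check: refuse the binding when the
		 * driver did not declare unreadable-netmem support for
		 * this pool.
		 */
		if (!(pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM))
			return -EOPNOTSUPP;

		/* rest of the existing init unchanged */
		...
	}

With that in place, a driver reset that re-creates the page_pool
without the flag would fail at pool creation instead of silently
feeding readable pages to a devmem consumer.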
>
> But this is fine too, no strong feelings here:
>
> Reviewed-by: Mina Almasry <almasrymina@...gle.com>
Thanks a lot!
Taehee Yoo