Date:   Fri, 2 Jun 2023 11:00:07 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Alexander Lobakin <aleksander.lobakin@...el.com>
Cc:     Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        Michal Kubiak <michal.kubiak@...el.com>,
        Larysa Zaremba <larysa.zaremba@...el.com>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Christoph Hellwig <hch@....de>,
        Paul Menzel <pmenzel@...gen.mpg.de>, netdev@...r.kernel.org,
        intel-wired-lan@...ts.osuosl.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v3 09/12] iavf: switch to Page Pool

On Fri, Jun 2, 2023 at 9:31 AM Alexander Lobakin
<aleksander.lobakin@...el.com> wrote:
>
> From: Alexander H Duyck <alexander.duyck@...il.com>
> Date: Wed, 31 May 2023 09:19:06 -0700
>
> > On Tue, 2023-05-30 at 17:00 +0200, Alexander Lobakin wrote:
> >> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
> >> no custom recycling logic and one whole page per frame, it can easily
> >> be switched to using the Page Pool API instead.
>
> [...]
>
> >> @@ -691,8 +690,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
> >>   **/
> >>  void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> >>  {
> >> -    u16 i;
> >> -
> >>      /* ring already cleared, nothing to do */
> >>      if (!rx_ring->rx_pages)
> >>              return;
> >> @@ -703,28 +700,17 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> >>      }
> >>
> >>      /* Free all the Rx ring sk_buffs */
> >> -    for (i = 0; i < rx_ring->count; i++) {
> >> +    for (u32 i = 0; i < rx_ring->count; i++) {
> >
> > Did we make a change to our coding style to allow declaration of
> > variables inside of for statements? Just wondering if this is a change
> > since the recent updates to the ISO C standard, or if this doesn't
> > match up with what we would expect per the coding standard.
>
> It's optional right now; nobody would object to declaring it either
> way. Declaring it inside the loop has been allowed since we switched
> to C11, right. Here I did it because my heart was breaking to see this
> little u16 all alone (and yeah, a u16 on the stack).

Yeah, that was back when I was declaring stack variables with the exact
same size as the ring parameters. So the u16 should match the size of
rx_ring->count, not that it matters. It was just a quirk I had at the
time.
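
For reference, either form compiles fine now that the tree builds with
C11; a minimal illustration, with rx_ring->count taken from the hunk
above:

        u16 i;

        for (i = 0; i < rx_ring->count; i++)
                /* old style: counter declared at the top of the block */;

versus

        for (u32 i = 0; i < rx_ring->count; i++)
                /* new style: counter scoped to the loop itself */;

The second variant is what the hunk uses, and it avoids keeping an
odd-sized u16 live on the stack for the whole function.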

> >
> >>              struct page *page = rx_ring->rx_pages[i];
> >> -            dma_addr_t dma;
> >>
> >>              if (!page)
> >>                      continue;
> >>
> >> -            dma = page_pool_get_dma_addr(page);
> >> -
> >>              /* Invalidate cache lines that may have been written to by
> >>               * device so that we avoid corrupting memory.
> >>               */
> >> -            dma_sync_single_range_for_cpu(rx_ring->dev, dma,
> >> -                                          LIBIE_SKB_HEADROOM,
> >> -                                          LIBIE_RX_BUF_LEN,
> >> -                                          DMA_FROM_DEVICE);
> >> -
> >> -            /* free resources associated with mapping */
> >> -            dma_unmap_page_attrs(rx_ring->dev, dma, LIBIE_RX_TRUESIZE,
> >> -                                 DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> >> -
> >> -            __free_page(page);
> >> +            page_pool_dma_sync_full_for_cpu(rx_ring->pool, page);
> >> +            page_pool_put_full_page(rx_ring->pool, page, false);
> >>      }
> >>
> >>      rx_ring->next_to_clean = 0;
> >> @@ -739,10 +725,15 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> >>   **/
> >>  void iavf_free_rx_resources(struct iavf_ring *rx_ring)
> >>  {
> >> +    struct device *dev = rx_ring->pool->p.dev;
> >> +
> >>      iavf_clean_rx_ring(rx_ring);
> >>      kfree(rx_ring->rx_pages);
> >>      rx_ring->rx_pages = NULL;
> >>
> >> +    page_pool_destroy(rx_ring->pool);
> >> +    rx_ring->dev = dev;
> >> +
> >>      if (rx_ring->desc) {
> >>              dma_free_coherent(rx_ring->dev, rx_ring->size,
> >>                                rx_ring->desc, rx_ring->dma);
> >
> > Not a fan of this switching back and forth between being a page pool
> > pointer and a dev pointer. It seems problematic, as it is easily
> > misinterpreted. I would say that, at a minimum, stick to it being
> > either page_pool (Rx) or dev (Tx) on a per-ring-type basis.
>
> The problem is that the page_pool's lifetime is from ifup to ifdown,
> while its ring lives longer. So I had to do something about this, but I
> also didn't want to keep two pointers at the same time, since that's
> redundant and adds +8 bytes to the ring for nothing.

It might be better to just go with NULL rather than populating it with
two different possible values. Then at least you know that if it is an
rx_ring the pointer is a page_pool, and if it is a tx_ring it is a dev.
You can set it back to the page pool when you repopulate the rest of
the ring.
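
Roughly something like this (untested sketch, just to illustrate,
reusing the names from your diff):

void iavf_free_rx_resources(struct iavf_ring *rx_ring)
{
        struct device *dev = rx_ring->pool->p.dev;

        iavf_clean_rx_ring(rx_ring);
        kfree(rx_ring->rx_pages);
        rx_ring->rx_pages = NULL;

        page_pool_destroy(rx_ring->pool);
        /* don't leave a stale/ambiguous pointer in the union */
        rx_ring->pool = NULL;

        if (rx_ring->desc) {
                dma_free_coherent(dev, rx_ring->size,
                                  rx_ring->desc, rx_ring->dma);
                rx_ring->desc = NULL;
        }
}

Then setup_rx_descriptors() assigns rx_ring->pool again right after
page_pool_create() succeeds, so on an Rx ring the field is either a
valid pool or NULL, never a struct device.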

> > This setup works for iavf; however, for i40e/ice you may run into
> > issues, since the setup_rx_descriptors call is also used to set up the
> > ethtool loopback test without a napi struct, as I recall, so there may
> > not be a q_vector.
>
> I'll handle that. Somehow :D Thanks for noticing; I'll take a look at
> whether I should do something right now or whether it can wait until
> the mentioned drivers are actually switched.
>
> [...]
>
> >> @@ -240,7 +237,10 @@ struct iavf_rx_queue_stats {
> >>  struct iavf_ring {
> >>      struct iavf_ring *next;         /* pointer to next ring in q_vector */
> >>      void *desc;                     /* Descriptor ring memory */
> >> -    struct device *dev;             /* Used for DMA mapping */
> >> +    union {
> >> +            struct page_pool *pool; /* Used for Rx page management */
> >> +            struct device *dev;     /* Used for DMA mapping on Tx */
> >> +    };
> >>      struct net_device *netdev;      /* netdev ring maps to */
> >>      union {
> >>              struct iavf_tx_buffer *tx_bi;
> >
> > Would it make more sense to have the page pool in the q_vector rather
> > than the ring? Essentially the page pool is associated with the napi
> > instance, so it seems like it would make more sense to store it with
> > the napi struct rather than potentially having multiple instances per
> > napi.
>
> As per the Page Pool design, you should have it per ring. Plus you have
> rxq_info (an XDP-related structure), which is also per-ring and
> participates in recycling in some cases. So I wouldn't complicate it.
> I went down the chain and haven't found any place where having more
> than one PP per NAPI would break anything. If I got it right, Jakub's
> optimization discourages having one PP shared by several NAPIs (or
> scheduling one NAPI on different CPUs), but not the other way around.
> The goal was to exclude concurrent access to one PP from different
> threads, and that's impossible here.

The xdp_rxq can be mapped many:1 to the page pool if I am not mistaken.

The only reason I am a fan of trying to keep the page_pool tightly
associated with the napi instance is that the napi instance is
essentially what guarantees the page_pool stays consistent, since it is
only accessed by that one napi instance.
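
FWIW, what I am picturing is per-ring creation that still records the
owning napi, something like the sketch below (illustration only; the
field names are from the current page_pool_params, while dev and
rx_ring->q_vector->napi are placeholders for wherever iavf actually
pulls those from):

        struct page_pool_params pp = {
                .flags          = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
                .order          = 0,
                .pool_size      = rx_ring->count,
                .nid            = NUMA_NO_NODE,
                .dev            = dev,
                .napi           = &rx_ring->q_vector->napi,
                .dma_dir        = DMA_FROM_DEVICE,
                .max_len        = LIBIE_RX_BUF_LEN,
                .offset         = LIBIE_SKB_HEADROOM,
        };

        rx_ring->pool = page_pool_create(&pp);
        if (IS_ERR(rx_ring->pool))
                return PTR_ERR(rx_ring->pool);

That way, even with one pool per ring, the .napi hint tells the
recycling fast path which context owns the pool (if I read Jakub's
change right); two rings on the same vector would simply register the
same napi pointer.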

> Lemme know. I can always disable the NAPI optimization for cases where
> one vector is shared by several queues -- not a usual case for these
> NICs anyway -- but I haven't found a reason to do that.

I suppose we should be fine if we have a many-to-one mapping, though.
As you said, the issue would be if multiple NAPI instances were
accessing the same page pool.
