Message-ID: <CAKgT0UeRWsJ+NiniSKa7Z3Law=QrYZp3giLAigJf7EvuAbjkRA@mail.gmail.com>
Date: Fri, 12 Apr 2024 08:05:07 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: netdev@...r.kernel.org, Alexander Duyck <alexanderduyck@...com>, kuba@...nel.org, 
	davem@...emloft.net, pabeni@...hat.com
Subject: Re: [net-next PATCH 13/15] eth: fbnic: add basic Rx handling

On Fri, Apr 12, 2024 at 1:43 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2024/4/10 23:03, Alexander Duyck wrote:
> > On Wed, Apr 10, 2024 at 4:54 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>
> >> On 2024/4/9 23:08, Alexander Duyck wrote:
> >>> On Tue, Apr 9, 2024 at 4:47 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>>>
> >>>> On 2024/4/4 4:09, Alexander Duyck wrote:
> >>>>> From: Alexander Duyck <alexanderduyck@...com>
> >>>
> >>> [...]
> >>>
> >>>>> +     /* Unmap and free processed buffers */
> >>>>> +     if (head0 >= 0)
> >>>>> +             fbnic_clean_bdq(nv, budget, &qt->sub0, head0);
> >>>>> +     fbnic_fill_bdq(nv, &qt->sub0);
> >>>>> +
> >>>>> +     if (head1 >= 0)
> >>>>> +             fbnic_clean_bdq(nv, budget, &qt->sub1, head1);
> >>>>> +     fbnic_fill_bdq(nv, &qt->sub1);
> >>>>
> >>>> I am not sure how complicated the rx handling will be for the advanced
> >>>> features. In the current code, each entry/desc in both qt->sub0 and
> >>>> qt->sub1 needs at least one page, and the page seems to be used only
> >>>> once, no matter how little of the page is actually used?
> >>>>
> >>>> I am assuming you want a 'tightly optimized' operation here by calling
> >>>> page_pool_fragment_page(), but manipulating page->pp_ref_count directly
> >>>> does not seem to add any value for the current code, while it seems to
> >>>> waste a lot of memory by not using the frag API, especially when
> >>>> PAGE_SIZE > 4K?
> >>>
> >>> On this hardware both the header and payload buffers are fragmentable.
> >>> The hardware decides the partitioning and we just follow it. So for
> >>> example it wouldn't be uncommon to have a jumbo frame split up such
> >>> that the header is less than 128B plus SKB overhead while the actual
> >>> data in the payload is just over 1400. So for us fragmenting the pages
> >>> is a very likely case especially with smaller packets.
> >>
> >> I understand that is what you are trying to do, but the code above does
> >> not seem to match the description: fbnic_clean_bdq() and fbnic_fill_bdq()
> >> are called for qt->sub0 and qt->sub1, so the just-cleaned old pages of
> >> qt->sub0 and qt->sub1 are drained and each sub is refilled with new
> >> pages, which does not seem to involve any fragmenting?
> >
> > That is because it is all taken care of by the completion queue. Take
> > a look in fbnic_pkt_prepare. We are taking the buffer from the header
> > descriptor and taking a slice out of it there via fbnic_page_pool_get.
> > Basically we store the fragment count locally in the rx_buf and then
> > subtract what is leftover when the device is done with it.
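
Roughly, the pattern is the following (a minimal editorial sketch with
hypothetical names, not the actual fbnic code, assuming the
page_pool_fragment_page() / page_pool_unref_page() /
page_pool_put_unrefed_page() helpers available in recent kernels):

#include <net/page_pool/helpers.h>

struct rx_buf_sketch {
	struct page *page;
	long pagecnt_bias;	/* slices not yet handed to the stack */
};

#define BIAS_MAX	USHRT_MAX

static void rx_buf_init(struct rx_buf_sketch *buf, struct page *page)
{
	/* Take BIAS_MAX pp fragment references up front in one call */
	page_pool_fragment_page(page, BIAS_MAX);
	buf->page = page;
	buf->pagecnt_bias = BIAS_MAX;
}

static struct page *rx_buf_get(struct rx_buf_sketch *buf)
{
	/* Each slice handed to the stack consumes one unit of bias */
	buf->pagecnt_bias--;
	return buf->page;
}

static void rx_buf_drain(struct page_pool *pp, struct rx_buf_sketch *buf)
{
	/* Subtract the leftover bias; the page returns to the pool once
	 * the slices handed to the stack have been released as well.
	 */
	if (!page_pool_unref_page(buf->page, buf->pagecnt_bias))
		page_pool_put_unrefed_page(pp, buf->page, -1, false);
}
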
>
> The above looks a lot like the prepare/commit API in [1]: the prepare is
> done in fbnic_fill_bdq() and the commit is done by fbnic_page_pool_get()
> in fbnic_pkt_prepare() and fbnic_add_rx_frag().
>
> If page_pool were able to provide a central place for the pagecnt_bias of
> all the fragments of the same page, we might provide a similar
> prepare/commit API for the frag API; I am not sure how to handle that for
> now.
>
> From the macro below, this hw seems to be able to address only 4K of memory
> for each entry/desc in qt->sub0 and qt->sub1, so a lot of memory seems to go
> unused when PAGE_SIZE > 4K, as memory is allocated at page granularity for
> each rx_buf in qt->sub0 and qt->sub1.
>
> +#define FBNIC_RCD_AL_BUFF_OFF_MASK             DESC_GENMASK(43, 32)

The advantage of being a purpose-built driver is that we aren't
running on any architectures where PAGE_SIZE > 4K. If it came to
that, we could probably look at splitting the pages within the
descriptors by simply having a single page span multiple descriptors.
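
Something like this, for example (a hypothetical sketch; bdq_sketch,
post_rx_desc() and page_to_dma() are made-up stand-ins, not fbnic
functions):

#include <linux/sizes.h>

#define HW_BUF_SIZE	SZ_4K	/* 12-bit offset field => 4K per buffer */

static void fill_bdq_slices(struct bdq_sketch *bdq, struct page *page)
{
	unsigned int off;

	/* Post one 4K hardware buffer per slice of the larger page */
	for (off = 0; off < PAGE_SIZE; off += HW_BUF_SIZE)
		post_rx_desc(bdq, page_to_dma(page) + off, HW_BUF_SIZE);
}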

> It is still possible to reserve enough pagecnt_bias for each fragment, so
> that the caller can do its own sub-fragmenting at fragment granularity, as
> we seem to have enough pagecnt_bias for each page.
>
> If we provided a proper frag API that reserves enough pagecnt_bias for the
> caller to do its own fragmenting, then the memory waste could be avoided
> for this hw on systems with PAGE_SIZE > 4K.
>
> 1. https://lore.kernel.org/lkml/20240407130850.19625-10-linyunsheng@huawei.com/

That isn't a concern for us as we are only using the device on x86
systems at this time.

> >
> >> The fragmenting can only happen when a stream of small packets comes in
> >> from the wire, so that the hw can report the same pg_id with different
> >> pg_offsets for different packets before fbnic_clean_bdq() and
> >> fbnic_fill_bdq() are called? I am not sure how that is ensured,
> >> considering that we might break out of the while loop in
> >> fbnic_clean_rcq() because of the 'packets < budget' check.
> >
> > We don't free the page until we have moved one past it, or the
> > hardware has indicated it will take no more slices via a PAGE_FIN bit
> > in the descriptor.
>
>
> I looked more closely at it, but I have not been able to figure out how
> that is done yet, as the PAGE_FIN bit mentioned above seems to be used only
> to calculate hdr_pg_end and truesize in fbnic_pkt_prepare() and
> fbnic_add_rx_frag().
>
> In the flow below in fbnic_clean_rcq(), fbnic_clean_bdq() will be called to
> drain the page in the rx_buf just cleaned when head0/head1 >= 0, so I am
> not sure how the fragmenting is done yet; am I missing something obvious
> here?
>
>         while (likely(packets < budget)) {
>                 switch (FIELD_GET(FBNIC_RCD_TYPE_MASK, rcd)) {
>                 case FBNIC_RCD_TYPE_HDR_AL:
>                         head0 = FIELD_GET(FBNIC_RCD_AL_BUFF_ID_MASK, rcd);
>                         fbnic_pkt_prepare(nv, rcd, pkt, qt);
>
>                         break;
>                 case FBNIC_RCD_TYPE_PAY_AL:
>                         head1 = FIELD_GET(FBNIC_RCD_AL_BUFF_ID_MASK, rcd);
>                         fbnic_add_rx_frag(nv, rcd, pkt, qt);
>
>                         break;
>
>                 case FBNIC_RCD_TYPE_META:
>                         if (likely(!fbnic_rcd_metadata_err(rcd)))
>                                 skb = fbnic_build_skb(nv, pkt);
>
>                         /* populate skb and invalidate XDP */
>                         if (!IS_ERR_OR_NULL(skb)) {
>                                 fbnic_populate_skb_fields(nv, rcd, skb, qt);
>
>                                 packets++;
>
>                                 napi_gro_receive(&nv->napi, skb);
>                         }
>
>                         pkt->buff.data_hard_start = NULL;
>
>                         break;
>                 }
>
>         /* Unmap and free processed buffers */
>         if (head0 >= 0)
>                 fbnic_clean_bdq(nv, budget, &qt->sub0, head0);
>         fbnic_fill_bdq(nv, &qt->sub0);
>
>         if (head1 >= 0)
>                 fbnic_clean_bdq(nv, budget, &qt->sub1, head1);
>         fbnic_fill_bdq(nv, &qt->sub1);
>
>         }

The cleanup logic cleans everything up to but not including the
head0/head1 offsets. So the pages are left on the ring until they are
fully consumed.
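
In other words, something along these lines (a sketch only; ring_sketch
and drain_rx_buf() are hypothetical names, not the driver's code):

/* Clean entries in [tail, head); the entry at head stays on the ring
 * because the hardware may still hand out further slices from it.
 */
static void clean_bdq_sketch(struct ring_sketch *ring, unsigned int head)
{
	while (ring->tail != head) {
		drain_rx_buf(ring, ring->tail);
		ring->tail = (ring->tail + 1) & ring->mask;
	}
}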

> >
> >>> It is better for us to optimize for the small packet scenario than
> >>> optimize for the case where 4K slices are getting taken. That way when
> >>> we are CPU constrained handling small packets we are the most
> >>> optimized whereas for the larger frames we can spare a few cycles to
> >>> account for the extra overhead. The result should be a higher overall
> >>> packets per second.
> >>
> >> The problem is that small packets mean low utilization of the bandwidth,
> >> as more of it is spent sending headers instead of payload that is useful
> >> to the user, so the question seems to be how often small packets are
> >> seen on the wire?
> >
> > Very often. Especially when you are running something like servers
> > where the flow usually consists of an incoming request which is often
> > only a few hundred bytes, followed by us sending a response which then
> > leads to a flow of control frames for it.
>
> I think this depends on the use case; if it is a video streaming server,
> I guess most of the packets are MTU-sized?

For the transmit side, yes. For the server side, no. A typical TCP flow
has two sides to it: one sending SYN/ACK/FIN packets and the initial GET
request, and the other basically sending the response data.
