Message-ID: <0e5e3196-ca2f-b905-a6ba-7721e8586ed7@huawei.com>
Date: Fri, 12 Apr 2024 16:43:34 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: <netdev@...r.kernel.org>, Alexander Duyck <alexanderduyck@...com>,
	<kuba@...nel.org>, <davem@...emloft.net>, <pabeni@...hat.com>
Subject: Re: [net-next PATCH 13/15] eth: fbnic: add basic Rx handling

On 2024/4/10 23:03, Alexander Duyck wrote:
> On Wed, Apr 10, 2024 at 4:54 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>>
>> On 2024/4/9 23:08, Alexander Duyck wrote:
>>> On Tue, Apr 9, 2024 at 4:47 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>>>>
>>>> On 2024/4/4 4:09, Alexander Duyck wrote:
>>>>> From: Alexander Duyck <alexanderduyck@...com>
>>>
>>> [...]
>>>
>>>>> +     /* Unmap and free processed buffers */
>>>>> +     if (head0 >= 0)
>>>>> +             fbnic_clean_bdq(nv, budget, &qt->sub0, head0);
>>>>> +     fbnic_fill_bdq(nv, &qt->sub0);
>>>>> +
>>>>> +     if (head1 >= 0)
>>>>> +             fbnic_clean_bdq(nv, budget, &qt->sub1, head1);
>>>>> +     fbnic_fill_bdq(nv, &qt->sub1);
>>>>
>>>> I am not sure how complicated the rx handling will be for the advanced
>>>> features. For the current code, each entry/desc in both qt->sub0 and
>>>> qt->sub1 needs at least one page, and the page seems to be used only
>>>> once, no matter how little of the page is actually used?
>>>>
>>>> I am assuming you want to do a 'tightly optimized' operation for this
>>>> by calling page_pool_fragment_page(), but manipulating
>>>> page->pp_ref_count directly does not seem to add any value for the
>>>> current code, and it seems to waste a lot of memory by not using the
>>>> frag API, especially when PAGE_SIZE > 4K?
>>>
>>> On this hardware both the header and payload buffers are fragmentable.
>>> The hardware decides the partitioning and we just follow it. So for
>>> example it wouldn't be uncommon to have a jumbo frame split up such
>>> that the header is less than 128B plus SKB overhead while the actual
>>> data in the payload is just over 1400. So for us fragmenting the pages
>>> is a very likely case especially with smaller packets.
>>
>> I understand that is what you are trying to do, but the code above does
>> not seem to match the description: fbnic_clean_bdq() and fbnic_fill_bdq()
>> are called for both qt->sub0 and qt->sub1, so the old pages of qt->sub0
>> and qt->sub1 that were just cleaned are drained and each sub ring is
>> refilled with new pages, which does not seem to do any fragmenting?
> 
> That is because it is all taken care of by the completion queue. Take
> a look in fbnic_pkt_prepare. We are taking the buffer from the header
> descriptor and taking a slice out of it there via fbnic_page_pool_get.
> Basically we store the fragment count locally in the rx_buf and then
> subtract what is leftover when the device is done with it.

The above looks a lot like the prepare/commit API in [1]: the prepare step
is done in fbnic_fill_bdq() and the commit step is done by
fbnic_page_pool_get() in fbnic_pkt_prepare() and fbnic_add_rx_frag().

If page_pool were able to provide a central place for the pagecnt_bias of
all the fragments of the same page, we might be able to provide a similar
prepare/commit API for the frag API, but I am not sure how to handle that
for now.
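
Roughly something like the untested sketch below, just to show where the
central accounting could live; 'frag_pool' and all of the names are made
up for illustration, not the real struct page_pool:

struct frag_pool {
	struct page *page;	/* current partially-used page */
	unsigned int offset;	/* first free byte in that page */
	long bias;		/* references pre-charged on the page */
};

/* prepare: hand the caller everything left in the current page */
static void *frag_prepare(struct frag_pool *fp, unsigned int *size)
{
	*size = PAGE_SIZE - fp->offset;
	return page_address(fp->page) + fp->offset;
}

/* commit: the caller reports how much it actually consumed; one unit
 * of the pre-charged bias covers the slice just handed out
 */
static void frag_commit(struct frag_pool *fp, unsigned int used)
{
	fp->offset += used;
	fp->bias--;
}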

From the below macro, this hw seems to be able to address only 4K of memory
for each entry/desc in qt->sub0 and qt->sub1 (bits 43..32 form a 12-bit
offset field, so at most 4096 bytes are addressable per buffer), so there
seems to be a lot of memory left unused when PAGE_SIZE > 4K, as memory is
allocated at page granularity for each rx_buf in qt->sub0 and qt->sub1.

+#define FBNIC_RCD_AL_BUFF_OFF_MASK		DESC_GENMASK(43, 32)
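
Assuming DESC_GENMASK follows the usual GENMASK(h, l) convention, a quick
userspace sanity check of the field width:

#include <stdio.h>
#include <stdint.h>

/* same convention as GENMASK_ULL(h, l): bits l..h inclusive */
#define DESC_GENMASK(h, l) \
	((~0ULL >> (63 - (h))) & (~0ULL << (l)))

int main(void)
{
	uint64_t mask = DESC_GENMASK(43, 32);

	/* 12-bit field -> max offset 4095, i.e. 4KiB per buffer */
	printf("width=%d max_off=%llu\n",
	       __builtin_popcountll(mask),
	       (unsigned long long)(mask >> 32));
	return 0;
}

This prints 'width=12 max_off=4095'.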

It should still be possible to reserve enough pagecnt_bias for each
fragment, so that the caller can do its own fragmenting at fragment
granularity, as we seem to have enough pagecnt_bias available per page.

If we provide a proper frag API that reserves enough pagecnt_bias for the
caller to do its own fragmenting, then the memory waste could be avoided
for this hw on systems with PAGE_SIZE > 4K.

1. https://lore.kernel.org/lkml/20240407130850.19625-10-linyunsheng@huawei.com/
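
For example, with the existing frag API something like the below (untested;
fbnic_alloc_rx_frag and the fbnic_rx_buf fields are illustrative, not the
driver's actual code) would let one 64K page back sixteen 4K rx_bufs
instead of one:

#include <linux/sizes.h>
#include <net/page_pool/helpers.h>

static int fbnic_alloc_rx_frag(struct page_pool *pool,
			       struct fbnic_rx_buf *rx_buf)
{
	unsigned int offset;
	struct page *page;

	/* carve a 4K chunk out of whatever PAGE_SIZE is, so one 64K
	 * page can back 16 buffers instead of one
	 */
	page = page_pool_alloc_frag(pool, &offset, SZ_4K, GFP_ATOMIC);
	if (!page)
		return -ENOMEM;

	rx_buf->page = page;
	rx_buf->offset = offset;
	return 0;
}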

> 
>> Fragmenting can only happen when there is a stream of small packets
>> coming in from the wire, so that the hw reports the same pg_id with
>> different pg_offsets before fbnic_clean_bdq() and fbnic_fill_bdq() are
>> called? I am not sure how that is ensured, considering that we might
>> break out of the while loop in fbnic_clean_rcq() because of the
>> 'packets < budget' check.
> 
> We don't free the page until we have moved one past it, or the
> hardware has indicated it will take no more slices via a PAGE_FIN bit
> in the descriptor.

I looked more closely at it, but I am not able to figure out how it is done
yet, as the PAGE_FIN bit mentioned above seems to be used only to calculate
hdr_pg_end and truesize in fbnic_pkt_prepare() and fbnic_add_rx_frag().

For the below flow in fbnic_clean_rcq(), fbnic_clean_bdq() will be called
to drain the page in the rx_buf just cleaned when head0/head1 >= 0, so I am
not sure how it does the fragmenting yet; am I missing something obvious
here?

	while (likely(packets < budget)) {
		switch (FIELD_GET(FBNIC_RCD_TYPE_MASK, rcd)) {
		case FBNIC_RCD_TYPE_HDR_AL:
			head0 = FIELD_GET(FBNIC_RCD_AL_BUFF_ID_MASK, rcd);
			fbnic_pkt_prepare(nv, rcd, pkt, qt);

			break;
		case FBNIC_RCD_TYPE_PAY_AL:
			head1 = FIELD_GET(FBNIC_RCD_AL_BUFF_ID_MASK, rcd);
			fbnic_add_rx_frag(nv, rcd, pkt, qt);

			break;
		case FBNIC_RCD_TYPE_META:
			if (likely(!fbnic_rcd_metadata_err(rcd)))
				skb = fbnic_build_skb(nv, pkt);

			/* populate skb and invalidate XDP */
			if (!IS_ERR_OR_NULL(skb)) {
				fbnic_populate_skb_fields(nv, rcd, skb, qt);

				packets++;

				napi_gro_receive(&nv->napi, skb);
			}

			pkt->buff.data_hard_start = NULL;

			break;
		}
	}

	/* Unmap and free processed buffers */
	if (head0 >= 0)
		fbnic_clean_bdq(nv, budget, &qt->sub0, head0);
	fbnic_fill_bdq(nv, &qt->sub0);

	if (head1 >= 0)
		fbnic_clean_bdq(nv, budget, &qt->sub1, head1);
	fbnic_fill_bdq(nv, &qt->sub1);

> 
>>> It is better for us to optimize for the small packet scenario than
>>> optimize for the case where 4K slices are getting taken. That way when
>>> we are CPU constrained handling small packets we are the most
>>> optimized whereas for the larger frames we can spare a few cycles to
>>> account for the extra overhead. The result should be a higher overall
>>> packets per second.
>>
>> The problem is that small packets mean low utilization of the bandwidth,
>> as more bandwidth is used for headers instead of payload that is useful
>> to the user, so the question seems to be: how often are small packets
>> seen on the wire?
> 
> Very often. Especially when you are running something like servers
> where the flow usually consists of an incoming request which is often
> only a few hundred bytes, followed by us sending a response which then
> leads to a flow of control frames for it.

I think this depends on the use case; if it is a video streaming server,
I guess most of the packets are MTU-sized?

