linux-kernel - Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

Message-ID: <e7cf9dd8-9c67-476b-a892-b8dbe9312c4c@bp.renesas.com>
Date: Thu, 30 May 2024 10:21:16 +0100
From: Paul Barker <paul.barker.ct@...renesas.com>
To: Sergey Shtylyov <s.shtylyov@....ru>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Niklas Söderlund <niklas.soderlund+renesas@...natech.se>
Cc: Biju Das <biju.das.jz@...renesas.com>,
 Claudiu Beznea <claudiu.beznea.uj@...renesas.com>,
 Yoshihiro Shimoda <yoshihiro.shimoda.uh@...esas.com>,
 netdev@...r.kernel.org, linux-renesas-soc@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page
 pool

On 29/05/2024 21:52, Sergey Shtylyov wrote:
> On 5/28/24 6:03 PM, Paul Barker wrote:
> 
>> This patch makes multiple changes that can't be separated:
>>
>>   1) Allocate plain RX buffers via a page pool instead of allocating
>>      SKBs, then use build_skb() when a packet is received.
>>   2) For GbEth IP, reduce the RX buffer size to 2kB.
>>   3) For GbEth IP, merge packets which span more than one RX descriptor
>>      as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>>   * RZ/G2L:
>>     * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>>     * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>>   * RZ/G2UL:
>>     * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>>     * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>>   * RZ/G3S:
>>     * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>>     * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>>   * RZ/Five:
>>     * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>>     * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <paul.barker.ct@...renesas.com>
>> ---
>> Changes v3->v4:
>>   * Used a separate page pool for each RX queue.
>>   * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
>>     simplify the calling function.
>>   * Explained the calculation of rx_desc->ds_cc.
>>   * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
>>
>>  drivers/net/ethernet/renesas/ravb.h      |  10 +-
>>  drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
>>  2 files changed, 146 insertions(+), 94 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index 6a7aa7dd17e6..f2091a17fcf7 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
> [...]
>> @@ -1094,7 +1099,8 @@ struct ravb_private {
>>  	struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
>>  	void *tx_align[NUM_TX_QUEUE];
>>  	struct sk_buff *rx_1st_skb;
>> -	struct sk_buff **rx_skb[NUM_RX_QUEUE];
>> +	struct page_pool *rx_pool[NUM_RX_QUEUE];
> 
>    Don't we need #include <net/page_pool/types.h>

Yes. I got away with it as ravb_main.c includes
<net/page_pool/helpers.h> before including "ravb.h", but the header
shouldn't assume that.

> 
> [...]
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index dd92f074881a..bb7f7d44be6e 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> [...]
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>>  	priv->tx_skb[q] = NULL;
>>  }
>>  
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> +		     struct ravb_rx_desc *rx_desc)
>> +{
>> +	struct ravb_private *priv = netdev_priv(ndev);
>> +	const struct ravb_hw_info *info = priv->info;
>> +	struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> +	dma_addr_t dma_addr;
>> +	unsigned int size;
>> +
>> +	size = info->rx_buffer_size;
>> +	rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> +					gfp_mask);
>> +	if (unlikely(!rx_buff->page)) {
>> +		/* We just set the data size to 0 for a failed mapping
>> +		 * which should prevent DMA from happening...
>> +		 */
>> +		rx_desc->ds_cc = cpu_to_le16(0);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> +	dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> +				   info->rx_buffer_size, DMA_FROM_DEVICE);
> 
>    Do we really need this call?

Looking at .config, I see CONFIG_DMA_NEED_SYNC=y, so yes, I think this is
needed.

> 
>> +	rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> +	/* The end of the RX buffer is used to store skb shared data, so we need
>> +	 * to ensure that the hardware leaves enough space for this.
>> +	 */
>> +	rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> +				     - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
> 
>    Please leave the - operator on the previous line...

Ack.

> 
>> +				     - ETH_FCS_LEN + sizeof(__sum16));
> 
>    Here as well...

Ack.

> 
>> +	return 0;
>> +}
>> +
>>  static u32
>>  ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>>  {
>>  	struct ravb_private *priv = netdev_priv(ndev);
>> -	const struct ravb_hw_info *info = priv->info;
>>  	struct ravb_rx_desc *rx_desc;
>> -	dma_addr_t dma_addr;
>>  	u32 i, entry;
>>  
>>  	for (i = 0; i < count; i++) {
>>  		entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>>  		rx_desc = ravb_rx_get_desc(priv, q, entry);
>> -		rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>  
>> -		if (!priv->rx_skb[q][entry]) {
>> -			priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>> -			if (!priv->rx_skb[q][entry])
>> +		if (!priv->rx_buffers[q][entry].page) {
>> +			if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
> 
>    Well, IIRC Greg KH is against using unlikely() unless you have actually
> instrumented the code and this gives an improvement... have you? :-)

My understanding was that we should use unlikely() for error checking in
hot code paths where we want the "good" path to be optimised. I can drop
this if I'm wrong though.

> 
> [...]
>> @@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
>>  	if (unlikely(skb->len < sizeof(__sum16) * 2))
>>  		return;
>>  
>> -	hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> +	if (skb_is_nonlinear(skb)) {
>> +		last_frag = &shinfo->frags[shinfo->nr_frags - 1];
>> +		hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
>> +	} else {
>> +		hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> +	}
> 
>    We can do the subtraction only once here...

Ack. I'll pull that out of the if.

> 
> [...]
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>  			if (desc_status & MSC_CEEF)
>>  				stats->rx_missed_errors++;
>>  		} else {
>> +			struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> +			void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
> 
>    Need an empty line here...

Ack.

> 
>>  			die_dt = desc->die_dt & 0xF0;
>> -			skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> +			dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> +						desc_len, DMA_FROM_DEVICE);
>> +
>>  			switch (die_dt) {
>>  			case DT_FSINGLE:
>>  			case DT_FSTART:
>>  				/* Start of packet:
>> -				 * Set initial data length.
>> +				 * Prepare an SKB and add initial data.
> 
>    I'd prefer calling it skb in the comments...

Ack.

> 
> [...]
>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>  				stats->rx_bytes += skb->len;
>>  				napi_gro_receive(&priv->napi[q], skb);
>>  				rx_packets++;
>> +
>> +				/* Clear rx_1st_skb so that it will only be
>> +				 * non-NULL when valid.
>> +				 */
>> +				if (die_dt == DT_FEND)
>> +					priv->rx_1st_skb = NULL;
> 
>    Hm, can't we do this under *case* DT_FEND above?

It makes more logical sense to me to do this as the last step, but I
guess it's slightly more efficient to do it earlier. I'll move it.

Thanks,

-- 
Paul Barker