Message-ID: <8fd0313b-8f6f-9814-247d-c2687d053e2a@kernel.org>
Date: Mon, 7 Aug 2023 15:15:22 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Wei Fang <wei.fang@....com>, Jesper Dangaard Brouer <hawk@...nel.org>,
Jesper Dangaard Brouer <jbrouer@...hat.com>, Jakub Kicinski <kuba@...nel.org>
Cc: "davem@...emloft.net" <davem@...emloft.net>,
"edumazet@...gle.com" <edumazet@...gle.com>,
"pabeni@...hat.com" <pabeni@...hat.com>, Shenwei Wang
<shenwei.wang@....com>, Clark Wang <xiaoning.wang@....com>,
"ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
dl-linux-imx <linux-imx@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>
Subject: Re: [PATCH V3 net-next] net: fec: add XDP_TX feature support
On 07/08/2023 12.30, Wei Fang wrote:
>>> The flow-control was not disabled before, so following your
>>> suggestion I disabled flow-control on both boards and ran the test
>>> again. The performance is slightly improved, but I still cannot
>>> see a clear difference between the two methods. Below are the results.
>>
>> Something else must be stalling the CPU.
>> Looking at the fec_main.c code, I noticed that
>> fec_enet_txq_xmit_frame() will do an MMIO write for every xdp_frame (to
>> trigger transmit start), which I believe will stall the CPU.
>> The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
>> function that does the MMIO write to trigger transmit start.
>>
> We'd better keep an MMIO write for every xdp_frame on the txq. As you know,
> the txq becomes inactive when no additional ready descriptors remain in the
> tx-BDR, so doing a single MMIO write for multiple packets may increase
> packet latency.
>
You know this hardware better than I do, so I will leave it to you.
>> $ git diff
>> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
>> index 03ac7690b5c4..57a6a3899b80 100644
>> --- a/drivers/net/ethernet/freescale/fec_main.c
>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>> @@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>>  
>>          txq->bd.cur = bdp;
>>  
>> -        /* Trigger transmission start */
>> -        writel(0, txq->bd.reg_desc_active);
>> -
>>          return 0;
>>  }
>>  
>> @@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
>>                  sent_frames++;
>>          }
>>  
>> +        /* Trigger transmission start */
>> +        writel(0, txq->bd.reg_desc_active);
>> +
>>          __netif_tx_unlock(nq);
>>  
>>          return sent_frames;
>>
>>
>>> Result: use "sync_dma_len" method
>>> root@...8mpevk:~# ./xdp2 eth0
>>
>> The xdp2 (and xdp1) program(s) have a performance issue (due to using
>>
>> Can I ask you to test using xdp_rxq_info, like:
>>
>> sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX
>>
> Yes, below are the results; they are also basically the same.
> Result 1: current method
> ./xdp_rxq_info --dev eth0 --action XDP_TX
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,102 0
> XDP-RX CPU total 259,102
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,102 0
> rx_queue_index 0:sum 259,102
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,498 0
> XDP-RX CPU total 259,498
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,496 0
> rx_queue_index 0:sum 259,496
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,408 0
> XDP-RX CPU total 259,408
>
> Result 2: dma_sync_len method
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 258,254 0
> XDP-RX CPU total 258,254
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 258,254 0
> rx_queue_index 0:sum 258,254
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,316 0
> XDP-RX CPU total 259,316
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,318 0
> rx_queue_index 0:sum 259,318
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,554 0
> XDP-RX CPU total 259,554
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,553 0
> rx_queue_index 0:sum 259,553
>
Thanks for running this.
>>
>>> proto 17: 258886 pkt/s
>>> proto 17: 258879 pkt/s
>>
>> If you provide numbers for xdp_redirect, then we can better evaluate
>> whether changing the per-xdp_frame locking for XDP_TX as well is worth it.
>>
> For XDP_REDIRECT, the performance is shown below.
> root@...8mpevk:~# ./xdp_redirect eth1 eth0
> Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
This is not exactly the same setup as XDP_TX, as here you redirect from
eth1 (driver st_gmac) to eth0 (driver fec).
I would like to see eth0-to-eth0 XDP_REDIRECT, so we can compare it to the
XDP_TX performance.
Sorry for all the requests, but can you provide those numbers?
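E.g. with the same sample tool, using the same device for input and output
(it accepts that, as far as I recall):

   ./xdp_redirect eth0 eth0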
> eth1->eth0 221,642 rx/s 0 err,drop/s 221,643 xmit/s
So, XDP_REDIRECT is approx (1 - (221825/259554)) * 100 ≈ 14.5% slower.
But as this is 'eth1->eth0', it isn't a true comparison to XDP_TX.
> eth1->eth0 221,761 rx/s 0 err,drop/s 221,760 xmit/s
> eth1->eth0 221,793 rx/s 0 err,drop/s 221,794 xmit/s
> eth1->eth0 221,825 rx/s 0 err,drop/s 221,825 xmit/s
> eth1->eth0 221,823 rx/s 0 err,drop/s 221,821 xmit/s
> eth1->eth0 221,815 rx/s 0 err,drop/s 221,816 xmit/s
> eth1->eth0 222,016 rx/s 0 err,drop/s 222,016 xmit/s
> eth1->eth0 222,059 rx/s 0 err,drop/s 222,059 xmit/s
> eth1->eth0 222,085 rx/s 0 err,drop/s 222,089 xmit/s
> eth1->eth0 221,956 rx/s 0 err,drop/s 221,952 xmit/s
> eth1->eth0 222,070 rx/s 0 err,drop/s 222,071 xmit/s
> eth1->eth0 222,017 rx/s 0 err,drop/s 222,017 xmit/s
> eth1->eth0 222,069 rx/s 0 err,drop/s 222,067 xmit/s
> eth1->eth0 221,986 rx/s 0 err,drop/s 221,987 xmit/s
> eth1->eth0 221,932 rx/s 0 err,drop/s 221,936 xmit/s
> eth1->eth0 222,045 rx/s 0 err,drop/s 222,041 xmit/s
> eth1->eth0 222,014 rx/s 0 err,drop/s 222,014 xmit/s
> Packets received : 3,772,908
> Average packets/s : 221,936
> Packets transmitted : 3,772,908
> Average transmit/s : 221,936
>> And also find out of moving the MMIO write have any effect.
>>
> I moved the MMIO write to fec_enet_xdp_xmit(); the results are shown below.
> The performance is slightly improved.
>
I'm puzzled that moving the MMIO write doesn't change performance noticeably.
Can you please verify that the packet generator machine is sending more
frames than the system can handle?
(i.e. is the pktgen_sample03_burst_single_flow.sh script fast enough?)
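For reference, a generator command along these lines should keep the link
saturated (the interface, destination IP/MAC and thread count below are
placeholders for your generator setup):

   ./pktgen_sample03_burst_single_flow.sh -i <ifname> -d <dst-ip> -m <dst-mac> -t 4 -s 64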
> root@...8mpevk:~# ./xdp_redirect eth1 eth0
> Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
> eth1->eth0 222,666 rx/s 0 err,drop/s 222,668 xmit/s
> eth1->eth0 221,663 rx/s 0 err,drop/s 221,664 xmit/s
> eth1->eth0 222,743 rx/s 0 err,drop/s 222,741 xmit/s
> eth1->eth0 222,917 rx/s 0 err,drop/s 222,923 xmit/s
> eth1->eth0 221,810 rx/s 0 err,drop/s 221,808 xmit/s
> eth1->eth0 222,891 rx/s 0 err,drop/s 222,888 xmit/s
> eth1->eth0 222,983 rx/s 0 err,drop/s 222,984 xmit/s
> eth1->eth0 221,655 rx/s 0 err,drop/s 221,653 xmit/s
> eth1->eth0 222,827 rx/s 0 err,drop/s 222,827 xmit/s
> eth1->eth0 221,728 rx/s 0 err,drop/s 221,728 xmit/s
> eth1->eth0 222,790 rx/s 0 err,drop/s 222,789 xmit/s
> eth1->eth0 222,874 rx/s 0 err,drop/s 222,874 xmit/s
> eth1->eth0 221,888 rx/s 0 err,drop/s 221,887 xmit/s
> eth1->eth0 223,057 rx/s 0 err,drop/s 223,056 xmit/s
> eth1->eth0 222,219 rx/s 0 err,drop/s 222,220 xmit/s
> Packets received : 3,336,711
> Average packets/s : 222,447
> Packets transmitted : 3,336,710
> Average transmit/s : 222,447
>
>> I also noticed the driver does an MMIO write (on the rxq) for every
>> RX packet in the fec_enet_rx_queue() napi-poll loop. This also looks
>> like a potential performance stall.
>>
> The same as for the txq, the rxq will become inactive if the rx-BDR has no
> free BDs, so we'd better do an MMIO write each time we recycle a BD, so that
> the hardware can attach received packets to the rx-BDR in a timely manner.
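Okay. If you want to experiment with this later, a batched variant could look
roughly like the sketch below. It is completely untested and not the real
fec_enet_rx_queue() code (rx_bd_ready() only stands in for the existing
"BD not empty" check); the idea is just the rxq equivalent of the tx change
above, one doorbell per napi poll instead of one per recycled BD:

   /* Untested sketch: move the rxq doorbell out of the per-packet loop. */
   while (pkt_received < budget && rx_bd_ready(rxq)) {
           /* ... process the packet and recycle the BD ... */
           pkt_received++;
           /* per-BD writel(0, rxq->bd.reg_desc_active) removed here */
   }

   if (pkt_received)
           writel(0, rxq->bd.reg_desc_active); /* one doorbell per poll */

The obvious trade-off is that the ring is re-armed slightly later, which
matters if the rx-BDR runs dry quickly, as you describe.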
>
> In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but the
> performance of XDP_TX is still not improved. :(
>
I would not expect much performance improvement from this anyhow.
> After these days of testing, I think it's best to keep the solution in V3,
> and then make some optimizations on top of the V3 patch.
I agree.
I think you need to send a V5, and then I can ACK that.
Thanks for all this testing,
--Jesper