[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8fd0313b-8f6f-9814-247d-c2687d053e2a@kernel.org>
Date: Mon, 7 Aug 2023 15:15:22 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Wei Fang <wei.fang@....com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Jesper Dangaard Brouer <jbrouer@...hat.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: "davem@...emloft.net" <davem@...emloft.net>,
"edumazet@...gle.com" <edumazet@...gle.com>,
"pabeni@...hat.com" <pabeni@...hat.com>,
Shenwei Wang <shenwei.wang@....com>,
Clark Wang <xiaoning.wang@....com>,
"ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
dl-linux-imx <linux-imx@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
Andrew Lunn <andrew@...n.ch>
Subject: Re: [PATCH V3 net-next] net: fec: add XDP_TX feature support
On 07/08/2023 12.30, Wei Fang wrote:
>>> The flow-control was not disabled before, so according to your
>>> suggestion, I disable the flow-control on the both boards and run the
>>> test again, the performance is slightly improved, but still can not
>>> see a clear difference between the two methods. Below are the results.
>>
>> Something else must be stalling the CPU.
>> When looking at fec_main.c code, I noticed that
>> fec_enet_txq_xmit_frame() will do a MMIO write for every xdp_frame (to
>> trigger transmit start), which I believe will stall the CPU.
>> The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
>> function that does the MMIO write to trigger transmit start.
>>
> We'd better keep a MMIO write for every xdp_frame on txq, as you know,
> the txq will be inactive when no additional ready descriptors remain in the
> tx-BDR. So it may increase the delay of the packets if we do a MMIO write
> for multiple packets.
>
You know this hardware better than me, so I will leave to you.
>> $ git diff
>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>> b/drivers/net/ethernet/freescale/fec_main.c
>> index 03ac7690b5c4..57a6a3899b80 100644
>> --- a/drivers/net/ethernet/freescale/fec_main.c
>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>> @@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct
>> fec_enet_private *fep,
>>
>> txq->bd.cur = bdp;
>>
>> - /* Trigger transmission start */
>> - writel(0, txq->bd.reg_desc_active);
>> -
>> return 0;
>> }
>>
>> @@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device
>> *dev,
>> sent_frames++;
>> }
>>
>> + /* Trigger transmission start */
>> + writel(0, txq->bd.reg_desc_active);
>> +
>> __netif_tx_unlock(nq);
>>
>> return sent_frames;
>>
>>
>>> Result: use "sync_dma_len" method
>>> root@...8mpevk:~# ./xdp2 eth0
>>
>> The xdp2 (and xdp1) program(s) have a performance issue (due to using
>>
>> Can I ask you to test using xdp_rxq_info, like:
>>
>> sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX
>>
> Yes, below are the results, the results are also basically the same.
> Result 1: current method
> ./xdp_rxq_info --dev eth0 --action XDP_TX
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,102 0
> XDP-RX CPU total 259,102
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,102 0
> rx_queue_index 0:sum 259,102
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,498 0
> XDP-RX CPU total 259,498
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,496 0
> rx_queue_index 0:sum 259,496
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,408 0
> XDP-RX CPU total 259,408
>
> Result 2: dma_sync_len method
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 258,254 0
> XDP-RX CPU total 258,254
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 258,254 0
> rx_queue_index 0:sum 258,254
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,316 0
> XDP-RX CPU total 259,316
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,318 0
> rx_queue_index 0:sum 259,318
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats CPU pps issue-pps
> XDP-RX CPU 0 259,554 0
> XDP-RX CPU total 259,554
> RXQ stats RXQ:CPU pps issue-pps
> rx_queue_index 0:0 259,553 0
> rx_queue_index 0:sum 259,553
>
Thanks for running this.
>>
>>> proto 17: 258886 pkt/s
>>> proto 17: 258879 pkt/s
>>
>> If you provide numbers for xdp_redirect, then we could better evaluate if
>> changing the lock per xdp_frame, for XDP_TX also, is worth it.
>>
> For XDP_REDIRECT, the performance show as follow.
> root@...8mpevk:~# ./xdp_redirect eth1 eth0
> Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
This is not exactly the same as XDP_TX setup as here you choose to
redirect between eth1 (driver st_gmac) and to eth0 (driver fec).
I would like to see eth0 to eth0 XDP_REDIRECT, so we can compare to
XDP_TX performance.
Sorry for all the requests, but can you provide those numbers?
> eth1->eth0 221,642 rx/s 0 err,drop/s 221,643 xmit/s
So, XDP_REDIRECT is approx (1-(221825/259554))*100 = 14.53% slower.
But as this is 'eth1->eth0' this isn't true comparison to XDP_TX.
> eth1->eth0 221,761 rx/s 0 err,drop/s 221,760 xmit/s
> eth1->eth0 221,793 rx/s 0 err,drop/s 221,794 xmit/s
> eth1->eth0 221,825 rx/s 0 err,drop/s 221,825 xmit/s
> eth1->eth0 221,823 rx/s 0 err,drop/s 221,821 xmit/s
> eth1->eth0 221,815 rx/s 0 err,drop/s 221,816 xmit/s
> eth1->eth0 222,016 rx/s 0 err,drop/s 222,016 xmit/s
> eth1->eth0 222,059 rx/s 0 err,drop/s 222,059 xmit/s
> eth1->eth0 222,085 rx/s 0 err,drop/s 222,089 xmit/s
> eth1->eth0 221,956 rx/s 0 err,drop/s 221,952 xmit/s
> eth1->eth0 222,070 rx/s 0 err,drop/s 222,071 xmit/s
> eth1->eth0 222,017 rx/s 0 err,drop/s 222,017 xmit/s
> eth1->eth0 222,069 rx/s 0 err,drop/s 222,067 xmit/s
> eth1->eth0 221,986 rx/s 0 err,drop/s 221,987 xmit/s
> eth1->eth0 221,932 rx/s 0 err,drop/s 221,936 xmit/s
> eth1->eth0 222,045 rx/s 0 err,drop/s 222,041 xmit/s
> eth1->eth0 222,014 rx/s 0 err,drop/s 222,014 xmit/s
> Packets received : 3,772,908
> Average packets/s : 221,936
> Packets transmitted : 3,772,908
> Average transmit/s : 221,936
>> And also find out of moving the MMIO write have any effect.
>>
> I move the MMIO write to fec_enet_xdp_xmit(), the result shows as follow,
> the performance is slightly improved.
>
I'm puzzled that moving the MMIO write isn't change performance.
Can you please verify that the packet generator machine is sending more
frame than the system can handle?
(meaning the pktgen_sample03_burst_single_flow.sh script fast enough?)
> root@...8mpevk:~# ./xdp_redirect eth1 eth0
> Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
> eth1->eth0 222,666 rx/s 0 err,drop/s 222,668 xmit/s
> eth1->eth0 221,663 rx/s 0 err,drop/s 221,664 xmit/s
> eth1->eth0 222,743 rx/s 0 err,drop/s 222,741 xmit/s
> eth1->eth0 222,917 rx/s 0 err,drop/s 222,923 xmit/s
> eth1->eth0 221,810 rx/s 0 err,drop/s 221,808 xmit/s
> eth1->eth0 222,891 rx/s 0 err,drop/s 222,888 xmit/s
> eth1->eth0 222,983 rx/s 0 err,drop/s 222,984 xmit/s
> eth1->eth0 221,655 rx/s 0 err,drop/s 221,653 xmit/s
> eth1->eth0 222,827 rx/s 0 err,drop/s 222,827 xmit/s
> eth1->eth0 221,728 rx/s 0 err,drop/s 221,728 xmit/s
> eth1->eth0 222,790 rx/s 0 err,drop/s 222,789 xmit/s
> eth1->eth0 222,874 rx/s 0 err,drop/s 222,874 xmit/s
> eth1->eth0 221,888 rx/s 0 err,drop/s 221,887 xmit/s
> eth1->eth0 223,057 rx/s 0 err,drop/s 223,056 xmit/s
> eth1->eth0 222,219 rx/s 0 err,drop/s 222,220 xmit/s
> Packets received : 3,336,711
> Average packets/s : 222,447
> Packets transmitted : 3,336,710
> Average transmit/s : 222,447
>
>> I also noticed driver does a MMIO write (on rxq) for every RX-packet in
>> fec_enet_rx_queue() napi-poll loop. This also looks like a potential
>> performance stall.
>>
> The same as txq, the rxq will be inactive if the rx-BDR has no free BDs, so we'd
> better do a MMIO write when we recycle a BD, so that the hardware can timely
> attach the received pakcets on the rx-BDR.
>
> In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but the
> performance of XDP_TX is still not improved. :(
>
I would not expect much performance improvement from this anyhow.
> After these days of testing, I think it's best to keep the solution in V3, and then
> make some optimizations on the V3 patch.
I agree.
I think you need to send a V5, and then I can ACK that.
Thanks for all this testing,
--Jesper
Powered by blists - more mailing lists