Message-ID: <CAMet4B4iJjQK6yX+XBD2CtH3B30oqECUAYDj3ZE3ysdJVu8O4w@mail.gmail.com>
Date: Wed, 22 Sep 2021 12:10:05 +0530
From: Siva Reddy Kallam <siva.kallam@...adcom.com>
To: Vitaly Bursov <vitaly@...sov.com>
Cc: Prashant Sreedharan <prashant@...adcom.com>,
Michael Chan <mchan@...adcom.com>,
Linux Netdev List <netdev@...r.kernel.org>,
Pavan Chebbi <pavan.chebbi@...adcom.com>
Subject: Re: tg3 RX packet re-order in queue 0 with RSS
Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into this issue.
We will provide our feedback on this soon.
On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@...sov.com> wrote:
>
> Hi,
>
> We found an occasional and random (sometimes happens, sometimes not)
> packet re-order when the NIC is involved in UDP multicast reception,
> which is sensitive to packet re-ordering. A network capture with
> tcpdump sometimes shows the re-order, sometimes not (e.g. no re-order
> on the host while there is re-order in a container at the same time).
> In the pcap file the re-ordered packets have correct timestamps - the
> delayed packet carries an earlier timestamp than the packet before it:
> 1.00s packet1
> 1.20s packet3
> 1.10s packet2
> 1.30s packet4
>
> There is about 300 Mbps of traffic on this NIC, and the server is busy
> (hyper-threading enabled, about 50% overall idle) with its
> computational application work.
>
> The NIC is HPE's 4-port 331i adapter - BCM5719, in the default ring
> and coalescing configuration, with 1 TX queue and 4 RX queues.
>
> After further investigation, I believe there are two separate issues
> in the tg3.c driver. Both can be reproduced with iperf3 and unicast
> UDP.
>
> Here are the details of how I understand this behavior.
>
> 1. Packet re-order.
>
> The driver calls napi_schedule(&tnapi->napi) when handling the
> interrupt; however, sometimes it also calls
> napi_schedule(&tp->napi[1].napi), which handles RX queue 0 too:
>
> https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>
> static int tg3_rx(struct tg3_napi *tnapi, int budget)
> {
>         struct tg3 *tp = tnapi->tp;
>
>         ...
>
>         /* Refill RX ring(s). */
>         if (!tg3_flag(tp, ENABLE_RSS)) {
>                 ....
>         } else if (work_mask) {
>                 ...
>
>                 if (tnapi != &tp->napi[1]) {
>                         tp->rx_refill = true;
>                         napi_schedule(&tp->napi[1].napi);
>                 }
>         }
>         ...
> }
>
> From the napi_schedule() code, this schedules RX queue 0 processing on
> the current CPU, i.e. whichever CPU is handling queues RX 1-3 at that
> moment.
>
> At least two traffic flows are required - one on RX queue 0 and the
> other on any other queue (1-3). Re-ordering can happen only on the
> flow on queue 0; the second flow works fine.
>
> No idea how to fix this.
>
> There are two ways to mitigate this:
>
> 1. Enable RPS by writing any non-zero mask to
> /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus. This pins the CPU
> used for processing the traffic and overrides whatever the
> "current" CPU for RX queue 0 happens to be at that moment.
>
> 2. Configure the RX flow hash indirection table with
> "ethtool -X enp2s0f0 weight 0 1 1 1" to exclude RX queue 0 from
> handling the traffic.
>
>
> 2. RPS configuration
>
> Before napi_gro_receive() call, there's no call to skb_record_rx_queue():
>
> static int tg3_rx(struct tg3_napi *tnapi, int budget)
> {
>         struct tg3 *tp = tnapi->tp;
>         u32 work_mask, rx_std_posted = 0;
>         u32 std_prod_idx, jmb_prod_idx;
>         u32 sw_idx = tnapi->rx_rcb_ptr;
>         u16 hw_idx;
>         int received;
>         struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
>
>         ...
>
>         napi_gro_receive(&tnapi->napi, skb);
>
>         received++;
>         budget--;
>         ...
>
>
> As a result, queue_mapping is always 0 (never set), and RPS handles
> all traffic as if it originated from queue 0.
>
> <idle>-0 [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
>
> RPS configuration for rx-1 to rx-3 has no effect.
>
>
> NIC:
> 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 16
> NUMA node: 0
> Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
> Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
> Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
> [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
> Capabilities: <access denied>
> Kernel driver in use: tg3
> Kernel modules: tg3
>
> Linux kernel:
> CentOS 7 - 3.10.0-1160.15.2
> Ubuntu - 5.4.0-80.90
>
> Network configuration:
> iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
>
> brctl addbr br0
> ip l set up dev br0
> ip a a 10.10.10.10/24 dev br0
> ip r a default via 10.10.10.1 dev br0
> ip l set dev enp2s0f0 master br0
> ip l set up dev enp2s0f0
>
> ip netns add n1
> ip link add v1 type veth peer name v2
> ip l set up dev v1
> ip l set dev v1 master br0
> ip l set dev v2 netns n1
>
> ip netns exec n1 bash
> ip l set up dev lo
> ip l set up dev v2
> ip a a 10.10.10.11/24 dev v2
>
> "receiver 2" has the same configuration but different IP and different namespace.
>
> Iperf3:
>
> Sender runs the servers: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
> Receiver runs: iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
>
> --
> Thanks
> Vitalii
>