Message-ID: <CALs4sv2YVu0euy5-stBNuES3Bf2SR7MtiD0TJDfGmTLAiUONSA@mail.gmail.com>
Date: Wed, 27 Oct 2021 15:00:16 +0530
From: Pavan Chebbi <pavan.chebbi@...adcom.com>
To: Siva Reddy Kallam <siva.kallam@...adcom.com>
Cc: Vitaly Bursov <vitaly@...sov.com>,
Prashant Sreedharan <prashant@...adcom.com>,
Michael Chan <mchan@...adcom.com>,
Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: tg3 RX packet re-order in queue 0 with RSS
On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
<siva.kallam@...adcom.com> wrote:
>
> Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into this issue.
> We will provide our feedback on this very soon.
>
> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >
> > Hi,
> >
> > We found an occasional and random (sometimes happens, sometimes not)
> > packet re-order when the NIC is involved in UDP multicast reception,
> > which is sensitive to packet re-ordering. A network capture with
> > tcpdump sometimes shows the re-order, sometimes not (e.g. no re-order
> > on the host while there is re-order in a container at the same time).
> > In the pcap file the re-ordered packets carry correct timestamps - the
> > delayed packet has an earlier timestamp than the packet before it:
> > 1.00s packet1
> > 1.20s packet3
> > 1.10s packet2
> > 1.30s packet4
> >
> > There is about 300 Mbps of traffic on this NIC, and the server is
> > busy (hyper-threading enabled, about 50% idle overall) with its
> > computational application work.
> >
> > The NIC is HPE's 4-port 331i adapter - a BCM5719 - in the default
> > ring and coalescing configuration, with 1 TX queue and 4 RX queues.
> >
> > After further investigation, I believe there are two separate issues
> > in the tg3.c driver. Both issues can be reproduced with iperf3 and
> > unicast UDP.
> >
> > Here are the details of how I understand this behavior.
> >
> > 1. Packet re-order.
> >
> > When handling an interrupt, the driver calls napi_schedule(&tnapi->napi);
> > however, it sometimes also calls napi_schedule(&tp->napi[1].napi),
> > which handles RX queue 0 as well:
> >
> > https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >
> > static int tg3_rx(struct tg3_napi *tnapi, int budget)
> > {
> >         struct tg3 *tp = tnapi->tp;
> >
> >         ...
> >
> >         /* Refill RX ring(s). */
> >         if (!tg3_flag(tp, ENABLE_RSS)) {
> >                 ...
> >         } else if (work_mask) {
> >                 ...
> >
> >                 if (tnapi != &tp->napi[1]) {
> >                         tp->rx_refill = true;
> >                         napi_schedule(&tp->napi[1].napi);
> >                 }
> >         }
> >         ...
> > }
> >
> > From the napi_schedule() code, this schedules RX queue 0 traffic
> > handling on the current CPU, i.e. whichever CPU is handling queues
> > RX 1-3 at that moment.
> >
> > At least two traffic flows are required to reproduce this - one on RX
> > queue 0, and the other on any other queue (1-3). Re-ordering may
> > happen only on the flow from queue 0; the second flow works fine.
> >
> > No idea how to fix this.
In the case of RSS, the actual RX rings are 1 to 4, and the napi
contexts of those rings are indeed the ones processing the packets.
The explicit napi_schedule() of napi[1] only re-fills the RX BD
producer ring, because that ring is shared with the return rings for
1-4. I tried to reproduce this but I am not seeing the issue. If you
are receiving packets on RX 0, then RSS must have been disabled.
Can you please check?
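
For reference, the poll-side half of that refill handshake looks
roughly like this (paraphrased from memory from tg3_poll_msix() in
tg3.c; tp->rx_refill is the flag set in the excerpt quoted above):

        /* In tg3_poll_msix(): when napi[1] finds no more RX/TX work
         * but a refill request from another ring is still pending,
         * loop again instead of completing NAPI and re-enabling
         * interrupts, so the shared producer ring gets replenished.
         */
        if (tnapi == &tp->napi[1] && tp->rx_refill)
                continue;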
> >
> > There are two ways to mitigate this:
> >
> > 1. Enable RPS by writing any non-zero mask to
> >    /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus. This enforces the
> >    CPU used when processing traffic, overriding whatever the
> >    "current" CPU for RX queue 0 happens to be at that moment.
> >
> > 2. Configure the RX flow hash indirection table with "ethtool -X
> >    enp2s0f0 weight 0 1 1 1" to exclude RX queue 0 from handling
> >    the traffic.
> >
> >
> > 2. RPS configuration
> >
> > Before the napi_gro_receive() call, there is no call to skb_record_rx_queue():
> >
> > static int tg3_rx(struct tg3_napi *tnapi, int budget)
> > {
> >         struct tg3 *tp = tnapi->tp;
> >         u32 work_mask, rx_std_posted = 0;
> >         u32 std_prod_idx, jmb_prod_idx;
> >         u32 sw_idx = tnapi->rx_rcb_ptr;
> >         u16 hw_idx;
> >         int received;
> >         struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
> >
> >         ...
> >
> >         napi_gro_receive(&tnapi->napi, skb);
> >
> >         received++;
> >         budget--;
> >         ...
> >
> > As a result, queue_mapping is always 0 (i.e. never set), and RPS
> > treats all traffic as originating from queue 0:
> >
> > <idle>-0 [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
> >
> > RPS configuration for rx-1 to rx-3 has no effect.
OK, I think we could add a patch to update the skb with the queue
mapping. I will discuss it internally.
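
A minimal sketch of what such a patch could look like (untested, and
assuming that under RSS the RX return rings start at tp->napi[1], so
the stack-visible queue index is the napi index minus one):

        /* In tg3_rx(): record which RX queue the packet arrived on
         * before handing the skb to the stack, so RPS sees the real
         * queue_mapping instead of the default 0.
         */
        skb_record_rx_queue(skb, tnapi - tp->napi - 1);

        napi_gro_receive(&tnapi->napi, skb);

skb_record_rx_queue() stores the index (offset by one, since 0 means
"not recorded") in skb->queue_mapping, which is the field both RPS
and the napi_gro_receive_entry tracepoint above look at.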
> >
> >
> > NIC:
> > 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> > Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 16
> > NUMA node: 0
> > Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
> > Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
> > Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
> > [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
> > Capabilities: <access denied>
> > Kernel driver in use: tg3
> > Kernel modules: tg3
> >
> > Linux kernel:
> > CentOS 7 - 3.10.0-1160.15.2
> > Ubuntu - 5.4.0-80.90
> >
> > Network configuration:
> > iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
> >
> > brctl addbr br0
> > ip l set up dev br0
> > ip a a 10.10.10.10/24 dev br0
> > ip r a default via 10.10.10.1 dev br0
> > ip l set dev enp2s0f0 master br0
> > ip l set up dev enp2s0f0
> >
> > ip netns add n1
> > ip link add v1 type veth peer name v2
> > ip l set up dev v1
> > ip l set dev v1 master br0
> > ip l set dev v2 netns n1
> >
> > ip netns exec n1 bash
> > ip l set up dev lo
> > ip l set up dev v2
> > ip a a 10.10.10.11/24 dev v2
> >
> > "receiver 2" has the same configuration but different IP and different namespace.
> >
> > Iperf3:
> >
> > The sender runs two iperf3 servers: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
> > The receiver runs: iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
> >
> > --
> > Thanks
> > Vitalii
> >