Message-ID: <CAMet4B4iJjQK6yX+XBD2CtH3B30oqECUAYDj3ZE3ysdJVu8O4w@mail.gmail.com>
Date: Wed, 22 Sep 2021 12:10:05 +0530
From: Siva Reddy Kallam <siva.kallam@...adcom.com>
To: Vitaly Bursov <vitaly@...sov.com>
Cc: Prashant Sreedharan <prashant@...adcom.com>,
Michael Chan <mchan@...adcom.com>,
Linux Netdev List <netdev@...r.kernel.org>,
Pavan Chebbi <pavan.chebbi@...adcom.com>
Subject: Re: tg3 RX packet re-order in queue 0 with RSS
Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into this issue.
We will provide our feedback on this soon.
On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@...sov.com> wrote:
>
> Hi,
>
> We found an occasional and random (sometimes happens, sometimes not)
> packet re-order when the NIC is involved in UDP multicast reception,
> which is sensitive to packet re-ordering. A network capture with
> tcpdump sometimes shows the re-order, sometimes not (e.g. no re-order
> on the host while there is re-order in a container at the same time).
> In the pcap file the re-ordered packets have correct timestamps - the
> delayed packet carries an earlier timestamp than the packet before it:
> 1.00s packet1
> 1.20s packet3
> 1.10s packet2
> 1.30s packet4
>
> There is about 300 Mbps of traffic on this NIC, and the server is busy
> (hyper-threading enabled, about 50% overall idle) with its
> computational application work.
>
> The NIC is HPE's 4-port 331i adapter - BCM5719, in the default ring
> and coalescing configuration, with 1 TX queue and 4 RX queues.
>
> After further investigation, I believe there are two separate issues
> in the tg3.c driver. Both can be reproduced with iperf3 and unicast
> UDP.
>
> Here are the details of how I understand this behavior.
>
> 1. Packet re-order.
>
> The driver calls napi_schedule(&tnapi->napi) when handling the
> interrupt; however, sometimes it also calls
> napi_schedule(&tp->napi[1].napi), which handles RX queue 0 too:
>
> https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>
> static int tg3_rx(struct tg3_napi *tnapi, int budget)
> {
>         struct tg3 *tp = tnapi->tp;
>
>         ...
>
>         /* Refill RX ring(s). */
>         if (!tg3_flag(tp, ENABLE_RSS)) {
>                 ....
>         } else if (work_mask) {
>                 ...
>
>                 if (tnapi != &tp->napi[1]) {
>                         tp->rx_refill = true;
>                         napi_schedule(&tp->napi[1].napi);
>                 }
>         }
>         ...
> }
>
> From the napi_schedule() code, this schedules RX queue 0 processing on
> the current CPU, i.e. whichever CPU is handling queues RX 1-3 at that
> moment.
>
> At least two traffic flows are required - one on RX queue 0 and the
> other on any other queue (1-3). Re-ordering can happen only on the
> flow on queue 0; the second flow works fine.
>
> No idea how to fix this.
>
> There are two ways to mitigate this:
>
> 1. Enable RPS by writing any non-zero mask to
> /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus. This pins the CPU
> used for processing the traffic and overrides whatever the
> "current" CPU for RX queue 0 happens to be at that moment.
>
> 2. Configure the RX flow hash indirection table with
> "ethtool -X enp2s0f0 weight 0 1 1 1" to exclude RX queue 0 from
> handling the traffic.
>
>
> 2. RPS configuration
>
> Before napi_gro_receive() call, there's no call to skb_record_rx_queue():
>
> static int tg3_rx(struct tg3_napi *tnapi, int budget)
> {
>         struct tg3 *tp = tnapi->tp;
>         u32 work_mask, rx_std_posted = 0;
>         u32 std_prod_idx, jmb_prod_idx;
>         u32 sw_idx = tnapi->rx_rcb_ptr;
>         u16 hw_idx;
>         int received;
>         struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
>
>         ...
>
>         napi_gro_receive(&tnapi->napi, skb);
>
>         received++;
>         budget--;
>         ...
>
>
> As a result, queue_mapping is always 0 (never set), and RPS handles
> all traffic as if it originated from queue 0.
>
> <idle>-0 [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
>
> RPS configuration for rx-1 to rx-3 has no effect.
>
>
> NIC:
> 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 16
> NUMA node: 0
> Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
> Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
> Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
> [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
> Capabilities: <access denied>
> Kernel driver in use: tg3
> Kernel modules: tg3
>
> Linux kernel:
> CentOS 7 - 3.10.0-1160.15.2
> Ubuntu - 5.4.0-80.90
>
> Network configuration:
> iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
>
> brctl addbr br0
> ip l set up dev br0
> ip a a 10.10.10.10/24 dev br0
> ip r a default via 10.10.10.1 dev br0
> ip l set dev enp2s0f0 master br0
> ip l set up dev enp2s0f0
>
> ip netns add n1
> ip link add v1 type veth peer name v2
> ip l set up dev v1
> ip l set dev v1 master br0
> ip l set dev v2 netns n1
>
> ip netns exec n1 bash
> ip l set up dev lo
> ip l set up dev v2
> ip a a 10.10.10.11/24 dev v2
>
> "receiver 2" has the same configuration but different IP and different namespace.
>
> Iperf3:
>
> Sender runs the servers: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
> Receiver runs: iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
>
> --
> Thanks
> Vitalii
>