netdev - Re: tg3 RX packet re-order in queue 0 with RSS

Message-ID: <CALs4sv3XOTBKCxaUieYosMdXuuqiuHT5Gbhz8oixGv2XGw4+Ug@mail.gmail.com>
Date:   Thu, 28 Oct 2021 13:03:57 +0530
From:   Pavan Chebbi <pavan.chebbi@...adcom.com>
To:     Vitaly Bursov <vitaly@...sov.com>
Cc:     Siva Reddy Kallam <siva.kallam@...adcom.com>,
        Prashant Sreedharan <prashant@...adcom.com>,
        Michael Chan <mchan@...adcom.com>,
        Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: tg3 RX packet re-order in queue 0 with RSS

On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@...sov.com> wrote:
>
>
> 27.10.2021 12:30, Pavan Chebbi wrote:
> > On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> > <siva.kallam@...adcom.com> wrote:
> >>
> >> Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into
> >> this issue. We will provide our feedback on this very soon.
> >>
> >> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We found an occasional and random (sometimes happens, sometimes not)
> >>> packet re-order when the NIC is involved in UDP multicast reception,
> >>> which is sensitive to packet re-ordering. Network capture with tcpdump
> >>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>> a host, re-order in a container at the same time). In a pcap file the
> >>> re-ordered packets have correct timestamps - the delayed packet has an
> >>> earlier timestamp than the packet captured before it:
> >>>       1.00s packet1
> >>>       1.20s packet3
> >>>       1.10s packet2
> >>>       1.30s packet4
> >>>
> >>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>> (hyper-threading enabled, about 50% overall idle) with its
> >>> computational application work.
> >>>
> >>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>
> >>> After further investigation, I believe that there are two separate
> >>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>> unicast UDP.
> >>>
> >>> Here are the details of how I understand this behavior.
> >>>
> >>> 1. Packet re-order.
> >>>
> >>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>> handles RX queue 0 too:
> >>>
> >>>       https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>
> >>>       static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>       {
> >>>               struct tg3 *tp = tnapi->tp;
> >>>
> >>>               ...
> >>>
> >>>               /* Refill RX ring(s). */
> >>>               if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>                       ....
> >>>               } else if (work_mask) {
> >>>                       ...
> >>>
> >>>                       if (tnapi != &tp->napi[1]) {
> >>>                               tp->rx_refill = true;
> >>>                               napi_schedule(&tp->napi[1].napi);
> >>>                       }
> >>>               }
> >>>               ...
> >>>       }
> >>>
> >>> Judging from the napi_schedule() code, this should schedule RX 0 traffic
> >>> handling on the current CPU, which is handling queues RX1-3 right now.
> >>>
> >>> At least two traffic flows are required - one on RX queue 0, and the
> >>> other on any other queue (1-3). Re-ordering may happen only on the flow
> >>> from queue 0; the second flow will work fine.
> >>>
> >>> No idea how to fix this.
> >
> > In the case of RSS the actual rings for RX are from 1 to 4.
> > The napi instances of those rings are indeed processing the packets.
> > The explicit napi_schedule of napi[1] is only re-filling the rx BD
> > producer ring because it is shared with the return rings for 1-4.
> > I tried to repro this but I am not seeing the issue. If you are
> > receiving packets on RX 0 then the RSS must have been disabled.
> > Can you please check?
> >
>
> # ethtool -i enp2s0f0
> driver: tg3
> version: 3.137
> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> # ethtool -l enp2s0f0
> Channel parameters for enp2s0f0:
> Pre-set maximums:
> RX:             4
> TX:             4
> Other:          0
> Combined:       0
> Current hardware settings:
> RX:             4
> TX:             1
> Other:          0
> Combined:       0
>
> # ethtool -x enp2s0f0
> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>      0:      0     1     2     3     0     1     2     3
>      8:      0     1     2     3     0     1     2     3
>     16:      0     1     2     3     0     1     2     3
>     24:      0     1     2     3     0     1     2     3
>     32:      0     1     2     3     0     1     2     3
>     40:      0     1     2     3     0     1     2     3
>     48:      0     1     2     3     0     1     2     3
>     56:      0     1     2     3     0     1     2     3
>     64:      0     1     2     3     0     1     2     3
>     72:      0     1     2     3     0     1     2     3
>     80:      0     1     2     3     0     1     2     3
>     88:      0     1     2     3     0     1     2     3
>     96:      0     1     2     3     0     1     2     3
>    104:      0     1     2     3     0     1     2     3
>    112:      0     1     2     3     0     1     2     3
>    120:      0     1     2     3     0     1     2     3
> RSS hash key:
> Operation not supported
> RSS hash function:
>      toeplitz: on
>      xor: off
>      crc32: off
>
> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> different CPU cores. Kernel also has "threadirqs" enabled in
> command line, I didn't check if this parameter affects the issue.
>
> Yes, some things start with 0, and others with 1, sorry for the confusion
> in terminology. What I meant:
>   - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>     RX0 is the first queue/ring that actually receives the traffic.
>     RX0 is handled by enp2s0f0-rx-1 interrupt.
>   - These are related to (tp->napi[i]), but i is in 1..4, so the first
>     receiving queue relates to tp->napi[1], the second relates to
>     tp->napi[2], and so on. Correct?
>
> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> on the current CPU, which is designated for tp->napi[2], but not for
> tp->napi[1]. Correct?
>
> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> NIC or driver, but "re-filling rx BD producer ring" sounds important. I
> suspect something will break badly if I simply remove it without
> replacing it with something more elaborate. I guess that, along with
> re-filling the rx BD producer ring, it can also process incoming packets.
> Is that possible?
>

Yes, napi[1] work may be called on napi[2]'s CPU, but it generally won't
process any rx packets because the producer index of napi[1] has not changed.
If the producer index did change, then we get a poll from the ISR for napi[1]
to process packets. So it is mostly used to re-fill rx buffers when called
explicitly. However, there could be a small window where the producer index
is incremented but the ISR has not fired yet. In that window it may process
a small number of packets. But I don't think this should lead to a reorder
problem.
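
To illustrate the point above, here is a toy model in plain C. It is not
tg3 code and does not touch NAPI; every name in it (toy_rx_ring,
toy_hw_receive, toy_poll, RING_SIZE) is made up for illustration. It only
assumes a poll routine that compares a hardware producer index with a
software consumer index, and that also performs a refill when one was
requested. Under that assumption, an explicitly scheduled poll processes no
packets unless the producer index has already moved.

```c
/* Toy model of a poll routine driven by producer/consumer indices.
 * Hypothetical names throughout; this is a sketch, not driver code. */
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 512u  /* assumed ring size, chosen only for the example */

struct toy_rx_ring {
	unsigned int hw_prod_idx;   /* advanced by the simulated NIC */
	unsigned int sw_cons_idx;   /* advanced by the poll routine */
	bool refill_requested;      /* loosely stands in for tp->rx_refill */
	unsigned int refills_done;  /* counts refills, for observation only */
};

/* Simulate the NIC posting n received packets (n < RING_SIZE). */
void toy_hw_receive(struct toy_rx_ring *r, unsigned int n)
{
	r->hw_prod_idx = (r->hw_prod_idx + n) % RING_SIZE;
}

/* Process up to budget packets, then honour a pending refill request.
 * Returns the number of packets processed in this call. */
int toy_poll(struct toy_rx_ring *r, int budget)
{
	int done = 0;

	assert(budget >= 0);

	/* Packets are consumed only while the indices differ. */
	while (done < budget && r->sw_cons_idx != r->hw_prod_idx) {
		r->sw_cons_idx = (r->sw_cons_idx + 1) % RING_SIZE;
		done++;
	}

	/* The refill happens regardless of whether packets were seen. */
	if (r->refill_requested) {
		r->refill_requested = false;
		r->refills_done++;
	}

	return done;
}
```

In this model, calling toy_poll() with refill_requested set and unchanged
indices returns 0 and only performs the refill. If toy_hw_receive() runs
first, which corresponds to the small window described above, the same call
also consumes the packets that were just posted.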


> --
> Thanks
> Vitalii