Message-ID: <CALs4sv0uo7+RQ0hNmNC=3tD8gk=MS4chygeiuzf4Ve9iiKt9uA@mail.gmail.com>
Date: Mon, 1 Nov 2021 14:40:05 +0530
From: Pavan Chebbi <pavan.chebbi@...adcom.com>
To: Vitaly Bursov <vitaly@...sov.com>
Cc: Siva Reddy Kallam <siva.kallam@...adcom.com>,
Prashant Sreedharan <prashant@...adcom.com>,
Michael Chan <mchan@...adcom.com>,
Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: tg3 RX packet re-order in queue 0 with RSS
On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@...sov.com> wrote:
>
>
>
> 01.11.2021 09:06, Pavan Chebbi wrote:
> > On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>
> >>
> >>
> >>>> 29.10.2021 08:04, Pavan Chebbi wrote:
> >>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>>>
> >>>>
> >>>> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>>>> <siva.kallam@...adcom.com> wrote:
> >>>>>>>>
> >>>>>>>> Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into this issue.
> >>>>>>>> We will provide our feedback on this very soon.
> >>>>>>>>
> >>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> We found an occasional, random (sometimes happens, sometimes not)
> >>>>>>>>> packet re-order when the NIC receives UDP multicast for an
> >>>>>>>>> application that is sensitive to packet re-ordering. A network
> >>>>>>>>> capture with tcpdump sometimes shows the re-order and sometimes not
> >>>>>>>>> (e.g. no re-order on the host, re-order in a container at the same
> >>>>>>>>> time). In a pcap file the re-ordered packets have correct timestamps
> >>>>>>>>> - the delayed packet carries an earlier timestamp than the packet
> >>>>>>>>> before it:
> >>>>>>>>> 1.00s packet1
> >>>>>>>>> 1.20s packet3
> >>>>>>>>> 1.10s packet2
> >>>>>>>>> 1.30s packet4
> >>>>>>>>>
> >>>>>>>>> There is about 300 Mbps of traffic on this NIC, and the server is
> >>>>>>>>> busy with its computational application work (hyper-threading
> >>>>>>>>> enabled, about 50% overall idle).
> >>>>>>>>>
> >>>>>>>>> The NIC is HPE's 4-port 331i adapter (BCM5719), with the default
> >>>>>>>>> ring and coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>>>
> >>>>>>>>> After further investigation, I believe there are two separate
> >>>>>>>>> issues in the tg3.c driver. Both can be reproduced with iperf3 and
> >>>>>>>>> unicast UDP.
> >>>>>>>>>
> >>>>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>>>
> >>>>>>>>> 1. Packet re-order.
> >>>>>>>>>
> >>>>>>>>> The driver calls napi_schedule(&tnapi->napi) when handling the
> >>>>>>>>> interrupt; however, it sometimes also calls
> >>>>>>>>> napi_schedule(&tp->napi[1].napi), which handles RX queue 0 as well:
> >>>>>>>>>
> >>>>>>>>> https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>>>
> >>>>>>>>> static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>>> {
> >>>>>>>>> struct tg3 *tp = tnapi->tp;
> >>>>>>>>>
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>> /* Refill RX ring(s). */
> >>>>>>>>> if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>>> ....
> >>>>>>>>> } else if (work_mask) {
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>> if (tnapi != &tp->napi[1]) {
> >>>>>>>>> tp->rx_refill = true;
> >>>>>>>>> napi_schedule(&tp->napi[1].napi);
> >>>>>>>>> }
> >>>>>>>>> }
> >>>>>>>>> ...
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> From the napi_schedule() code, this schedules RX 0 traffic handling
> >>>>>>>>> on the current CPU, which at that moment is handling one of queues
> >>>>>>>>> RX 1-3.
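> >>>>>>>>>
> >>>>>>>>> For context, napi_schedule() always queues the NAPI instance on the
> >>>>>>>>> CPU it is called from - roughly this (paraphrased from
> >>>>>>>>> net/core/dev.c, details vary between kernel versions):
> >>>>>>>>>
> >>>>>>>>> void __napi_schedule(struct napi_struct *n)
> >>>>>>>>> {
> >>>>>>>>>         unsigned long flags;
> >>>>>>>>>
> >>>>>>>>>         local_irq_save(flags);
> >>>>>>>>>         /* adds n to the *current* CPU's poll_list and raises
> >>>>>>>>>          * NET_RX_SOFTIRQ, so the poll runs on this CPU
> >>>>>>>>>          */
> >>>>>>>>>         ____napi_schedule(this_cpu_ptr(&softnet_data), n);
> >>>>>>>>>         local_irq_restore(flags);
> >>>>>>>>> }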
> >>>>>>>>>
> >>>>>>>>> At least two traffic flows are required - one on RX queue 0 and the
> >>>>>>>>> other on any other queue (1-3). Re-ordering may happen only on the
> >>>>>>>>> flow from queue 0; the second flow works fine.
> >>>>>>>>>
> >>>>>>>>> No idea how to fix this.
> >>>>>>>
> >>>>>>> In the case of RSS, the actual RX rings are 1 to 4.
> >>>>>>> The NAPIs of those rings are indeed processing the packets.
> >>>>>>> The explicit napi_schedule of napi[1] only re-fills the rx BD
> >>>>>>> producer ring, because it is shared with return rings 1-4.
> >>>>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>>>> receiving packets on RX 0 then RSS must have been disabled.
> >>>>>>> Can you please check?
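> >>>>>>>
> >>>>>>> For reference, the refill path looks roughly like this (paraphrased
> >>>>>>> from tg3_poll_work() in tg3.c; check your kernel's source for the
> >>>>>>> exact code):
> >>>>>>>
> >>>>>>>         /* With RSS, only napi[1] moves the buffers recycled by all
> >>>>>>>          * RX napis back onto the single producer ring shared with
> >>>>>>>          * the hardware (tp->napi[0].prodring).
> >>>>>>>          */
> >>>>>>>         if (tg3_flag(tp, ENABLE_RSS) && tnapi == &tp->napi[1]) {
> >>>>>>>                 struct tg3_rx_prodring_set *dpr = &tp->napi[0].prodring;
> >>>>>>>                 int i, err = 0;
> >>>>>>>
> >>>>>>>                 tp->rx_refill = false;
> >>>>>>>                 for (i = 1; i <= tp->rxq_cnt; i++)
> >>>>>>>                         err |= tg3_rx_prodring_xfer(tp, dpr,
> >>>>>>>                                         &tp->napi[i].prodring);
> >>>>>>>                 /* ...then the std/jumbo producer mailboxes are
> >>>>>>>                  * written to hand the buffers back to the chip.
> >>>>>>>                  */
> >>>>>>>         }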
> >>>>>>>
> >>>>>>
> >>>>>> # ethtool -i enp2s0f0
> >>>>>> driver: tg3
> >>>>>> version: 3.137
> >>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>>>> expansion-rom-version:
> >>>>>> bus-info: 0000:02:00.0
> >>>>>> supports-statistics: yes
> >>>>>> supports-test: yes
> >>>>>> supports-eeprom-access: yes
> >>>>>> supports-register-dump: yes
> >>>>>> supports-priv-flags: no
> >>>>>>
> >>>>>> # ethtool -l enp2s0f0
> >>>>>> Channel parameters for enp2s0f0:
> >>>>>> Pre-set maximums:
> >>>>>> RX: 4
> >>>>>> TX: 4
> >>>>>> Other: 0
> >>>>>> Combined: 0
> >>>>>> Current hardware settings:
> >>>>>> RX: 4
> >>>>>> TX: 1
> >>>>>> Other: 0
> >>>>>> Combined: 0
> >>>>>>
> >>>>>> # ethtool -x enp2s0f0
> >>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>>> 0: 0 1 2 3 0 1 2 3
> >>>>>> 8: 0 1 2 3 0 1 2 3
> >>>>>> 16: 0 1 2 3 0 1 2 3
> >>>>>> 24: 0 1 2 3 0 1 2 3
> >>>>>> 32: 0 1 2 3 0 1 2 3
> >>>>>> 40: 0 1 2 3 0 1 2 3
> >>>>>> 48: 0 1 2 3 0 1 2 3
> >>>>>> 56: 0 1 2 3 0 1 2 3
> >>>>>> 64: 0 1 2 3 0 1 2 3
> >>>>>> 72: 0 1 2 3 0 1 2 3
> >>>>>> 80: 0 1 2 3 0 1 2 3
> >>>>>> 88: 0 1 2 3 0 1 2 3
> >>>>>> 96: 0 1 2 3 0 1 2 3
> >>>>>> 104: 0 1 2 3 0 1 2 3
> >>>>>> 112: 0 1 2 3 0 1 2 3
> >>>>>> 120: 0 1 2 3 0 1 2 3
> >>>>>> RSS hash key:
> >>>>>> Operation not supported
> >>>>>> RSS hash function:
> >>>>>> toeplitz: on
> >>>>>> xor: off
> >>>>>> crc32: off
> >>>>>>
> >>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3 and enp2s0f0-rx-4 interrupts, all on
> >>>>>> different CPU cores. The kernel also has "threadirqs" on its command
> >>>>>> line; I didn't check whether this parameter affects the issue.
> >>>>>>
> >>>>>> Yes, some things are numbered from 0 and others from 1, sorry for the
> >>>>>> confusion in terminology. What I meant:
> >>>>>> - There are 4 RX rings/queues, which I count starting from 0, so 0..3.
> >>>>>>   RX0 is the first queue/ring that actually receives the traffic.
> >>>>>>   RX0 is handled by the enp2s0f0-rx-1 interrupt.
> >>>>>> - These map to tp->napi[i], but i is in 1..4, so the first receiving
> >>>>>>   queue corresponds to tp->napi[1], the second to tp->napi[2], and so
> >>>>>>   on. Correct?
> >>>>>>
> >>>>>> Suppose tg3_rx() is called for tp->napi[2]. This function most likely
> >>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets
> >>>>>> in tp->napi[2]. And, under some conditions (RSS and work_mask), it
> >>>>>> calls napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1]
> >>>>>> work on the current CPU - the one designated for tp->napi[2], not for
> >>>>>> tp->napi[1]. Correct?
> >>>>>>
> >>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>>>> NIC or driver; "re-filling the rx BD producer ring" sounds important.
> >>>>>> I suspect something will break badly if I simply remove it without
> >>>>>> replacing it with something more elaborate. I guess that along with
> >>>>>> re-filling the rx BD producer ring it can also process incoming
> >>>>>> packets. Is that possible?
> >>>>>>
> >>>>>
> >>>>> Yes, napi[1]'s work may be called on napi[2]'s CPU, but it generally
> >>>>> won't process any rx packets because the producer index of napi[1] has
> >>>>> not changed. If the producer count did change, then we get a poll from
> >>>>> the ISR for napi[1] to process packets. So the explicit call is mostly
> >>>>> used to re-fill rx buffers. However, there could be a small window
> >>>>> where the producer index is incremented but the ISR has not fired yet;
> >>>>> in that case it may process a small number of packets. But I don't
> >>>>> think this should lead to a reorder problem.
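> >>>>>
> >>>>> To illustrate, the gate in tg3_poll_work() is just this (roughly, from
> >>>>> tg3.c):
> >>>>>
> >>>>>         /* Read packets only if the hardware advanced this return
> >>>>>          * ring's producer index past our consumer pointer.  When
> >>>>>          * napi[1] is scheduled purely for the refill, the two are
> >>>>>          * normally equal and tg3_rx() is skipped.
> >>>>>          */
> >>>>>         if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>>>>                 work_done += tg3_rx(tnapi, budget - work_done);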
> >>>>>
> >>>>
> >>>> I tried to reproduce this without the bridge and veth interfaces, and
> >>>> it seems not to be reproducible that way, so traffic forwarding via a
> >>>> bridge interface may be necessary. It also does not happen if the
> >>>> traffic load is low, but a moderate load is enough - e.g. two 100 Mbps
> >>>> streams with 130-byte packets. It's easier to reproduce with a higher
> >>>> load.
> >>>>
> >>>> With about the same setup as in the original message (bridge + veth,
> >>>> 2 network namespaces) and the irqbalance daemon stopped, if traffic
> >>>> flows via enp2s0f0-rx-2 and enp2s0f0-rx-4 there is no re-ordering.
> >>>> enp2s0f0-rx-1 still gets some interrupts, but at a much lower rate than
> >>>> 2 and 4.
> >>>>
> >>>> namespace 1:
> >>>> # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >>>> [ 4] 0.00-300.00 sec 6.72 GBytes 192 Mbits/sec 0.008 ms 3805/55508325 (0.0069%)
> >>>> [ 4] Sent 55508325 datagrams
> >>>>
> >>>> iperf Done.
> >>>>
> >>>> namespace 2:
> >>>> # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >>>> [ 4] 0.00-300.00 sec 6.83 GBytes 196 Mbits/sec 0.005 ms 3873/56414001 (0.0069%)
> >>>> [ 4] Sent 56414001 datagrams
> >>>>
> >>>> iperf Done.
> >>>>
> >>>>
> >>>> With the same configuration but a different IP address, so that
> >>>> enp2s0f0-rx-1 is used instead of enp2s0f0-rx-4, there is re-ordering.
> >>>>
> >>>>
> >>>> namespace 1 (client IP was changed):
> >>>> # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >>>> [ 4] 0.00-300.00 sec 6.32 GBytes 181 Mbits/sec 0.007 ms 8506/52172059 (0.016%)
> >>>> [ 4] Sent 52172059 datagrams
> >>>> [SUM] 0.0-300.0 sec 2452 datagrams received out-of-order
> >>>>
> >>>> iperf Done.
> >>>>
> >>>> namespace 2:
> >>>> # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >>>> [ 4] 0.00-300.00 sec 6.59 GBytes 189 Mbits/sec 0.006 ms 6302/54463973 (0.012%)
> >>>> [ 4] Sent 54463973 datagrams
> >>>>
> >>>> iperf Done.
> >>>>
> >>>> Swapping the IP addresses between these namespaces also changes which
> >>>> namespace exhibits the issue; it follows the IP address.
> >>>>
> >>>>
> >>>> Is there something I could check to confirm whether this behavior is
> >>>> related to the napi_schedule(&tp->napi[1].napi) call?
> >>>
> >>> In tg3_msi_1shot() you could store the CPU assigned to tnapi 1
> >>> (inside struct tg3_napi), and then in tg3_poll_work() you could add
> >>> another check after
> >>> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>> something like
> >>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> >>> and only then execute tg3_rx() - see the rough sketch below.
> >>>
> >>> This may stop tnapi 1 from reading rx packets on the CPU from which the
> >>> refill is called.
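> >>>
> >>> Untested sketch of the intent (non-napi[1] rings keep calling tg3_rx()
> >>> as before; assigned_cpu would be a new field added to struct tg3_napi,
> >>> the name is only a placeholder):
> >>>
> >>>         /* tg3_msi_1shot(): remember which CPU services this vector */
> >>>         tnapi->assigned_cpu = smp_processor_id();
> >>>
> >>>         /* tg3_poll_work(): napi[1] reads packets only on its own CPU */
> >>>         if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >>>                 if (tnapi != &tp->napi[1] ||
> >>>                     tnapi->assigned_cpu == smp_processor_id())
> >>>                         work_done += tg3_rx(tnapi, budget - work_done);
> >>>         }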
> >>>
> >>
> >> It didn't work for me; perhaps I did something wrong. If tg3_rx() is not
> >> called there's an infinite loop, and after I added "work_done = budget;"
> >> it still doesn't work - traffic does not flow.
> >>
> >
> > I think the easiest way is to modify the tg3_rx() calling condition
> > inside tg3_poll_work() like below:
> >
> >         if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >                 if (tnapi != &tp->napi[1] ||
> >                     (tnapi == &tp->napi[1] && !tp->rx_refill)) {
> >                         work_done += tg3_rx(tnapi, budget - work_done);
> >                 }
> >         }
> >
> > This will prevent reading rx packets when napi[1] is scheduled only for refill.
> > Can you see if this works?
> >
>
> With this change it doesn't hang and traffic is received, but I don't see
> a difference. I suspect that tg3_poll_work() is called again, maybe from
> tg3_poll_msix(), and the refill happens first, and then the packets are
> processed anyway.
>
OK, I see it now. Let me try this out myself and I will get back on this.
However, can you check with your debug prints whether there is any
correlation between the time and number of prints where napi 1 is reading
packets on an unassigned CPU and the time and number of packets received
out of order up the stack? Do they match each other? If not, we may be
suspecting napi 1 incorrectly here.
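
For example, a simple per-queue counter next to your existing debug print
(reusing the assigned_cpu field from your patch below) would make that
comparison easy. A rough, untested sketch - dbg_wrong_cpu_rx is just a
placeholder field for this experiment, not an existing tg3 member:

        /* placeholder counter added to struct tg3_napi for debugging */
        unsigned long           dbg_wrong_cpu_rx;

        /* in tg3_rx(), next to the ratelimited print */
        if (tnapi == &tp->napi[1] &&
            tnapi->assigned_cpu != smp_processor_id())
                tnapi->dbg_wrong_cpu_rx++;

The total over an iperf3 run can then be compared against the
"datagrams received out-of-order" count that iperf3 reports.
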
> +static int tg3_cc;
> +module_param(tg3_cc, int, 0644);
> +MODULE_PARM_DESC(tg3_cc, "cpu check");
> +
> ...
> +        if (tnapi->assigned_cpu != smp_processor_id())
> +                net_dbg_ratelimited("tg3 refill %d budget %d napi %ld cpu %d %d",
> +                                    tp->rx_refill, budget, tnapi - tp->napi,
> +                                    tnapi->assigned_cpu, smp_processor_id());
>          napi_gro_receive(&tnapi->napi, skb);
>
> ...
> +        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> +                if (tnapi != &tp->napi[1] ||
> +                    (tnapi == &tp->napi[1] && !tp->rx_refill) ||
> +                    tg3_cc == 0) {
> +                        work_done += tg3_rx(tnapi, budget - work_done);
> +                }
> +        }
>
> with tg3_cc set to 1:
>
> [212915.661886] net_ratelimit: 650710 callbacks suppressed
> [212915.661889] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212915.661890] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212915.661891] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212915.661892] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212915.661893] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212915.661915] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212915.661916] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212915.661917] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212915.661918] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212915.661919] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212920.665912] net_ratelimit: 251117 callbacks suppressed
> [212920.665914] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212920.665915] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212920.665917] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212920.665918] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212920.665919] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212920.665932] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212920.665933] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212920.665935] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212920.665936] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212920.665937] tg3 refill 0 budget 60 napi 1 cpu 0 3
>
> and with tg3_cc set to 0:
>
> [213686.689867] tg3 refill 1 budget 64 napi 1 cpu 0 3
> [213686.689869] tg3 refill 1 budget 63 napi 1 cpu 0 3
> [213686.689870] tg3 refill 1 budget 62 napi 1 cpu 0 3
> [213686.689871] tg3 refill 1 budget 61 napi 1 cpu 0 3
> [213686.689872] tg3 refill 1 budget 60 napi 1 cpu 0 3
> [213686.689890] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [213686.689891] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [213686.689892] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [213686.689893] tg3 refill 0 budget 61 napi 1 cpu 0 3
>
> affinity:
> echo 1 > /proc/irq/106/smp_affinity # enp2s0f0-rx-1
> echo 2 > /proc/irq/107/smp_affinity # enp2s0f0-rx-2
> echo 4 > /proc/irq/108/smp_affinity # enp2s0f0-rx-3
> echo 8 > /proc/irq/109/smp_affinity # enp2s0f0-rx-4
>
> --
> Thanks
> Vitalii
>