Message-Id: <048AD01E-BCB0-447D-A0BB-A1D9074B9D2D@bejarano.io>
Date: Thu, 29 May 2025 10:38:17 +0200
From: Ricard Bejarano <ricard@...arano.io>
To: Andrew Lunn <andrew@...n.ch>
Cc: Mika Westerberg <mika.westerberg@...ux.intel.com>,
 netdev@...r.kernel.org,
 michael.jamet@...el.com,
 YehezkelShB@...il.com,
 andrew+netdev@...n.ch,
 davem@...emloft.net,
 edumazet@...gle.com,
 kuba@...nel.org,
 pabeni@...hat.com
Subject: Re: Poor thunderbolt-net interface performance when bridged

Okay, so in my ignorance of better ways to keep pulling this thread, I've been
sprinkling net->stats increments (of error types we know aren't happening) all
over tbnet_start_xmit on the Tx side (red), to figure out which parts of it
execute during various tests.
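
For reference, each probe looks roughly like this (the exact counter doesn't
matter, as long as it's one we know otherwise stays at zero):

    /* example probe inside tbnet_start_xmit: bump an otherwise-unused error
     * counter as a cheap "did this branch run?" marker; tx_carrier_errors is
     * an arbitrary pick here
     */
    if (data_len > TBNET_MAX_PAYLOAD_SIZE)
        net->stats.tx_carrier_errors++;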

Across those tests, I've found:

1. Both lossy and lossless tests follow the same execution patterns, which rules
   out any specific branch of tbnet_start_xmit that might've only executed in
   lossy tests. In other words, it must be something wrong with logic common to
   both lossy and lossless scenarios.

2. We never run out of buffers, so we never execute L1123-L1124.

3. tbnet_get_tx_buffer never fails, so we never execute L1129.

4. data_len is never > TBNET_MAX_PAYLOAD_SIZE, so we never execute the while
   loop at L1135-L1183.

5. In 7 sk_buffs per iperf3 test (out of tens of thousands), I've observed the
   following three differences:
     5.1. skb->data_len is > 0, so len is < data_len, so we go into the loop at
          L1192-L1208, which we only execute once, so len must be >= data_len at
          the end of its first iteration.
     5.2. frag is < skb_shinfo(skb)->nr_frags, so we execute L1203-L1204.
     5.3. unmap is true after L1204, so we also execute L1213 when that runs.
   Yes, seven. In both 10Mbps (0/4316 lost packets) and 1.1Gbps (43715/474724
   lost packets). Also, from other tests, I'm somewhat sure the very first
   sk_buff in each iperf3 test is non-linear (see the probe sketch right after
   this list).

6. tbnet_xmit_csum_and_map never fails, meaning we never execute L1216.

7. frame_index is never > 0, meaning we execute L1219 only once, i.e. we put
   only one frame on the Tx ring per tbnet_start_xmit invocation.

8. net->svc->prtcstns & TBNET_MATCH_FRAGS_ID is always true, meaning we always
   execute L1222. Makes sense, since the Rx end also supports
   TBNET_MATCH_FRAGS_ID.
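
If it helps reproduce, a more direct way to tally non-linear skbs (instead of
inferring it from which loops run) would be another probe of the same kind,
counter choice again arbitrary:

    /* probe near the top of tbnet_start_xmit: tally skbs whose payload is not
     * entirely in the linear area, i.e. skb->data_len > 0
     */
    if (skb_is_nonlinear(skb))
        net->stats.tx_heartbeat_errors++;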


I'm slightly confused about the following:

skb has three lengths we care about:
  - skb->len is the total length of the payload (linear + non-linear).
  - skb_headlen(skb) is the length of the linear portion of the payload.
  - skb->data_len is the length of the non-linear portion of the payload.
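
For reference, skb_headlen() in include/linux/skbuff.h is literally the
difference of the other two, so skb->len == skb_headlen(skb) + skb->data_len
always holds:

    /* include/linux/skbuff.h */
    static inline unsigned int skb_headlen(const struct sk_buff *skb)
    {
        return skb->len - skb->data_len;
    }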

At the beginning of tbnet_start_xmit, we define the following two local variables:
  - len = skb_headlen(skb)  // which is skb->len - skb->data_len
  - data_len = skb->len
As a side note, this is confusing in and of itself, because we're renaming skb's
headlen to len and skb's len to data_len, both of which mean something else
outside of this function. I'm probably missing the "why", but it makes the code
harder to read.
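
Spelled out in code, the top of the function effectively does:

    /* names as in the driver; note the local data_len is NOT skb->data_len,
     * and the local len is NOT skb->len
     */
    unsigned int len = skb_headlen(skb);   /* linear part only          */
    unsigned int data_len = skb->len;      /* linear + non-linear total */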

data_len and len are only modified at L1194 and L1203, respectively, on the few
occasions when skb is non-linear; otherwise they remain untouched throughout
tbnet_start_xmit's execution. And since we see non-linear skbs in both lossy
and lossless tests, I doubt data_len miscalculations or tbnet_kmap_frag are to
blame for an unaligned memcpy later at L1210, though I'm not ruling that out.
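
If the unaligned-memcpy theory ever becomes the main suspect, a quick probe
right before the copy would confirm or rule it out (I'm assuming the copy's
operands are called dest and src; adjust to whatever the locals actually are):

    /* hypothetical alignment probe just before the memcpy at L1210 */
    if (((unsigned long)dest | (unsigned long)src) & 0x3)
        net->stats.rx_length_errors++;   /* arbitrary otherwise-unused counter */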


I also found this confusing in main.c:L1316:

>    /* ThunderboltIP takes advantage of TSO packets but instead of
>     * segmenting them we just split the packet into Thunderbolt
>     * frames (maximum payload size of each frame is 4084 bytes) and
>     * calculate checksum over the whole packet here.
>     *
>     * The receiving side does the opposite if the host OS supports
>     * LRO, otherwise it needs to split the large packet into MTU
>     * sized smaller packets.
>     *
>     * In order to receive large packets from the networking stack,
>     * we need to announce support for most of the offloading
>     * features here.
>     */

This looks weird to me. This driver does not support LRO, so why would we rely
on having it? For peers running other operating systems?
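
For context, my (possibly naive) reading is that the comment refers to the
usual feature-advertising pattern, something along these lines (a generic
illustration, not the driver's exact flag set):

    /* advertising scatter-gather + TSO + checksum offload is what makes the
     * stack hand the driver large, unsegmented (and non-linear) skbs
     */
    dev->hw_features = NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_TSO | NETIF_F_TSO6;
    dev->features |= dev->hw_features;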


So my next best candidate is tbnet_xmit_csum_and_map, but not for any particular
reason; I just haven't had time to understand it yet. From the looks of it, it's
the one place where checksums are calculated right before we put the frames on
the Tx ring, so if something is wrong with checksums, it should be there.
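
For my own orientation, the generic primitives for checksumming a whole skb
look like this (just an illustration; I don't yet know how, or whether,
tbnet_xmit_csum_and_map actually uses them):

    /* fold a checksum over the payload starting at the transport header */
    int off = skb_transport_offset(skb);
    __wsum wsum = skb_checksum(skb, off, skb->len - off, 0);
    __sum16 folded = csum_fold(wsum);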


Any ideas?

RB

