Message-ID: <cc9cfa46-ff93-4fff-9333-23f959e14fa8@amd.com>
Date: Tue, 9 Dec 2025 10:49:46 -0600
From: Mario Limonciello <mario.limonciello@....com>
To: Mika Westerberg <mika.westerberg@...ux.intel.com>,
 "Chia-Lin Kao (AceLan)" <acelan.kao@...onical.com>, Sanath.S@....com
Cc: Andreas Noever <andreas.noever@...il.com>,
 Mika Westerberg <westeri@...nel.org>, Yehezkel Bernat
 <YehezkelShB@...il.com>, linux-usb@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width

+Sanath too

On 12/9/2025 1:06 AM, Mika Westerberg wrote:
> +Mario since this is AMD related.
> 
> [Also keeping all the context].
> 

Thanks for adding me.  A few other thoughts I have:

1) Is it possible that the USB4 controller in the monitor is powering up 
or exiting a low power state during the first hotplug?

2) Are you sure this only happens on an AMD host?  What if you cold boot 
the monitor with an Intel host?


> On Tue, Dec 09, 2025 at 01:41:41PM +0800, Chia-Lin Kao (AceLan) wrote:
>> When plugging in a Dell U2725QE Thunderbolt monitor, the kernel produces
>> a call trace during initial enumeration. The device automatically
>> disconnects and reconnects ~3 seconds later, and works correctly on the
>> second attempt.
>>
>> Issue Description:
>> ==================
>> The Dell U2725QE (USB4 device 8087:b26) requires additional time during
>> link width negotiation from single lane to dual lane. On first plug, the
>> following sequence occurs:
>>
>> 1. Port state reaches TB_PORT_UP (link established, single lane)
>> 2. Path activation begins immediately
>> 3. tb_path_activate() -> tb_port_write() returns -ENOTCONN (error -107)
>> 4. Call trace is generated at tb_path_activate()
>> 5. Device disconnects/reconnects automatically after ~3 seconds
>> 6. Second attempt succeeds with full dual-lane bandwidth
>>
>> First attempt dmesg (failure):
>> -------------------------------
>> [   36.030347] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 9000/9000 Mb/s
>> [   36.030613] thunderbolt 0000:c7:00.6: 2: USB3 tunnel creation failed
>> [   36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
>> [   36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145
>>
>> Second attempt dmesg (success):
>> --------------------------------
>> [   40.440012] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 36000/36000 Mb/s
>> [   40.440261] thunderbolt 0000:c7:00.6: 2:16: maximum required bandwidth for USB3 tunnel 9000 Mb/s
>> [   40.440269] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): activating
>> [   40.440271] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): allocating initial bandwidth 9000/9000 Mb/s
>>
>> The bandwidth difference (9000 vs 36000 Mb/s) indicates the first attempt
>> occurs while the link is still in single-lane mode.
>>
>> Root Cause Analysis:
>> ====================
>> The error originates from the Thunderbolt/USB4 device hardware itself:
>>
>> 1. Port config space read/write returns TB_CFG_ERROR_PORT_NOT_CONNECTED
>> 2. This gets translated to -ENOTCONN in tb_cfg_get_error()
>> 3. The port's control channel is temporarily unavailable during state
>>     transition from single lane to dual lane (lane bonding)
>>
>> The comment in drivers/thunderbolt/ctl.c explains this is expected:
>>    "Port is not connected. This can happen during surprise removal.
>>     Do not warn."
>>
>> Attempted Solutions:
>> ====================
>> 1. Retry logic on -ENOTCONN in tb_path_activate():
>>     Result: Caused host port (0:0) lockup with hundreds of "downstream
>>     port is locked" errors. Rejected by user.
>>
>> 2. Increased tb_port_wait_for_link_width() timeout from 100ms to 3000ms:
>>     Result: Did not resolve the issue. The timeout increase alone is
>>     insufficient because the port state hasn't reached TB_PORT_UP when
>>     lane bonding is attempted.
>>
>> 3. Added msleep(2000) at various points in enumeration flow:
>>     Locations tested:
>>     - Before tb_switch_configure(): Works ✓
>>     - Before tb_switch_add(): Works ✓
>>     - Before usb4_port_hotplug_enable(): Works ✓
>>     - After tb_switch_add(): Doesn't work ✗
>>     - In tb_configure_link(): Doesn't work ✗
>>     - In tb_switch_lane_bonding_enable(): Doesn't work ✗
>>     - In tb_port_wait_for_link_width(): Doesn't work ✗
>>
>>     The pattern shows the delay must occur BEFORE hotplug enable, which
>>     happens early in tb_switch_port_hotplug_enable() -> usb4_port_hotplug_enable().
>>
>> Current Workaround:
>> ===================
>> Add a 2-second delay in tb_wait_for_port() when the port state reaches
>> TB_PORT_UP. This is the earliest point where we know:
>> - The link is physically established
>> - The device is responsive
>> - But lane width negotiation may still be in progress
>>
>> This location is chosen because:
>> 1. It's called during port enumeration before any tunnel creation
>> 2. The port has just transitioned to TB_PORT_UP state
>> 3. Allows sufficient time for lane bonding to complete
>> 4. Avoids affecting other code paths
>>
>> Testing Results:
>> ================
>> With this patch:
>> - No call trace on first plug
>> - Device enumerates correctly on first attempt
>> - Full bandwidth (36000 Mb/s) available immediately
>> - No disconnect/reconnect cycle
>> - USB and PCIe tunnels create successfully
>>
>> Without this patch:
>> - Call trace on every first plug
>> - Only 9000 Mb/s bandwidth (single lane) on first attempt
>> - Automatic disconnect/reconnect after ~3 seconds
>> - Second attempt works with 36000 Mb/s
>>
>> Discussion Points for RFC:
>> ===========================
>> 1. Is a fixed 2-second delay acceptable, or should we poll for a
>>     specific hardware state?
>>
>> 2. Should we check PORT_CS_18_TIP (Transition In Progress) bit instead
>>     of using a fixed delay?
>>
>> 3. Is there a better location for this delay in the enumeration flow?
>>
>> 4. Should this be device-specific (based on vendor/device ID) or apply
>>     to all USB4 devices?
>>
>> 5. The 100ms timeout in tb_switch_lane_bonding_enable() may be too
>>     short for other devices as well. Should we increase it universally?
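On discussion point 2: a bounded poll is an alternative to a fixed msleep(2000). It returns as soon as the transition finishes and still has an upper limit. Below is a minimal user-space sketch. It assumes a port register with a "transition in progress" bit that clears once lane bonding is done. The callbacks, struct fake_port and FAKE_TIP_MASK are hypothetical stand-ins for tb_port_read() on PORT_CS_18 and the real PORT_CS_18_TIP definition. This thread has not established whether TIP actually covers the U2725QE case.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Arbitrary bit for this simulation; NOT the real PORT_CS_18_TIP value. */
#define FAKE_TIP_MASK (1u << 5)

/* Hypothetical hooks standing in for a config-space read and msleep(). */
typedef uint32_t (*read_reg_fn)(void *ctx);
typedef void (*sleep_fn)(void *ctx, unsigned int ms);

/*
 * Poll until every bit in @mask reads as clear, or give up once
 * @timeout_ms has been spent sleeping. Returns 0 when the bits are
 * clear, -ETIMEDOUT otherwise.
 */
static int poll_bits_clear(read_reg_fn read_reg, sleep_fn do_sleep, void *ctx,
			   uint32_t mask, unsigned int timeout_ms,
			   unsigned int interval_ms)
{
	unsigned int waited = 0;

	for (;;) {
		if (!(read_reg(ctx) & mask))
			return 0;
		if (waited >= timeout_ms)
			return -ETIMEDOUT;
		do_sleep(ctx, interval_ms);
		waited += interval_ms;
	}
}

/* Simulated port whose "transition in progress" bit stays set for a while. */
struct fake_port {
	unsigned int elapsed_ms;	/* simulated time since link came up */
	unsigned int busy_for_ms;	/* how long the transition lasts */
	uint32_t tip_mask;		/* bit reported while busy */
};

static uint32_t fake_read(void *ctx)
{
	struct fake_port *p = ctx;

	return p->elapsed_ms < p->busy_for_ms ? p->tip_mask : 0;
}

static void fake_sleep(void *ctx, unsigned int ms)
{
	struct fake_port *p = ctx;

	p->elapsed_ms += ms;	/* advance simulated time instead of sleeping */
}
```

Inside the driver, the same pattern is usually written with read_poll_timeout() from <linux/iopoll.h>. A real version would also have to decide what to do when the read itself fails with -ENOTCONN, which is the error reported above.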
> 
> We should understand the issue better. This is an Intel Goshen Ridge based
> monitor, which I'm pretty sure does not require additional quirks; at least
> I have not heard of any issues like this. I suspect it is the combination of
> AMD and Intel hardware that is causing the issue.
> 
> Looking at your dmesg, even before your issue there is a suspicious log
> entry:
> 
> [    5.852476] localhost kernel: [31] thunderbolt 0000:c7:00.5: acking hot unplug event on 0:6
> [    5.852492] localhost kernel: [12] thunderbolt 0000:c7:00.5: 0:6: DP IN resource unavailable: adapter unplug
> 
> This causes the DP tunnel to be torn down. It is unexpected for the host
> router to send this, unless you plugged the monitor directly into one of
> the Type-C ports at this time?
> 
> Could you also take trace logs of the issue? Instructions:
> 
> https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> https://github.com/intel/tbtools/wiki/Useful-Commands#tracing
> 
> Please provide both the full dmesg and the trace.out, or the merged one.
> That would (hopefully) allow us to see what is going on.

We need to be careful about trusting the LLM's conclusions.

Hopefully the traces requested by Mika show what's going on.

If they don't, then I think the next step will be a USB4 analyzer.

> 
>> Hardware Details:
>> =================
>> Device: Dell U2725QE Thunderbolt Monitor
>> USB4 Router: 8087:b26 (Intel USB4 controller)
>> Host: AMD Thunderbolt 4 controller (0000:c7:00.6)

What sort of hardware is the AMD host?  PCI BDF is meaningless.

>>
>> Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@...onical.com>
>> ---
>> Full dmesg log available at: https://paste.ubuntu.com/p/CXs2T4XzZ3/
>> ---
>>   drivers/thunderbolt/switch.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
>> index b3948aad0b955..e0c65e5fb0dca 100644
>> --- a/drivers/thunderbolt/switch.c
>> +++ b/drivers/thunderbolt/switch.c
>> @@ -530,6 +530,8 @@ int tb_wait_for_port(struct tb_port *port, bool wait_if_unplugged)
>>   			return 0;
>>   
>>   		case TB_PORT_UP:
>> +			msleep(2000);
>> +			fallthrough;
>>   		case TB_PORT_TX_CL0S:
>>   		case TB_PORT_RX_CL0S:
>>   		case TB_PORT_CL1:
>> -- 
>> 2.43.0
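On discussion point 4: if the delay ends up device-specific rather than applied to every TB_PORT_UP transition, one option is a small ID-keyed table that is checked before sleeping. This is a minimal sketch with hypothetical names (struct link_quirk, QUIRK_LINK_SETTLE_DELAY, lookup_quirks()). It is not the layout of the driver's existing quirk handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: router needs extra settle time after TB_PORT_UP. */
#define QUIRK_LINK_SETTLE_DELAY (1u << 0)

struct link_quirk {
	uint16_t vendor;
	uint16_t device;
	uint32_t flags;
};

/*
 * Illustrative table only. 8087:b26 is the router ID reported for the
 * Dell U2725QE in this thread; whether any quirk is warranted is still open.
 */
static const struct link_quirk link_quirks[] = {
	{ 0x8087, 0x0b26, QUIRK_LINK_SETTLE_DELAY },
};

/* Return the quirk flags for a router ID, or 0 if it is not listed. */
static uint32_t lookup_quirks(uint16_t vendor, uint16_t device)
{
	for (size_t i = 0; i < sizeof(link_quirks) / sizeof(link_quirks[0]); i++) {
		if (link_quirks[i].vendor == vendor &&
		    link_quirks[i].device == device)
			return link_quirks[i].flags;
	}
	return 0;
}
```

Note that 8087:b26 identifies the Intel USB4 router inside the monitor, not the monitor model. A match on it would therefore also apply to other Goshen Ridge based devices, which is why the root cause matters before choosing this route.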

