Message-ID: <20241010232656.7fc6359e@kf-ir16>
Date: Thu, 10 Oct 2024 23:26:56 -0500
From: Aaron Rainbolt <arainbolt@...cus.org>
To: Mika Westerberg <mika.westerberg@...ux.intel.com>
Cc: YehezkelShB@...il.com, michael.jamet@...el.com,
 andreas.noever@...il.com, linux-usb@...r.kernel.org, mmikowski@...cus.org,
 linux-kernel@...r.kernel.org
Subject: Re: USB-C DisplayPort display failing to stay active with Intel
 Barlow Ridge USB4 controller, power-management related issue?

On Thu, 10 Oct 2024 07:49:19 +0300
Mika Westerberg <mika.westerberg@...ux.intel.com> wrote:

> Hi,
> 
> On Wed, Oct 09, 2024 at 10:01:18PM -0500, Aaron Rainbolt wrote:
> > We're experiencing a Linux kernel bug affecting multiple Clevo
> > X370SNx1 laptops (specifically the X370SNW1 variant). The bug
> > appears to be present in kernels greater than or equal to 6.5,
> > worsening significantly with kernel 6.11.2 (latest stable at time
> > of this writing). It is unclear if all of the issues encountered
> > are the same bug, however the primary problem we've run into
> > appears to be a consequence of the power management code involving
> > Intel Barlow Ridge controllers and DisplayPort. The issue occurs
> > with in-kernel Nouveau drivers and also with proprietary NVIDIA
> > drivers.
> > 
> > When a DisplayPort monitor is attached to these laptops via a USB-C
> > connection, the monitor is recognized by the system and comes on for
> > approximately 15 seconds. It then blanks out and is automatically
> > disconnected from the system as if it had been unplugged. It will
> > remain that way indefinitely until unplugged and replugged, or until
> > something "jiggles" (for lack of a better term) the thunderbolt
> > driver. When either of these things occur, the display will
> > re-attach and come back on for 15 seconds, then blank out and
> > detach again. There are various different things that can "jiggle"
> > the thunderbolt driver, including but not limited to:
> > 
> > * Running `lspci -k` (this one came as a particular surprise)
> > * Removing and re-inserting the thunderbolt driver (`sudo modprobe
> > -r thunderbolt; sleep 1; sudo modprobe thunderbolt`)
> > * Running `nvidia-detector` while proprietary NVIDIA drivers are
> > loaded  
> 
> Or just disabling runtime PM, I presume.
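
For reference, a minimal sketch of what disabling runtime PM for the
controller could look like from userspace. It assumes the controller's
PCI functions are bound to the `thunderbolt` driver and uses the
standard sysfs `power/control` attribute ("on" keeps the device awake,
"auto" allows runtime suspend). The `set_runtime_pm` name and the
`SYSFS_ROOT` override are made up for illustration; this is not a
tested fix.

```shell
#!/bin/sh
# Sketch only: set the runtime PM policy of every PCI device bound to a driver.
# SYSFS_ROOT is a hypothetical override (useful for dry runs); default is /sys.
set_runtime_pm() {
    policy="$1"                  # "on" = stay awake, "auto" = allow runtime suspend
    driver="${2:-thunderbolt}"   # PCI driver name, assumed to be "thunderbolt"
    root="${SYSFS_ROOT:-/sys}"

    case "$policy" in
        on|auto) ;;
        *) echo "usage: set_runtime_pm on|auto [driver]" >&2; return 2 ;;
    esac

    for dev in "$root/bus/pci/drivers/$driver"/*; do
        # Skip entries that are not devices (bind, unbind, module, ...).
        [ -e "$dev/power/control" ] || continue
        echo "$policy" > "$dev/power/control" || return 1
        echo "$(basename "$dev"): $(cat "$dev/power/control")"
    done
}
```

Run as root, `set_runtime_pm on` would keep the devices awake and
`set_runtime_pm auto` would restore the default policy.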
> 
> > It is possible to mitigate this issue by simply running
> > `sudo modprobe -r thunderbolt` or `sudo rmmod thunderbolt` and then
> > leaving the driver unloaded. USB-C displays become stable after
> > this - they are recognized when attached and remain recognized and
> > functional indefinitely as one would expect.
> > 
> > We believe this is related to the Intel Barlow Ridge USB4 controller
> > because:
> > 
> > * Removing the thunderbolt driver restores normal display operation.
> > * This issue was *not* a problem on Clevo X370SNx machines, which
> >   are identical to the X370SNx1 except that the Maple Ridge TBT
> >   controller on the board has been replaced with a Barlow Ridge
> >   USB4 controller in the X370SNx1.
> > * This problem does not occur on the affected models with the 6.1
> >   kernel. It occurs with the 6.5 kernel and on all newer kernels we
> >   have tried.
> > 
> > Furthermore, from inspecting the Thunderbolt driver code, we believe
> > this is related to the power management features of the driver,
> > because:
> > 
> > * There is only one 15-second timeout defined in the driver source
> >   code, that being TB_AUTOSUSPEND_DELAY in drivers/thunderbolt/tb.h
> > * On earlier kernels (Ubuntu’s variant of 6.8 at least), displays
> > are stable even when the thunderbolt driver is loaded if we:
> >   * Remove the thunderbolt driver
> >   * Attach a USB-C dock
> >   * Attach displays to the dock (we used 2 4K HDMI monitors)
> >   * Reload the thunderbolt driver
> > 
> > During our investigation, we discovered commit
> > a75e0684efe567ae5f6a8e91a8360c4c1773cf3a (patch on mailing list at
> > https://lore.kernel.org/linux-usb/20240213114318.3023150-1-mika.westerberg@linux.intel.com/)
> > which appears to be a fix for this exact problem. It adds a quirk
> > for Intel Barlow Ridge controllers, which detects when a
> > DisplayPort device has been plugged directly into the USB4 port
> > (thus using "redrive" mode), and instructs the power management
> > subsystem to not power the chip down during this time if so.
> > Unfortunately, this quirk seems to be silently ignored, as we built
> > a custom kernel with some `printk` lines added to the
> > `tb_enter_redrive` and `tb_exit_redrive` functions to announce when
> > they were called, and nothing in the dmesg log indicated that they
> > had been called when we did this.
> > 
> > This bug is easily reproducible using the stock kernels in Kubuntu
> > 22.04, Kubuntu 24.04, Kali Linux 2024.2, and Fedora Workstation
> > Rawhide. Similar behavior is observed across all of these
> > distributions.
> > 
> > We built the 6.11.2 kernel from source and tested it on Kubuntu
> > 24.04, but while the kernel built, installed, and functioned
> > properly in most respects, it actually made the problem with USB-C
> > displays worse. As long as the thunderbolt driver was loaded, no
> > displays were detected when plugged in (not for even a short length
> > of time), and when the thunderbolt driver was unloaded, displays
> > would only be recognized and function if there was only one display
> > attached. Attaching a second display resulted in the first external
> > display becoming detached and the second display not coming on.
> > Unplugging the second display resulted in the first display
> > reattaching. This machine supports up to three external displays
> > and this has proven to be achievable and stable with earlier
> > kernels. No valuable error messages were logged in dmesg when these
> > problems occurred.
> > 
> > Our testing has been limited to the Clevo X370SNW1 model, however we
> > expect that the X370SNV1 model will exhibit the same issues as it
> > uses very similar internal components on the system board.
> > 
> > This is basically the extent of our knowledge at this point. We
> > attempted various patches on Ubuntu's 6.8 kernel to resolve the
> > issue, all without success:
> > 
> > * We attempted reverting fd4d58d1fef9ae9b0ee235eaad73d2e0a6a73025
> >   (thunderbolt: Enable CL2 low power state), which had no effect.
> > * We noticed that one of the Barlow Ridge bridge controllers
> >   listed by `lspci -k` appeared to not have its device ID in
> >   drivers/thunderbolt/nhi.h and there was a corresponding quirk in
> >   drivers/thunderbolt/quirks.c that looked like it might be vaguely
> >   related to the issue (specifically quirk_usb3_maximum_bandwidth),
> >   so we tried adding that device to the appropriate files in order
> >   to make that quirk apply to that device as well. This had no
> >   visible effect on the kernel's operation and did not resolve the
> >   issue.
> > * After narrowing it down to `quirk_block_rpm_in_redrive`, we
> > attempted adding a new `thunderbolt.kf_force_redrive` kernel
> > parameter in drivers/thunderbolt/tb.c that forced the code in
> >   `tb_enter_redrive` and `tb_exit_redrive` to be executed even *if*
> > the device didn't have the appropriate quirk bit set, in the hopes
> > that this would make the quirk execute and resolve the issue. What
> > ended up happening was somehow `tb_enter_redrive` was never called
> > at all and `tb_exit_redrive` was called. This in turn made it so
> > that no USB-C displays would be recognized, not even for a short
> > period of time, if the thunderbolt driver was loaded.
> > * Looking at PCI vendor IDs, we noticed that the PCI vendor ID used
> > to recognize all Intel controllers in drivers/thunderbolt/quirks.c
> > was 0x8087, whereas the Barlow Ridge controller in our device
> > reported a vendor ID of 0x8086. On the off chance that this was a
> > typo of epic proportions, we tried adjusting all of the occurrences
> > of 0x8087 in the tb_quirks[] array to PCI_VENDOR_ID_INTEL (which is
> > defined as 0x8086 in include/linux/pci_ids.h). This had no visible
> > effect on the kernel's behavior, and did not resolve the issue.
> > (Presumably there's something going on with the IDs there that
> > we're not aware of.)
> > 
> > As to my speculation as to what's wrong, I believe this is likely a
> > combination of two things:
> > 
> > * Some data in the `tb_quirks` array in
> > drivers/thunderbolt/quirks.c is incorrect and leading to the Barlow
> > Ridge controllers not being recognized as needing the DisplayPort
> > redrive mode quirk.
> > * The code in drivers/thunderbolt/tb.c `tb_dp_resource_unavailable`
> >   that controls whether or not to run `tb_enter_redrive` is faulty
> > in some way and is not calling `tb_enter_redrive` in all scenarios
> > where it is necessary. To be clear, the exact code I'm talking
> > about is this chunk from the aforementioned function:
> > 
> >         tunnel = tb_find_tunnel(tb, TB_TUNNEL_DP, in, out);
> >         if (tunnel)
> >                 tb_deactivate_and_free_tunnel(tunnel);
> >         else 
> >                 tb_enter_redrive(port);
> > 
> > Finally, this is probably a result of me misreading the driver code
> > somehow, but I was surprised by the following conditional at the top
> > of `tb_enter_redrive`:
> > 
> >         if (!(sw->quirks & QUIRK_KEEP_POWER_IN_DP_REDRIVE))
> >                 return;
> > 
> > To me this reads as "if the DP redrive quirk bit is set, return and
> > do nothing. Otherwise, if the bit is not set, run the quirk
> > function."  
> 
> There is the "return;" which reads that if the quirk is not set,
> return from this function early.
> 
> > This is the opposite of what I would expect - shouldn't the code
> > run if the bit is set, not if it is clear? Or does the bit being
> > unset mean that the quirk is active? (I do not believe that this is
> > the root cause of the issue because even when I forced this
> > function to run any time it was invoked, it wasn't being invoked at
> > all.)  
> 
> Okay, thanks for the very detailed report.
> 
> We need a bit more information to investigate this. The commit you
> referred to is exactly for this purpose and I'm surprised it did not
> work, but the Barlow Ridge PCI IDs are surprising too, as if this
> would have some old firmware or something.
> 
> Can you share full dmesg with the repro and "thunderbolt.dyndbg=+p" in
> the kernel command line?

The full log is very long, so I've included it as an email attachment.
The exact steps taken after booting with the requested kernel parameter
were:

1. Booted with the thunderbolt.dyndbg=+p kernel parameter, with nothing
  plugged into the USB-C ports.
2. After login, hot-plugged two USB-C cables. This time, the displays
  came up and stayed on (this happens sometimes).
3. Unplugged both cables.
4. Replugged both. This time, the displays did not show anything.
5. Ran `lspci -k`, which "jiggled" the displays, and they came back on.
6. After ~15s, the displays blacked out again.
7. Saved the dmesg output to the attached file after about 30s.
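
Steps 1 and 7 can be sketched roughly as follows, for anyone who wants
to capture a comparable log. The `has_tb_dyndbg` and `capture_tb_log`
names are illustrative only (they are not existing tools), and the
output path is whatever the caller passes in.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not existing tools.

# Succeeds if the kernel command line enables thunderbolt dynamic debug.
# The optional file argument lets the check run against a saved copy.
has_tb_dyndbg() {
    cmdline_file="${1:-/proc/cmdline}"
    grep -qF 'thunderbolt.dyndbg=+p' "$cmdline_file"
}

# Dumps the kernel ring buffer to the given file (may require root).
capture_tb_log() {
    out="$1"
    if ! has_tb_dyndbg; then
        echo "warning: thunderbolt.dyndbg=+p not on the kernel command line" >&2
    fi
    dmesg > "$out"
}
```

For example, `capture_tb_log tb.log` about 30s after the displays
black out.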

The laptop's firmware is fully up-to-date. One of the fixes we tried
was installing Windows 11, updating the firmware, and then
re-installing Kubuntu 24.04. This had no effect on the issue.

Notes:

* Kernel 6.1 does not exhibit this timeout. 6.5 and later do.
* Windows 11 had very similar behavior before installing Windows
  updates. After updating, it was fixed.
* All distros and Windows 11 were tested on the same hardware with the
  latest firmware, so we know this is not a hardware failure.
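
To check whether the blackout coincides with the controller being
runtime-suspended, something like the following could be run right
after the displays come on. It is only a sketch: it assumes the
controller's PCI functions are bound to the `thunderbolt` driver and
reads the standard sysfs `power/runtime_status` attribute once per
second. The `watch_tb_rpm` name and the `SYSFS_ROOT` override are
hypothetical.

```shell
#!/bin/sh
# Sketch only: print the runtime PM status of thunderbolt-bound PCI devices
# once per second. SYSFS_ROOT is a hypothetical override; default is /sys.
watch_tb_rpm() {
    seconds="${1:-30}"           # how many samples to take
    root="${SYSFS_ROOT:-/sys}"
    i=0
    while [ "$i" -lt "$seconds" ]; do
        for dev in "$root"/bus/pci/drivers/thunderbolt/*; do
            # Skip entries that are not devices (bind, unbind, module, ...).
            [ -r "$dev/power/runtime_status" ] || continue
            printf '%s %s %s\n' "$i" "$(basename "$dev")" \
                "$(cat "$dev/power/runtime_status")"
        done
        i=$((i + 1))
        if [ "$i" -lt "$seconds" ]; then
            sleep 1
        fi
    done
}
```

Each output line is the sample number, the PCI address, and the status
string reported by the kernel (for example "active" or "suspended").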

Thanks for your help!

View attachment "2024-10-10-thunderbolt.dyndbg+p.log" of type "text/x-log" (137940 bytes)
