Message-ID: <5c131927-87c1-4e21-90f8-8e3a34cd6dbf@panix.com>
Date: Thu, 27 Feb 2025 09:46:07 -0800
From: Kenneth Crudup <kenny@...ix.com>
To: Mika Westerberg <mika.westerberg@...ux.intel.com>,
Kenneth Crudup <kenny@...ix.com>
Cc: Bjorn Helgaas <helgaas@...nel.org>, ilpo.jarvinen@...ux.intel.com,
Bjorn Helgaas <bhelgaas@...gle.com>, Jian-Hong Pan <jhp@...lessos.org>,
linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
Niklāvs Koļesņikovs <pinkflames.linux@...il.com>,
Andreas Noever <andreas.noever@...il.com>,
Michael Jamet <michael.jamet@...el.com>, Lukas Wunner <lukas@...ner.de>,
Yehezkel Bernat <YehezkelShB@...il.com>, linux-usb@...r.kernel.org
Subject: Re: diagnosing resume failures after disconnected USB4 drives (Was:
Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
So I think the failure mode may also be related, at least in part, to
DP/Tunneling. I finally got another lockup (this time after a
hibernate, which I guess uses some of the same facility). What was
different this time, compared with the times I couldn't reproduce the
lockups, is that I had an external USB-C monitor connected when I
resumed. That matches what happens with my CalDigit dock: when I'm home
(where I sometimes forget to remove the NVMe USB4 adaptor) I always have
my monitor connected to the dock.
See attached dump log. I'm using the (somewhat still experimental) Xe
display driver, but I've seen this same lockup happen with i915.
In any case, I've now reverted 9d573d19, and when I get back to my
CalDigit I can try instrumenting the code paths in the commit and see
exactly where we're locking up.
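
For anyone wanting to script steps 4-7 of Mika's reproduction recipe
quoted below, here is a minimal sketch in Python. The helper names
(pci_addresses, missing_after_resume, run_once) are hypothetical and not
from any existing tool; it assumes lspci and rtcwake are installed and
that run_once() is called as root. It only reports which PCI devices
vanished across the suspend; if the resume hangs, it never gets that far.

```python
import subprocess


def pci_addresses(lspci_output):
    """Return the set of PCI addresses (first column) in `lspci -D` output."""
    return {line.split()[0] for line in lspci_output.splitlines() if line.strip()}


def missing_after_resume(before, after):
    """Return, sorted, the addresses present before suspend but gone after."""
    return sorted(pci_addresses(before) - pci_addresses(after))


def run_once(seconds=30):
    """Snapshot lspci, suspend via rtcwake, snapshot again, report the diff.

    Unplug the cable between host and dock while the system is suspended.
    """
    lspci = ["lspci", "-D"]
    before = subprocess.run(lspci, capture_output=True, text=True, check=True).stdout
    # Same command as step 5 of the recipe: rtcwake -s 30 -m mem
    subprocess.run(["rtcwake", "-s", str(seconds), "-m", "mem"], check=True)
    after = subprocess.run(lspci, capture_output=True, text=True, check=True).stdout
    for addr in missing_after_resume(before, after):
        print("gone after resume:", addr)
```

Nothing runs on import; call run_once() explicitly from a root shell once
the dock and NVMe are connected and the PCIe tunnels are authorized.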
-K
On 2/26/25 13:14, Kenneth Crudup wrote:
> OK, just did a resume after suspended (for an hour, which somehow seems
> to matter) while my CalDigit dock was attached with the ASMedia NVMe
> adaptor at suspend, but both disconnected on resume, and I am indeed
> locked up.
>
> I can attach the "pstore" report if necessary.
>
> Unfortunately I won't be able to get back to the CalDigit until Saturday
> afternoon California time.
>
> I'll be trying all the reverts/commits listed herein and at least check
> for regressions in other cases, though.
>
> -Kenny
>
> On 2/26/25 00:44, Mika Westerberg wrote:
>> Hi Kenneth,
>>
>> On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:
>>>
>>> This is excellent news that you were able to reproduce it- I'd figured this
>>> regression would have been caught already (as I do remember this working
>>> before) and was worried it may have been specific to a particular piece of
>>> hardware (or software setup) on my system.
>>>
>>> I'll see what I can dig up on my end, but as I'm not an expert in these
>>> subsystems I may not be able to diagnose anything until your return.
>>
>> [Back now]
>>
>> My git bisect ended up at this commit:
>>
>> 9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")
>>
>> Adding Lukas who is the expert.
>>
>> My steps to reproduce on Intel Meteor Lake based reference system are:
>>
>> 1. Boot the system up, nothing connected.
>> 2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:
>>
>> [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
>>
>> 3. Authorize PCIe tunnels (whatever your distro provides, my buildroot just
>> has the debugging tools so running 'tbauth -r 301')
>>
>> 4. Check that the PCIe topology matches the expected (lspci)
>>
>> 5. Enter s2idle:
>>
>> # rtcwake -s 30 -mmem
>>
>> 6. Once it is suspended, unplug the cable between the host and the dock.
>>
>> 7. Wait for the resume to happen.
>>
>> Expectation: The system wakes up fine, notices that the TB and PCIe devices
>> are gone, stays responsive and usable.
>>
>> Actual result: Resume never completes.
>>
>> I added "no_console_suspend" to the command line and then did sysrq-w to
>> get a list of blocked tasks. I've attached it just in case it is needed.
>>
>> If I revert the above commit the issue is gone. Now I'm not sure if this is
>> exactly the same issue that you are seeing but nevertheless this is a kind of
>> normal use case so definitely something we should get fixed.
>>
>> Lukas, if you need any more information let me know. I can reproduce this
>> easily.
>>
>>> I also saw some DRM/connected fixes posted to Linus' master so maybe one of
>>> them corrects this new display-crash issue (I'm not home on my big monitor
>>> to be able to test yet).
>>>
>>> -Kenny
>>>
>>> On 2/14/25 08:29, Mika Westerberg wrote:
>>>> Hi,
>>>>
>>>> On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
>>>>>
>>>>> On 2/13/25 05:59, Mika Westerberg wrote:
>>>>>
>>>>>> Hi,
>>>>>
>>>>> As Murphy's would have it, now my crashes are display-driver related (this
>>>>> is Xe, but I've also seen it with i915).
>>>>>
>>>>> Attached here just for the heck of it, but I'll be better testing the NVMe
>>>>> enclosure-related failures this weekend. Stay tuned!
>>>>
>>>> Okay, I checked quickly and no TB related crash there but I was actually
>>>> able to reproduce a hang when I unplug the device chain during suspend. I did
>>>> not yet have time to look into it deeper. I'm sure this has been working
>>>> fine in the past as we tested all kinds of topologies including similar to
>>>> this.
>>>>
>>>> I will be out next week for vacation but will continue after that if the
>>>> problem is not already solved ;-)
>>>>
>>>
>>> --
>>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange
>>> County CA
>
--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange
County CA
Download attachment "pstore-202502262249.tar.bz2" of type "application/x-bzip" (8376 bytes)