[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6d7a7ecc-1364-5cbe-0485-01d693dbdc6c@arm.com>
Date: Mon, 21 Aug 2023 13:39:11 +0100
From: Robin Murphy <robin.murphy@....com>
To: Shreeya Patel <shreeya.patel@...labora.com>,
Saravana Kannan <saravanak@...gle.com>
Cc: Geert Uytterhoeven <geert+renesas@...der.be>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
stable@...r.kernel.org, John Stultz <jstultz@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Alexey Kuznetsov <kuznet@....inr.ac.ru>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
Jakub Kicinski <kuba@...nel.org>,
Rob Herring <robh@...nel.org>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
Yoshihiro Shimoda <yoshihiro.shimoda.uh@...esas.com>,
Andy Shevchenko <andy.shevchenko@...il.com>,
Sudeep Holla <sudeep.holla@....com>,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
Naresh Kamboju <naresh.kamboju@...aro.org>,
Basil Eljuse <Basil.Eljuse@....com>,
Ferry Toth <fntoth@...il.com>, Arnd Bergmann <arnd@...db.de>,
Anders Roxell <anders.roxell@...aro.org>,
linux-pm@...r.kernel.org, Nathan Chancellor <nathan@...nel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Sasha Levin <sashal@...nel.org>, linux-kernel@...r.kernel.org,
"gustavo.padovan@...labora.com" <gustavo.padovan@...labora.com>,
Ricardo Cañuelo Navarro
<ricardo.canuelo@...labora.com>,
Guillaume Charles Tucker <guillaume.tucker@...labora.com>,
usama.anjum@...labora.com, kernelci@...ts.linux.dev
Subject: Re: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() &
deferred_probe_timeout interaction
On 2023-08-21 12:35, Shreeya Patel wrote:
>
> On 19/08/23 01:49, Saravana Kannan wrote:
>> On Thu, Aug 17, 2023 at 4:13 PM Shreeya Patel
>> <shreeya.patel@...labora.com> wrote:
>>> Hi Geert, Saravana,
>>>
>>> On 18/08/23 00:03, Saravana Kannan wrote:
>>>> On Thu, Aug 17, 2023 at 4:37 AM Shreeya Patel
>>>> <shreeya.patel@...labora.com> wrote:
>>>>> Hi Greg,
>>>>>
>>>>> On 16/08/23 20:33, Greg Kroah-Hartman wrote:
>>>>>> On Wed, Aug 16, 2023 at 03:09:27PM +0530, Shreeya Patel wrote:
>>>>>>> On 13/06/22 15:40, Greg Kroah-Hartman wrote:
>>>>>>>> From: Saravana Kannan<saravanak@...gle.com>
>>>>>>>>
>>>>>>>> [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
>>>>>>>>
>>>>>>>> Mounting NFS rootfs was timing out when deferred_probe_timeout was
>>>>>>>> non-zero [1]. This was because ip_auto_config() initcall times out
>>>>>>>> waiting for the network interfaces to show up when
>>>>>>>> deferred_probe_timeout was non-zero. While ip_auto_config() calls
>>>>>>>> wait_for_device_probe() to make sure any currently running deferred
>>>>>>>> probe work or asynchronous probe finishes, that wasn't
>>>>>>>> sufficient to
>>>>>>>> account for devices being deferred until deferred_probe_timeout.
>>>>>>>>
>>>>>>>> Commit 35a672363ab3 ("driver core: Ensure
>>>>>>>> wait_for_device_probe() waits
>>>>>>>> until the deferred_probe_timeout fires") tried to fix that by
>>>>>>>> making
>>>>>>>> sure wait_for_device_probe() waits for deferred_probe_timeout to
>>>>>>>> expire
>>>>>>>> before returning.
>>>>>>>>
>>>>>>>> However, if wait_for_device_probe() is called from the
>>>>>>>> kernel_init()
>>>>>>>> context:
>>>>>>>>
>>>>>>>> - Before deferred_probe_initcall() [2], it causes the boot
>>>>>>>> process to
>>>>>>>> hang due to a deadlock.
>>>>>>>>
>>>>>>>> - After deferred_probe_initcall() [3], it blocks kernel_init() from
>>>>>>>> continuing till deferred_probe_timeout expires and beats
>>>>>>>> the point of
>>>>>>>> deferred_probe_timeout that's trying to wait for userspace
>>>>>>>> to load
>>>>>>>> modules.
>>>>>>>>
>>>>>>>> Neither of this is good. So revert the changes to
>>>>>>>> wait_for_device_probe().
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> -https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@TYAPR01MB4544.jpnprd01.prod.outlook.com/
>>>>>>>> [2]
>>>>>>>> -https://lore.kernel.org/lkml/YowHNo4sBjr9ijZr@dev-arch.thelio-3990X/
>>>>>>>> [3] -https://lore.kernel.org/lkml/Yo3WvGnNk3LvLb7R@linutronix.de/
>>>>>>> Hi Saravana, Greg,
>>>>>>>
>>>>>>>
>>>>>>> KernelCI found this patch causes the
>>>>>>> baseline.bootrr.deferred-probe-empty test to fail on r8a77960-ulcb,
>>>>>>> see the following details for more information.
>>>>>>>
>>>>>>> KernelCI dashboard link:
>>>>>>> https://linux.kernelci.org/test/plan/id/64d2a6be8c1a8435e535b264/
>>>>>>>
>>>>>>> Error messages from the logs :-
>>>>>>>
>>>>>>> + UUID=11236495_1.5.2.4.5
>>>>>>> + set +x
>>>>>>> + export
>>>>>>> 'PATH=/opt/bootrr/libexec/bootrr/helpers:/lava-11236495/1/../bin:/sbin:/usr/sbin:/bin:/usr/bin'
>>>>>>> + cd /opt/bootrr/libexec/bootrr
>>>>>>> + sh helpers/bootrr-auto
>>>>>>> e6800000.ethernet
>>>>>>> e6700000.dma-controller
>>>>>>> e7300000.dma-controller
>>>>>>> e7310000.dma-controller
>>>>>>> ec700000.dma-controller
>>>>>>> ec720000.dma-controller
>>>>>>> fea20000.vsp
>>>>>>> feb00000.display
>>>>>>> fea28000.vsp
>>>>>>> fea30000.vsp
>>>>>>> fe9a0000.vsp
>>>>>>> fe9af000.fcp
>>>>>>> fea27000.fcp
>>>>>>> fea2f000.fcp
>>>>>>> fea37000.fcp
>>>>>>> sound
>>>>>>> ee100000.mmc
>>>>>>> ee140000.mmc
>>>>>>> ec500000.sound
>>>>>>> /lava-11236495/1/../bin/lava-test-case
>>>>>>> <8>[ 17.476741] <LAVA_SIGNAL_TESTCASE
>>>>>>> TEST_CASE_ID=deferred-probe-empty RESULT=fail>
>>>>>>>
>>>>>>> Test case failing :-
>>>>>>> Baseline Bootrr deferred-probe-empty test
>>>>>>> -https://github.com/kernelci/bootrr/blob/main/helpers/bootrr-generic-tests
>>>>>>>
>>>>>>> Regression Reproduced :-
>>>>>>>
>>>>>>> Lava job after reverting the commit 5ee76c256e92
>>>>>>> https://lava.collabora.dev/scheduler/job/11292890
>>>>>>>
>>>>>>>
>>>>>>> Bisection report from KernelCI can be found at the bottom of the
>>>>>>> email.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shreeya Patel
>>>>>>>
>>>>>>> #regzbot introduced: 5ee76c256e92
>>>>>>> #regzbot title: KernelCI: Multiple devices deferring on
>>>>>>> r8a77960-ulcb
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * **
>>>>>>> * If you do send a fix, please include this trailer: *
>>>>>>> * Reported-by: "kernelci.org bot" <bot@...> *
>>>>>>> * *
>>>>>>> * Hope this helps! *
>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
>>>>>>>
>>>>>>> stable-rc/linux-5.10.y bisection:
>>>>>>> baseline.bootrr.deferred-probe-empty on
>>>>>>> r8a77960-ulcb
>>>>>> You are testing 5.10.y, yet the subject says 5.17?
>>>>>>
>>>>>> Which is it here?
>>>>> Sorry, I accidentally used the lore link for 5.17 while reporting this
>>>>> issue,
>>>>> but this test does fail on all the stable releases from 5.10 onwards.
>>>>>
>>>>> stable 5.15 :-
>>>>> https://linux.kernelci.org/test/case/id/64dd156a5ac58d0cf335b1ea/
>>>>> mainline :-
>>>>> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
>>>>>
>>>> Shreeya, can you try the patch Geert suggested and let us know if it
>>>> helps? If not, then I can try to take a closer look.
>>> I tried to test the kernel with 9be4cbd09da8 but it didn't change the
>>> result.
>>> https://lava.collabora.dev/scheduler/job/11311615
>>>
>>> Also, I am not sure if this can change things but just FYI, KernelCI
>>> adds some kernel parameters when running these tests and one of the
>>> parameter is deferred_probe_timeout=60.
>> Ah this is good to know.
>>
>>> You can check this in the definition details given in the Lava job. I
>>> also tried to remove this parameter and rerun the test but again I got
>>> the same result.
>> How long does the test wait after boot before checking for the
>> deferred devices list?
>>
>
> AFAIK, script for running the tests is immediately ran after the boot
> process is complete so there is no wait time.
Regardless of what the kernel is doing, it seems like a fundamentally
dumb test to specifically ask deferred probe to wait for up to a minute
then complain that it hasn't finished after 11 seconds :/
If anything, it seems plausible that the "regression" might actually be
the correct behaviour, and it was wrong before. I can't manage to pull
up a boot log for a pre-5.10 kernel since all the async stuff on the
KernelCI dashboard always just times out for me with a helpful "Error
while loading data from the server (error code: 0)", but what would be
interesting is whether those devices on the list are expected to
successfully probe anyway - the mainline log below also shows other
stuff failing to probe and CPUs failing to come online, so it's clearly
not a very happy platform to begin with.
Robin.
>>> I will try to add 9be4cbd09da8 to mainline kernel and see what results I
>>> get.
>> Now I'm confused. What do you mean by mainline? Are you saying the tip
>> of tree of Linus's tree is also hitting this issue?
>
>
> KernelCI runs tests on different kernel branches and trees, we also have
> this same test running on mainline tree.
> Following is the link to the dashboard for it and as you can see, it
> does fail there too.
>
>
> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
>
>
>> -Saravana
>>
Powered by blists - more mailing lists