linux-kernel - Re: [BUG] amba: Remove deferred device addition

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4a8b0a6d-b1d5-ffe9-8e31-61844cb9bd89@huawei.com>
Date:   Fri, 10 Sep 2021 15:59:43 +0800
From:   Kefeng Wang <wangkefeng.wang@...wei.com>
To:     Saravana Kannan <saravanak@...gle.com>
CC:     Rob Herring <robh+dt@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Frank Rowand <frowand.list@...il.com>,
        <devicetree@...r.kernel.org>, Russell King <linux@...linux.org.uk>,
        "Linus Walleij" <linus.walleij@...aro.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Dmitry Torokhov <dmitry.torokhov@...il.com>,
        <linux-input@...r.kernel.org>
Subject: Re: [BUG] amba: Remove deferred device addition


On 2021/9/9 11:30, Saravana Kannan wrote:
> On Fri, Aug 27, 2021 at 6:09 PM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
>>
>> On 2021/8/28 3:09, Saravana Kannan wrote:
>>> On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
>>>> On 2021/8/27 8:04, Saravana Kannan wrote:
>>>>> On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
>>>>>>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
>>>>>>>>> solution that we have for amba devices. That causes a bunch of other
>>>>>>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
>>>>>>>>> adding more reasons for delaying the addition of the device.
>>>>>> Hi Saravana, I try the link[1], but with it, there is a crash when boot
>>>>>> (qemu-system-arm -M vexpress-a15),
>>> I'm assuming it's this one?
>>> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
>> I use arch/arm/boot/dts/vexpress-v2p-ca15-tc1.dts.
>>
>> qemu-system-arm -M vexpress-a15 -dtb vexpress-v2p-ca15-tc1.dtb -cpu
>> cortex-a15 -smp 2 -m size=3G -kernel zImage -rtc base=localtime -initrd
>> initrd-arm32 -append 'console=ttyAMA0 cma=0 kfence.sample_interval=0
>> earlyprintk debug ' -device virtio-net-device,netdev=net8 -netdev
>> type=tap,id=net8,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
>> -nographic
>>
>>>>> Hi,
>>>>>
>>>>> It's hard to make sense of the logs. Looks like two different threads
>>>>> might be printing to the log at the same time? Can you please enable
>>>>> the config that prints the thread ID (forgot what it's called) and
>>>>> collect this again? With what I could tell the crash seems to be
>>>>> happening somewhere in platform_match(), but that's not related to
>>>>> this patch at all?
>>>> Can you reproduce it? it is very likely related(without your patch, the
>>>> boot is fine),
>>> Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.
>>>
>>>> the NULL ptr is about serio, it is registed from amba driver.
>>>>
>>>> ambakmi_driver_init
>>>>
>>>>     -- amba_kmi_probe
>>>>
>>>>       -- __serio_register_port
>>> Thanks for the pointer. I took a look at the logs and the code. It's
>>> very strange. As you can see from the backtrace, platform_match() is
>>> being called for the device_add() from serio_handle_event(). But the
>>> device that gets added there is on the serio_bus which obviously
>>> should be using the serio_bus_match.
>> Yes, I am confused too.
>>>> +Dmitry and input maillist, is there some known issue about serio ?
>>>>
>>>> I add some debug, the full log is attached.
>>>>
>>>> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
>>>> /devices/platform/bus@...0000/bus@...0000:motherboard-bus/bus@...0000:motherboard-bus:iofpga-bus@...000000/1c060000.kmi/serio0/input/input0
>>>> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
>>>> drv c1090fc0, drv->name vexpress-reset
>>> Based on the logs you added, it's pretty clear we are getting to
>>> platform_match(). It's also strange that the drv->name is
>>> vexpress-reset
>> ...
>>>> [    3.003113][   T41] Backtrace:
>>>> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
>>>> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
>>>> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
>>>> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
>>>> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
>>>> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
>>>> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
>>>> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
>>>> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
>>>> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
>>>> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
>>>> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
>>>> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)
>>> But the platform_match() is happening for the device_add() from
>>> serio_event_handle() that's adding a device to the serio_bus and it
>>> should be using serio_bus_match().
>>>
>>> I haven't reached any conclusion yet, but my current thought process
>>> is that it's either:
>>> 1. My patch is somehow causing list corruption. But I don't directly
>>> touch any list in my change (other than deleting a list entirely), so
>>> it's not clear how that would be happening.
>> Maybe some concurrent driver load？
>>
>>> 2. Without my patch, these AMBA device's probe would be delayed at
>>> least until 5 seconds or possibly later. I'm wondering if my patch is
>>> catching some bad timing assumptions in other code.
>> After Rob's patch, It will retry soon.
>>
>> commit 039599c92d3b2e73689e8b6e519d653fd4770abb
>> Author: Rob Herring <robh@...nel.org>
>> Date:   Wed Apr 29 15:58:12 2020 -0500
>>
>>       amba: Retry adding deferred devices at late_initcall
>>
>>       If amba bus devices defer when adding, the amba bus code simply retries
>>       adding the devices every 5 seconds. This doesn't work well as it
>>       completely unsynchronized with starting the init process which can
>>       happen in less than 5 secs. Add a retry during late_initcall. If the
>>       amba devices are added, then deferred probe takes over. If the
>>       dependencies have not probed at this point, then there's no improvement
>>       over previous behavior. To completely solve this, we'd need to retry
>>       after every successful probe as deferred probe does.
>>
>>       The list_empty() check now happens outside the mutex, but the mutex
>>       wasn't necessary in the first place.
>>
>>       This needed to use deferred probe instead of fragile initcall ordering
>>       on 32-bit VExpress systems where the apb_pclk has a number of probe
>>       dependencies (vexpress-sysregs, vexpress-config).
>>
>>
>>> You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
>>> a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
>>> doesn't mean it's not (2), but if it does, then we know for sure it's
>>> (2).
>> ok, I will try this one, but due to above patch, it may not work.
> Were you able to find anything more?
I can't find any clue， and have no time to check this for now, is there 
any news from your side?
>
> -Saravana
> .
>