[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <05f3f8bf-1fab-f48f-e581-479c6fa123ca@samsung.com>
Date: Wed, 4 May 2022 08:05:22 +0200
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: Petr Mladek <pmladek@...e.com>
Cc: John Ogness <john.ogness@...utronix.de>,
Sergey Senozhatsky <senozhatsky@...omium.org>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
linux-kernel@...r.kernel.org,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
linux-amlogic@...ts.infradead.org
Subject: Re: [PATCH printk v5 1/1] printk: extend console_lock for
per-console locking
Hi,
On 03.05.2022 08:49, Petr Mladek wrote:
> On Tue 2022-05-03 01:13:19, Marek Szyprowski wrote:
>> On 02.05.2022 15:17, Petr Mladek wrote:
>>> On Mon 2022-05-02 11:19:07, Marek Szyprowski wrote:
>>>> On 30.04.2022 18:00, John Ogness wrote:
>>>>> On 2022-04-29, Marek Szyprowski <m.szyprowski@...sung.com> wrote:
>>>>>> The same issue happens if I boot with init=/bin/bash
>>>>> Very interesting. Since you are seeing all the output up until you try
>>>>> doing something, I guess receiving UART data is triggering the issue.
>>>> Right, this is how it looks like.
>>>>
>>>>>> I found something really interesting. When lockup happens, I'm still
>>>>>> able to log via ssh and trigger any magic sysrq action via
>>>>>> /proc/sysrq-trigger.
>>>>> If you boot the system and directly login via ssh (without sending any
>>>>> data via serial), can you trigger printk output on the UART? For
>>>>> example, with:
>>>>>
>>>>> echo hello > /dev/kmsg
>>>>>
>>>>> (You might need to increase your loglevel to see it.)
>>>> Data written to /dev/kmsg and all kernel logs were always displayed
>>>> correctly. Also data written directly to /dev/ttyAML0 is displayed
>>>> properly on the console. The latter doesn't however trigger the input
>>>> related activity.
>>>>
>>>> It looks that the data read from the uart is delivered only if other
>>>> activity happens on the kernel console. If I type 'reboot' and press
>>>> enter, nothing happens immediately. If I type 'date >/dev/ttyAML0' via
>>>> ssh then, I only see the date printed on the console. However if I type
>>>> 'date >/dev/kmsg', the the date is printed and reboot happens.
>>> This is really interesting.
>>>
>>> 'date >/dev/kmsg' should be handled like a normal printk().
>>> It should get pushed to the console using printk kthread,
>>> that calls call_console_driver() that calls con->write()
>>> callback. In our case, it should be meson_serial_console_write().
>>>
>>> I am not sure if meson_serial_console_write() is used also
>>> when writing via /dev/ttyAML0.
>>>
>>>>>> It turned out that the UART console is somehow blocked, but it
>>>>>> receives and buffers all the input. For example after issuing "echo
>>>>>> >/proc/sysrq-trigger" from the ssh console, the UART console has been
>>>>>> updated and I see the magic sysrq banner and then all the commands I
>>>>>> blindly typed in the UART console! However this doesn't unblock the
>>>>>> console.
>>>>> sysrq falls back to direct printing. This would imply that the kthread
>>>>> printer is somehow unable to print.
>>>>>
>>>>>> Here is the output of 't' magic sys request:
>>>>>>
>>>>>> https://protect2.fireeye.com/v1/url?k=8649f24d-e73258c4-86487902-74fe48600034-a2ca6bb18361467d&q=1&e=1bc4226f-a422-42b9-95e8-128845b8609f&u=https%3A%2F%2Fpastebin.com%2FfjbRuy4f
>>>>> It looks like the call trace for the printing kthread (pr/ttyAML0) is
>>>>> corrupt.
>>>> Right, good catch. For comparison, here is a 't' sysrq result from the
>>>> 'working' serial console (next-20220429), which happens usually 1 of 4
>>>> boots:
>>>>
>>>> https://protect2.fireeye.com/v1/url?k=610106f1-008a13b6-61008dbe-000babff99aa-11083c39c44861df&q=1&e=2eafad9e-c5d2-4696-9d78-f3b5885256dc&u=https%3A%2F%2Fpastebin.com%2Fmp8zGFbW
>>> Strange. The backtrace is weird here too:
>>>
>>> [ 50.514509] task:pr/ttyAML0 state:R running task stack: 0 pid: 65 ppid: 2 flags:0x00000008
>>> [ 50.514540] Call trace:
>>> [ 50.514548] __switch_to+0xe8/0x160
>>> [ 50.514561] meson_serial_console+0x78/0x118
>>>
>>> There should be kthread() and printk_kthread_func() on the stack.
>>>
>>> Hmm, meson_serial_console+0x78/0x118 is weird on its own.
>>> meson_serial_console is the name of the structure. I would
>>> expect a name of the .write callback, like
>>> meson_serial_console_write()
>> Well, this sounds a bit like a stack corruption. For the reference, I've
>> checked what magic sysrq 't' reports for my other test boards:
>>
>> RaspberryPi4:
>>
>> [ 166.702431] task:pr/ttyS0 state:R stack: 0 pid: 64
>> ppid: 2 flags:0x00000008
>> [ 166.711069] Call trace:
>> [ 166.713647] __switch_to+0xe8/0x160
>> [ 166.717216] __schedule+0x2f4/0x9f0
>> [ 166.720862] log_wait+0x0/0x50
>> [ 166.724081] task:vfio-irqfd-clea state:I stack: 0 pid: 65
>> ppid: 2 flags:0x00000008
>> [ 166.732698] Call trace:
>>
>>
>> ARM Juno R1:
>>
>> [ 74.356562] task:pr/ttyAMA0 state:R running task stack: 0
>> pid: 47 ppid: 2 flags:0x00000008
>> [ 74.356605] Call trace:
>> [ 74.356617] __switch_to+0xe8/0x160
>> [ 74.356637] amba_console+0x78/0x118
>> [ 74.356657] task:kworker/2:1 state:I stack: 0 pid: 48
>> ppid: 2 flags:0x00000008
>> [ 74.356695] Workqueue: 0x0 (mm_percpu_wq)
>> [ 74.356738] Call trace:
>>
>>
>> QEMU virt/arm64:
>>
>> [ 174.155760] task:pr/ttyAMA0 state:S stack: 0 pid: 26
>> ppid: 2 flags:0x00000008
>> [ 174.156305] Call trace:
>> [ 174.156529] __switch_to+0xe8/0x160
>> [ 174.157131] 0xffff5ebbbfdd62d8
> You mentioned in the other mail that the other boards work as
> expected. I mean that console gets stuck only on the meson board.
> Is it true, please?
Right. Even on Meson based boards the console is operational about 1 of
4 boots.
> The stack looks really weird. But another weird thing is that
> even the meson board is able to show the messages, for example,
> using echo hello >/dev/kmsg. It suggests that the kthreads
> somehow work.
>
> There is also a possibility that this code path is optimized
> some special way and the unwinder has troubles to show
> the stack correctly.
I doubt that this is a result of the compiler's optimization. See my
logs from QEMU's virt machine. I've managed to capture 2 states of
ttyAMA0 task. One shows some kind of stack corruption imho. It doesn't
happen always though.
> ...
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Powered by blists - more mailing lists