linux-kernel - Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let ksoftirqd do its job"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20161122152733.GH25080@e106950-lin.cambridge.arm.com>
Date:   Tue, 22 Nov 2016 15:27:33 +0000
From:   Brian Starkey <brian.starkey@....com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexander Potapenko <glider@...gle.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let
 ksoftirqd do its job"

Hi Eric,

On Tue, Nov 22, 2016 at 06:29:33AM -0800, Eric Dumazet wrote:
>On Tue, Nov 22, 2016 at 2:33 AM, Brian Starkey <brian.starkey@....com> wrote:
>>
>> Hi,
>>
>> On Fri, Nov 18, 2016 at 01:40:43AM +0100, Thomas Gleixner wrote:
>>>
>>> Brian,
>>>
>>> On Thu, 17 Nov 2016, Brian Starkey wrote:
>>>>
>>>> No joy with this patch :-(
>>>>
>>>> I had to add an ioaddr argument because apparently that macro depends
>>>> on local context (yuck...), but it doesn't help my issue.
>>>>
>>>> FWIW I don't see any timeouts, either with or without the patch.
>>>> (I don't know for sure, but I would guess that the model of the
>>>> network card doesn't model whatever stall that loop is checking for.
>>>> It probably just completes all MMU operations immediately)
>>>
>>>
>>> Is there a chance that you enable trace points at the kernel command line?
>>>
>>>  trace_event=sched_wakeup,sched_switch,irq_handler_entry,irq_handler_exit,softirq_raise,softirq_entry,softirq_exit
>>>
>>> should be enough for a start. All we need aside of that is a trigger to
>>> stop the trace so we can actually see the events around the time where
>>> things go stale.
>>>
>>> I assume that the whole issue is visible throughout the slow progress of
>>> init towards a working system, so for a start it would be sufficient to add
>>> something like this into the startup sequence at some point:
>>>
>>> mount -t debugfs debugfs /sys/kernel/debug
>>> echo 0 >/sys/kernel/debug/tracing/tracing_on
>>>
>>> The only interesting challange is to get the trace data out of the
>>> system. The trace is accessible via:
>>>
>>> cat /sys/kernel/tracing/trace
>>>
>>> So if your ssh works at some point, that might be an option or you just try
>>> to store it over NFS (which will be slow, but better than nothing). Maybe
>>> you have a better idea :)
>>
>>
>> I finally managed to pry some traces out this morning. It seems like
>> the system struggles to even invoke echo when it's doing badly.
>>
>> Trace before 4cd13c21b207: https://drive.google.com/open?id=0B8siaK6ZjvEwU21wNTdZS29kVXc
>> Trace after 4cd13c21b207: https://drive.google.com/open?id=0B8siaK6ZjvEwbXVzcnpieVkzWFU
>> (btw, if there's a preferred way to send the logs let me know. I
>> wasn't sure large or non-text attachments would be well received)
>>
>> I'm not sure how much help the trace is, but it does look like the
>> system is spending far too much time in the ethernet device's IRQ
>> handler to be healthy.
>>
>
>Thanks a lot Brian
>
>Can you confirm interrupt handler is smc911x_interrupt() ?
>
>(ie : is SMC_USE_PXA_DMA / SMC_USE_DMA defined or not ?)
>

Looks like there's a few similarly named devices and drivers. Mine is
an SMSC LAN91C111 using the smc91x driver in
drivers/net/ethernet/smsc/smc91x.c, rather than smc911x.c. So the
interrupt handler is smc_interrupt()

CONFIG_ARCH_PXA is not set, nor is SMC_USE_PXA_DMA or SMC_USE_DMA.

-Brian

>
>>
>> Thanks,
>> Brian
>>>
>>>
>>> Thanks,
>>>
>>>         tglx
>>>