linux-kernel - Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Message-ID: <87k0wws3vi.fsf@x220.int.ebiederm.org>
Date:   Mon, 14 Sep 2020 11:46:25 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Joerg Vehlow <lkml@...coder.de>
Cc:     peterz@...radead.org, Steven Rostedt <rostedt@...dmis.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Huang Ying <ying.huang@...el.com>,
        linux-kernel@...r.kernel.org,
        Joerg Vehlow <joerg.vehlow@...-tech.de>,
        Kexec Mailing List <kexec@...ts.infradead.org>
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context


Adding the kexec list as well.

Joerg Vehlow <lkml@...coder.de> writes:

> Hi Eric,
>> What is this patch supposed to be doing?
>>
>> What bug is it fixing?
> This information is in the first message of this mail thread.
> The patch was intended for the active discussion in this thread,
> not for a broad review.

> A short summary: In the rt kernel, a panic in an interrupt context does
> not start the dump-capture kernel, because there is a mutex_trylock in
> __crash_kexec. If this is called in interrupt context, it always fails.
> In the non-rt kernel calling mutex_trylock is not allowed according to
> the comment of the function, but it still works.

Thanks.  For whatever reason I did not see the rest of this thread
when I was replying to your patch.

I get the feeling the rt kernel is breaking this case deliberately.
I don't know of any reason why a trylock couldn't work.

That said I won't propose fixing up the locks that way.

>> A BUG_ON that triggers inside of BUG_ONs seems not just suspect but
>> outright impossible to make use of.
> I am not entirely sure what would happen here. But even if it gets in
> some kind of endless loop, I guess this is ok, because it allows finding
> the problem. A piece of code in the function, that ensures the precondition
> is a lot better than relying on only a comment.
> If this was in mutex_trylock, the bug described above wouldn't have sneaked
> in 12 years ago...

BUG_ON's are more likely to hide a problem than to show it.
Sometimes they are appropriate but they should be avoided as much as
possible.


>> I get the feeling skimming this that it is time to sort out and simplify
>> the locking here, rather than make it more complex, and more likely to
>> fail.
> I would very much like that, but sadly it looks like it is not possible.
> Either it would require blocking locks, that may fail, or not locking at
> all, that may also fail. Using a different kind of lock (like spinlock)
> is also not possible, because spinlock_trylock again uses mutex_trylock
> in the rt kernel.

I think it is possible but the locking needs to be relooked at.

>> I get the feeling that over the years somehow the assumption that the
>> rest of the kernel is broken and that we need to get out of the broken
>> kernel as fast and as simply as possible has been lost.
> Yes I also have the feeling, that the mutexes need fixing, but I wouldn't
> want to post any patch for that. At the moment, given the interface of the mutex,
> this is clearly a bug in kexec, even if it works in the non-rt kernel.

Cleanups that break the code. Sigh.

The code was written correctly for this case and was fine until
8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()").

Mostly because I didn't trust locks given their comparatively high level
of abstraction and, what do you know, that turned out to be correct in
this case.
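
For illustration, an xchg()-style trylock can be sketched in user-space
C11 roughly as follows. This is a minimal sketch, not the kernel code:
the names crash_lock, crash_trylock and crash_unlock are hypothetical,
and atomic_exchange() stands in for the kernel's xchg(). The point is
only that taking such a lock never sleeps and has no owner tracking, so
it behaves the same in any context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
static atomic_int crash_lock = 0;

/* Try to take the lock without sleeping; returns true on success. */
static bool crash_trylock(void)
{
    /* atomic_exchange returns the previous value; 0 means we got it. */
    return atomic_exchange(&crash_lock, 1) == 0;
}

/* Release the lock; the caller must currently hold it. */
static void crash_unlock(void)
{
    int prev = atomic_exchange(&crash_lock, 0);

    /* Debug-only sanity check for this sketch. */
    assert(prev == 1);
    (void)prev; /* avoid an unused warning when NDEBUG is defined */
}
```

A caller would attempt crash_trylock() once, do its work if that
returned true, and then call crash_unlock(); a second attempt while the
lock is held simply returns false instead of blocking.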

It definitely looks like time to see how the locking can be improved on the
kexec on panic code path.

Eric