linux-kernel - Re: [PATCH] printk: Correctly handle preemption in console


Message-ID: <20170125123410.GE6620@pathway.suse.cz>
Date:   Wed, 25 Jan 2017 13:34:10 +0100
From:   Petr Mladek <pmladek@...e.com>
To:     Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Cc:     Steven Rostedt <rostedt@...dmis.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        Peter Zijlstra <peterz@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Jiri Slaby <jslaby@...e.cz>, linux-fbdev@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] printk: Correctly handle preemption in console_unlock()

On Wed 2017-01-18 16:21:41, Sergey Senozhatsky wrote:
> On (01/18/17 14:45), Sergey Senozhatsky wrote:
> [..]
> > 
> > there is a function that clears @console_may_schedule outside of
> > console_sem scope - console_flush_on_panic().
> > so I can *maybe* think of a worst case race condition between
> > 	console_flush_on_panic()->console_may_schedule = 0 on panic CPU
> > and
> > 	console_unlock()->console_may_schedule = 1 from a CPU that the panic
> > CPU failed to stop (smp_send_stop() can return with secondary CPUs still
> > being online).
> 
> what I mean is that we can have, let's say, 2 CPUs spinning in
> console_unlock(), both with @console_may_schedule == 1 (because the
> secondary CPU restores the global @console_may_schedule value). now
> suppose we have a misbehaving scheduler (well, we are in panic after
> all). the secondary CPU will cond_resched() and may lock up somewhere
> in the scheduler, which is fine, we don't care about that secondary CPU
> anyway. but the same can happen to the panic CPU as well.
> 
> what do you think?

Great catch!

console_flush_on_panic() is called after smp_send_stop(),
so only one CPU should be running. But that is not guaranteed.

Better to be on the safe side. I am going to use a conservative
solution that only moves the "again" goto label.
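
A rough userspace sketch of what moving the label could look like. This
is not the actual patch: it assumes the flag is saved once before the
label and cleared after it. The model_* helpers, pending_batches, and
resched_calls are hypothetical names made up for the illustration; only
console_may_schedule and the "again" label come from this thread.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel global discussed in this thread. */
static int console_may_schedule;

/* Hypothetical bookkeeping for the model only. */
static int pending_batches;   /* batches of messages left to flush */
static int resched_calls;     /* how often we "rescheduled" */

/* Hypothetical stand-in: a successful trylock may set the flag again. */
static bool model_console_trylock(void)
{
	console_may_schedule = 1;
	return true;
}

/* Hypothetical stand-in for cond_resched(); it only counts calls. */
static void model_cond_resched(void)
{
	resched_calls++;
}

static void model_console_unlock(void)
{
	int do_cond_resched;

	/* Saved once, from the context that took the lock originally. */
	do_cond_resched = console_may_schedule;
again:
	/* Cleared on every pass, including passes entered via trylock. */
	console_may_schedule = 0;

	if (pending_batches > 0)
		pending_batches--;            /* "flush" one batch */

	if (do_cond_resched)
		model_cond_resched();

	/* More messages arrived meanwhile: retake the lock and retry. */
	if (pending_batches > 0 && model_console_trylock())
		goto again;
}
```

In this model, a caller that enters with console_may_schedule == 0 never
reaches model_cond_resched(), even though the retry path sets the flag
to 1 each time it retakes the lock.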

Just some thoughts for future work:

The dependencies between console_sem, console_may_schedule,
console_locked, and console_suspended are complex like hell.
There are several surprises.

For example, console_trylock() and console_lock() behave differently
when console_suspended is set. console_trylock() fails outright.
console_lock() succeeds, but it modifies neither console_locked
nor console_may_schedule.

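This is the reason why we do not need to check console_suspended
after the "again" goto target: we only get back there through
a successful console_trylock(), which cannot succeed while
console_suspended is set.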
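
The difference can be modelled in userspace roughly as follows. This is
a simplified sketch, not the kernel code: the semaphore is a plain int,
model_console_lock() assumes the semaphore is free instead of sleeping,
and the can_schedule parameter and model_reset() helper are hypothetical
(the real console_trylock() derives the value from the calling context).

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel state named above. */
static int console_sem = 1;        /* 1 = free, 0 = held */
static int console_suspended;
static int console_locked;
static int console_may_schedule;

/* Hypothetical helper: put the model into a known state. */
static void model_reset(int suspended)
{
	console_sem = 1;
	console_suspended = suspended;
	console_locked = 0;
	console_may_schedule = 0;
}

/* Always takes the semaphore; leaves the flags alone when suspended. */
static void model_console_lock(void)
{
	console_sem = 0;               /* the real code may sleep here */
	if (console_suspended)
		return;
	console_locked = 1;
	console_may_schedule = 1;
}

/* Fails, and releases the semaphore again, when suspended. */
static bool model_console_trylock(bool can_schedule)
{
	if (console_sem == 0)
		return false;          /* somebody else holds it */
	console_sem = 0;
	if (console_suspended) {
		console_sem = 1;
		return false;
	}
	console_locked = 1;
	console_may_schedule = can_schedule;
	return true;
}
```

With console_suspended set, model_console_trylock() returns false and
leaves the semaphore free, while model_console_lock() returns with the
semaphore held and both flags still 0.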

IMHO, the key to making it more straightforward is to split
the console flushing functionality out of console_unlock().

That is a bit problematic. console_unlock() guarantees that all
messages are flushed by the time the semaphore is finally released.
IMHO, this guarantee might be relaxed by deferring the flush.
Deferred handling is perfectly fine most of the time.
In emergency situations, either console_sem is available
or we rely on console_flush_on_panic() anyway.

Best Regards,
Petr