Date:   Wed, 5 Jul 2017 14:02:53 +0900
From:   Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To:     Petr Mladek <pmladek@...e.com>
Cc:     Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Steven Rostedt <rostedt@...dmis.org>, Jan Kara <jack@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Rafael J . Wysocki" <rjw@...ysocki.net>,
        Eric Biederman <ebiederm@...ssion.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Jiri Slaby <jslaby@...e.com>, Pavel Machek <pavel@....cz>,
        Andreas Mohr <andi@...as.de>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCHv3 2/5] printk: introduce printing kernel thread

Hello,

Petr, thanks for commenting/taking a look

On (07/04/17 16:03), Petr Mladek wrote:
> On Fri 2017-06-30 21:42:24, Sergey Senozhatsky wrote:
> > Hello,
> > 
> > On (06/30/17 13:54), Petr Mladek wrote:
> > > > but.....
> > > > the opposite possibility is that messages either won't be printed
> > > > soon (until next printk or console_unlock()) or won't be printed
> > > > ever at all (in case of sudden system death). I don't think it's
> > > > a good alternative.
> > > 
> > > I see it like a weighing machine. There is a "guaranteed" output on
> > > one side and a softlockups prevention on the other side. The more
> > > we prevent the softlockups the less we guarantee the output.
> > 
> > I apply a very simple litmus test. if the answer to the question
> > "so we leave console_unlock() and there are pending messages,
> > who and when is going to flush the remaining messages?" is
> > "something sometime in the future" then it's a no-no.
> >
> > "something sometime in the future" is equal to "no one".
> 
> I am afraid that it is hard or even impossible to have offloading
> and be sure that the other side continues. IMHO, there always
> will be a possibility that something could go wrong.
> 
> I think that insisting on keeping the current guarantees might
> force us to keep the current code (status quo).
> 
> Please, let me better explain the problems that I see with
> the current code.
> 
> 
> > > My main unresolved doubts about this patchset are:
> > > 
> > > 1. It gives other processes only a very small chance to take
> > >    over the job. They either need to call console_trylock()
> > >    in a very small "race" window or they need to call
> > >    console_lock(), where console_lock() only signals
> > >    that the caller is willing to take the job and puts
> > >    it to sleep.
> > 
> > printk_kthread does console_lock().
> 
> But this is not enough. In fact, it makes things worse. Let me
> explain:
> 
> We are talking about flood of messages when we go over
> atomic_print_limit and activate offloading. Let's look
> at it in more details:
> 
> CPU0				CPU1
> 
> 
> console_unlock()
> 
>   // handle atomic_print_limit lines
>   // enable offloading
> 
>   wake_up_process(printk_kthread)
> 
>   // handle many other lines
> 
> 				// printk_kthread finally gets CPU
> 				// woken in printk_kthread_func()
> 
> 				console_lock()
> 				  down()
> 				    // lock most likely taken by CPU0
> 				    list_add_tail(&waiter.list)
> 				    schedule()
> 
>    // handle one more line
>    up_console_sem()
>      list_del(&waiter.list)
>      wake_up_process(waiter->task)
> 
>    console_trylock()
>    // fails because the lock is
>    // taken by the waiter
> 
> Regression 1: Nobody flushes the messages to the console until
> 	the printk_kthread() is scheduled again, really takes
> 	the console_lock(), and calls console_unlock().

yeah, ok. I'm not sure I can fully agree that this is a regression.

this is how the code is meant to work, isn't it?
some other CPU could have slept on console_sem from e.g. tty_open()
by the time CPU0 attempts to re-take the lock, forcing CPU0
  a) to fail console_trylock()
and
  b) to delay the flush of the pending messages.

so, sorry, I'm not sure I see why this is a regression. the behaviour
is pretty much the same. what am I missing?

... yet, after offloading we leave console_unlock() with the knowledge
that there is a process in console_sem waiting for that lock. which
sounds like something, but at the same time... that's pretty much it.
IOW, "so what?". we just know that there is a process in the waiter
list. if we can't be sure that wake_up_process() from up_console_sem()
succeeds, then that knowledge doesn't buy us anything at all. real
guarantees (that the console_sem owner is running on some CPU) are a
bit harder to achieve, given that console_unlock() can schedule.
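
for the record, the up() path really is just "pick the first waiter and
wake_up_process() it" -- roughly, paraphrasing kernel/locking/semaphore.c
from memory:

	static noinline void __sched __up(struct semaphore *sem)
	{
		struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
						struct semaphore_waiter, list);
		list_del(&waiter->list);
		waiter->up = true;
		/* no guarantee the waiter actually gets a CPU any time soon */
		wake_up_process(waiter->task);
	}

so after up_console_sem() all we know is that somebody has been woken,
not that it is running.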

so I think I'll drop that "wait for a new console_sem owner" thing
after all. not because we still can't agree on it, but because maybe
it's not worth it. sorry it took me so long.


> 	The original code flushed the messages
> 	all the time until all were flushed. (Except
> 	if it was allowed to sleep in console_unlock).

but any console_lock()/console_trylock() call from any other CPU
(fbcon, tty, etc.) prevents a complete flush of the pending messages
until that other CPU invokes console_unlock(). we don't introduce
anything new here.
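
that's basically what the retry check at the tail of console_unlock()
is about -- roughly, paraphrasing the current code from memory:

	/* tail of console_unlock(), roughly */
	raw_spin_lock(&logbuf_lock);
	retry = console_seq != log_next_seq;	/* new messages arrived meanwhile? */
	raw_spin_unlock_irqrestore(&logbuf_lock, flags);

	/* trylock fails if another console_sem contender got there first */
	if (retry && console_trylock())
		goto again;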

> Regression 2: printk_kthread() takes console_lock() in
> 	   preemptible context => is allowed to sleep
> 	   and nobody is able to get console_lock
> 	   in the meantime => slowdown and risk of
> 	   lost messages.

do you mean scheduling under console_sem? but this is true for
console_unlock() in general. we already schedule with console_sem,
quite a lot. I can't see how printk_kthread makes it any worse.

> In addition, the printk kthread might not be scheduled
> at all when:
> 
>   + It is assigned to the same CPU as the current
>     console_lock() owner.
> 
>   + It is assigned to a CPU where some higher priority
>     task got mad.

need to think more.

but we probably can forbid preemption in console_unlock() if atomic
limit is not 0. it kinda makes sense to me.
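
something along these lines, maybe (just a sketch, reusing the
`atomic_print_limit' name from this series):

	console_unlock()
	{
		if (atomic_print_limit)
			preempt_disable();	/* don't schedule away while printing */

		/* ... print up to atomic_print_limit lines, offload the rest ... */

		if (atomic_print_limit)
			preempt_enable();
	}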

> Finally, it might be scheduled too late when we already
> handled too many lines over the atomic limit and switched
> to the emergency mode.

the possibility of lockup still exists, yes. I never claimed
otherwise, and that is the part I'm not happy with.

in general,

offloading based on an atomic limit watermark does not magically
address lockups. the limit can simply be too high. so *maybe* we
need a `watchdog_thresh'-based check at the end of the day and
drop the atomic limit sysfs knob.

	console_unlock()
	{
		u64 start = local_clock();

		for (;;) {
			/* ... print one line ... */

			/* offload once we've spent half the watchdog window here */
			if (local_clock() - start > watchdog_thresh * NSEC_PER_SEC / 2)
				offload-and-return;
		}
	}

which, once again, won't fully address every single case possible (can
we even do it?). the problem here is that we only know how much time we
spent in console_unlock(), but we have no idea how much time the CPU had
been running with preemption or local IRQs disabled before it got into
console_unlock().

it's the watchdog that knows that information - touch_ts, etc. so maybe
we need to make printk more "watchdog aware" and make offloading
decisions based on is_softlockup()/etc. which may look like an
unnecessary dependency/coupling, but this dependency/coupling is
already there, in fact - the lockups that we are trying to avoid are
reported by the watchdog. so the watchdog may hint/help printk, not
just report the lockup and kill the system.

IOW, the watchdog timeout seems to be the only reliable number to
look at when we want to know if a lockup is about to happen.
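
very roughly, something like the below (all names hypothetical;
is_softlockup()/get_timestamp()/watchdog_touch_ts are internal to
kernel/watchdog.c today, so the watchdog would have to export some
helper):

	/* hypothetical helper on top of the softlockup detector's bookkeeping */
	bool printk_offload_needed(void)
	{
		unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);

		/* half-way to a softlockup report -> stop printing, offload */
		return get_timestamp() - touch_ts > get_softlockup_thresh() / 2;
	}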

need to think more

> How big should the atomic limit be to give the kthread a chance
> to get scheduled in time? It might be pretty complicated
> math. It will depend on the hardware, system load, the
> scheduler in use, namespace limitations, ...

yep yep, that's why there is a sysfs knob. there is no single "right
number" for everyone :)
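
(for the record, the knob itself is trivial to wire up; one common way
-- the exact name/permissions in the patch set may differ:)

	/* shows up as /sys/module/printk/parameters/atomic_print_limit */
	static unsigned int atomic_print_limit;
	module_param(atomic_print_limit, uint, 0644);
	MODULE_PARM_DESC(atomic_print_limit,
		"lines printed in atomic context before offloading to printk_kthread");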

> > we may (and need to) improve the retry path in console_unlock().
> > but we must not leave it until other process locks the console_sem.
> 
> Several possibilities came to my mind but they all have problems.
> Let me explain.
> 
> 
> 1. Make printk_kthread() special and avoid sleeping with
>    console_lock() taken => force offload when need_resched
>    gets true. But this still keeps regression 1.
> 
>    Also it means that we are offloading without guarantees.
>    But the same is true for sleeping with console_lock.
> 
> 
> 2. Increase the chance that other printk() callers will take
>    over console_lock():
> 
>      + Add delay before console_trylock() in console_unlock().
>        But this is very ugly. Waste of resources => no way

yeah, I thought about it.

>      + console_unlock() might signal offloading and
>        the printk() caller might call console_lock() instead
>        of console_trylock().
> 
>        It must be only one printk() caller (first one?). It does
>        not make sense to block all CPUs and the entire system.
> 
>        Anyway, console_lock() might sleep and printk() does not know
>        the context => no way
> 
>        The only chance is that we combine this with throttling
>        tasks that do too many printk() calls. But it is still
>        ugly and hard to predict.
> 
> 
>      + Replace console_lock() with spin lock.
>        But this would require complete rework/clean up
>        => task for years

yep.

> Did I miss a nice trick?
> 

I need more time to answer :)


> > >    Another detailed description of this problem can be found
> > >    in my previous mail, see
> > >    https://lkml.kernel.org/r/20170628121925.GN1538@pathway.suse.cz
> > > 
> > > 
> > > 2. It adds a rather complex dependency on the scheduler. I know
> > >    that my simplified solution does this as well, but in another way[*].
> > >    Let me explain. I would split the dependency into the code
> > >    side and the behavior side.
> > > 
> > >    From the code side: The current printk() calls wake_up_process()
> > >    and we need to use printk_deferred() in the related scheduler code.
> > >    This patch does this as well, so there is no win and no loss.
> > >    Well, you talk about changing the affinity and other tricks
> > >    in the other mails. This might add more locations where
> > >    printk_deferred() would be needed.
> > 
> > we are in printk_safe all the way through console_offload_printing(),
> > the context is derived from console_unlock().
> > 
> > why we would need printk_deferred()?
> 
> I think that the reason is the same as why we still need
> printk_deferred() now.
> 
> printk_safe() helps only when the recursion starts inside
> printk(). It fails when printk() is called from a function
> that is used inside printk().

what I meant is that in console_unlock()/console_offload_printing() we
are in printk_safe. any printk from here will not end up in vprintk_emit().
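
(more precisely, vprintk_func() looks at the per-CPU printk_context and
redirects to the per-CPU buffers -- roughly, paraphrasing
kernel/printk/printk_safe.c:)

	__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
	{
		/* NMI: store into the per-CPU NMI buffer */
		if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
			return vprintk_nmi(fmt, args);

		/* printk_safe section: store into the per-CPU safe buffer */
		if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
			return vprintk_safe(fmt, args);

		/* normal path */
		return vprintk_default(fmt, args);
	}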

> > >    From the behavior side: The current code wakes the process
> > >    and is done. The code in this patch wakes the process and
> > >    waits until it[**] gets CPU and really runs. It switches to
> > >    the emergency mode when the other process does not run in time.
> > >    In other words, this patch depends on more actions done
> > >    by the scheduler and changes behavior based on it. IMHO,
> > >    this will be very hard to code, tune, and debug.
> > >    A proper solution might require more code dependency.
> > > 
> > >    [*] My solution depends on the scheduler in the sense
> > >        that messages will get lost when nobody else will take
> > >        over the console job.
> > 
> > which is precisely and exactly the thing that we should never
> > let happen. there is no _win_, because we _lost_ the messages.
> 
> I do not see a way to avoid this at the moment. Therefore I
> suggest changes that might reduce the possible damage as much
> as possible:
> 
>    + reduce sleeping with console_lock() => this will actually
>      increase the chance to flush the messages in panic()
> 
> 
>    + use a high enough atomic_print_limit so that we prevent
>      blocking in atomic context for seconds but trigger
>      offloading only in very rare situations
> 
>      Maybe we could use a number of seconds for the limit
>      instead of a number of lines.

oh, nice. I talked about it a bit earlier in this mail.

>      Note that I still consider a flood of messages a sign of
>      a broken system. If anyone gets into that situation and loses
>      messages, they should fix/reduce the flood first. The atomic
>      limit should flush enough messages to do so.

good that you mentioned that. I was going to ask if we can leave
the "flood" case behind. I agree that this case is broken and printk
can't help with it. if messages can be ratelimited, then they better
be ratelimited. it's the case where we can't ratelimit and begin to
lock up the system that we are trying to fix. for example, a massive
show_state_filter() over a slow serial console.
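
(and for the callers that *can* be ratelimited we already have the
usual tools, e.g.:)

	/* per-callsite ratelimit: at most 10 messages per 5 seconds by default */
	printk_ratelimited(KERN_WARNING "too chatty\n");

	/* or an explicit ratelimit state shared by several call sites */
	static DEFINE_RATELIMIT_STATE(rs, 5 * HZ, 10);

	if (__ratelimit(&rs))
		printk(KERN_WARNING "still too chatty\n");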


>    + allow throttling the printk() callers that cause the flood;
>      this looks like a win-win solution; it handles the root
>      of the problem

it's tricky. I would like to throttle, but it's tricky. need to
think more about how we can make it more or less reliable.

	local_irq_disable();
	printk();
	...
	printk();		<< throttle
	local_irq_restore();	<< lockup


calling log_store() faster than another CPU can call_console_drivers()
is bad, but a hard lockup is even worse.
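
if we ever do throttle, it would probably have to be context-aware; a
completely made-up sketch just to show the problem
(printk_caller_over_limit() does not exist):

	/* hypothetical: only throttle callers that are actually allowed to sleep */
	if (printk_caller_over_limit(current) && preemptible())
		msleep(1);	/* impossible under local_irq_disable()/spinlocks */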

OK, I'll re-shape and re-work the patch set.
will push it to my tree sometime this week.


>    + eventually write on the console that offloading
>      has happened, so that we could see when it failed
> 
>    + add the early printk stuff from Peter Zijlstra so that
>      developers have another chance to debug these problems
> 
>      This is needed already now when printk() fails.
> 
> 
> Does this really sound that bad?
> 
> Of course, there might be situations where printk() might behave
> worse. But we should still allow offloading to be disabled
> completely for these cases.
> 
> Well, I admit that it is hard to be sure. printk() is used
> in too many situations. But we either want to do something
> or we could just keep the status quo.
> 
> 
> > 
> > > 3. The prevention of soft-lockups is questionable. If you are in
> > >    soft-lockup prone situation, the only prevention is to do an
> > >    offload. But if you switch to the emergency mode and give
> > >    up offloading too early, the new code stops preventing
> > >    the softlockup.
> > > 
> > >    Of course, the patchset does not make it worse but the question
> > >    is how much it really helps. It would be bad to add a lot of
> > >    code/complexity with almost no gain.
> > > 
> > > 
> > > IMHO, if we try to solve the 1st problem (chance of offloading),
> > > it might add a risk of deadlocks and/or make the 2nd problem
> > > (dependency on scheduler) worse. Also I am afraid that we would
> > > repeat many dead ways already tried by Jan Kara.
> > 
> > what deadlock?
> 
> The real risk seems to be a missing printk_deferred()
> needed in newly used code.
> 
> I also had some wrongly done hand-shakes of console_lock that
> caused several deadlocks. For example, we might wait in
> console_unlock() for someone who will never be scheduled. Or we
> start waiting in printk() for console_lock() but there
> might already be other waiters in the list. There might
> be some dependency between them. OK, this is rather
> theoretical.
> 
> Anyway, using more code in printk() might cause recursion
> and deadlocks. So, we need to be careful.
> 
> Uff, all this is complicated. I wish I managed to explain
> this more easily.
> 
> Best Regards,
> Petr
> 
> PS: I am going to be off until the next Monday. It might take
> some time until I process all the other mails. I try hard to
> sort the ideas and keep some clear view.

thanks. have fun.

	-ss
