[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171219005248.GA8892@jagdpanzerIV>
Date: Tue, 19 Dec 2017 09:52:48 +0900
From: Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
Tejun Heo <tj@...nel.org>,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
Petr Mladek <pmladek@...e.com>, Jan Kara <jack@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Rafael Wysocki <rjw@...ysocki.net>,
Pavel Machek <pavel@....cz>,
Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCHv6 00/12] printk: introduce printing kernel thread
Hello Steven,
I couldn't reply sooner.
On (12/15/17 10:19), Steven Rostedt wrote:
> > On (12/14/17 22:18), Steven Rostedt wrote:
> > > > Steven, your approach works ONLY when we have the following preconditions:
> > > >
> > > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic,
> > > > etc) context
> > > >
> > > > what does guarantee that? what happens if there is NO non-atomic
> > > > CPU or that non-atomic simplky missses the console_owner != false
> > > > point? we are going to conclude
> > > >
> > > > "if printk() doesn't work for you, it's because you are holding it wrong"?
> > > >
> > > >
> > > > what if that non-atomic CPU does not call printk(), but instead
> > > > it does console_lock()/console_unlock()? why there is no handoff?
>
> The case here, you are talking about a CPU doing console_lock() from a
> non printk() case. Which is what I was asking about how often this
> happens.
I'd say often enough. but the point I was trying to make is that we can
have non-atomic CPUs which can do the print out, instead of "sharing the
load" between atomic CPUs.
> As for why there's no handoff. Does the non printk()
> console_lock/unlock ever happen from a critical location? I don't think
> it does (but I haven't checked). Then it is the perfect candidate to do
> all the printing.
that's right. that is the point I was trying making. we can have better
candidates to do all the printing.
[..]
> > deep-stack spin_lock_irqsave() lockup reports from multiple CPUs (3 cpus)
> > happening at the same moment + NMI backtraces from all the CPUs (more
> > than 3 cpus) that follows the lockups, over not-so-fast serial console.
> > exactly the bug report I received two days ago. so which one of the CPUs
> > here is a good candidate to successfully emit all of the pending logbuf
> > entries? none. all of them either have local IRQs disabled, or dump_stack()
> > from either backtrace IPI or backtrace NMI (depending on the configuration).
> >
>
> Is the above showing an issue of console_lock happening in the non
> printk() case?
>
> > do we periodically do console_lock() on a running system? yes, we do.
> > add to console_unlock()
>
> Right, and the non printk() console_lock() should be fine to do all
> printing when it unlocks.
that's right.
> > this argument is really may be applied against your patch as well. I
> > really don't want us to have this type of "technical" discussion.
>
> Sure, but my patch fixes the unfair approach that printk currently does.
I did tests yesterday, traces are available. I can't conclude that
the patch fixes the unfairness of printk().
> > printk() is a tool for developers. but developers can't use.
> >
> >
> > > > c) the task that is looping in console_unlock() sees non-atomic CPU when
> > > > console_owner is set.
> > >
> > > I haven't looked at the latest code, but my last patch didn't care
> > > about "atomic" and "non-atomic"
> >
> > I know. and I think it is sort of a problem.
>
> Please show me the case that it is. And don't explain where it is.
> Please apply the patch and have the problem occur and show it to me.
> That's all that I'm asking for.
I did some tests yesterday. I posted analysis and traces.
[..]
> No, because it is unrealistic. For example:
right.
> +static void test_noirq_console_unlock(void)
> +{
> + unsigned long flags;
> + unsigned long num_messages = 0;
> +
> + pr_err("=== TEST %s\n", __func__);
> +
> + num_messages = 0;
> + console_lock();
> + while (num_messages++ < max_num_messages)
> + pr_info("=== %s Append message %lu out of %lu\n",
> + __func__,
> + num_messages,
> + max_num_messages);
> +
> + local_irq_save(flags);
> + console_unlock();
>
> Where in the kernel do we do this?
the funny thing is that we _are going to start doing this_ with
the console_owner hand off enabled.
consider the following case
we have console_lock() from non-atomic context. console_sem owner is
getting preempted, under console_sem. which is totally possible and
happens a lot. in the mean time we have OOM, which can print a lot of
info. by the time console_sem returns back to TASK_RUNNING logbuf
contains some number of pending messages [lets say 10 seconds worth
of printing]. console owner goes to console_unlock(). accidentally
we have printk from IRQ on CPUz. console_owner hands over printing
duty to CPUz. so now we have to print 10 seconds worth of OOM messages
from irq.
CPU0 CPU1 ~ CPUx CPUz
console_lock
<< preempted >>
OOM OOM printouts, lots
of OOM traces, etc.
OOM end [progress done].
<< back to RUNNING >>
console_unlock()
for (;;)
sets console_owner
call_console_drivers() IRQ
printk
sees console_owner
sets console_waiter
clears console_owner
sees console_waier
handoff
for (;;) {
call_console_drivers()
??? lockup
}
up()
this is how we down() from non-atomic and up() from atomic [if we make
it to up(). we might end up in NMI panic]. this scenario is totally possible,
isn't it? the optimistic expectation here is that some other printk() from
non-atomic CPU will jump in and take over printing from atomic CPUz. but I
don't see why we are counting on it.
-ss
Powered by blists - more mailing lists