Message-ID: <CAJmjG2-e6f6p=pE5uDECMc=W=81SYyGCmoabrC1ePXwL5DFdSw@mail.gmail.com>
Date: Sun, 21 Oct 2018 11:09:22 -0700
From: Daniel Wang <wonderfly@...gle.com>
To: sergey.senozhatsky.work@...il.com
Cc: Petr Mladek <pmladek@...e.com>, rostedt@...dmis.org,
stable@...r.kernel.org, Alexander.Levin@...rosoft.com,
akpm@...ux-foundation.org, byungchul.park@....com,
dave.hansen@...el.com, hannes@...xchg.org, jack@...e.cz,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Mel Gorman <mgorman@...e.de>, mhocko@...nel.org, pavel@....cz,
penguin-kernel@...ove.sakura.ne.jp, peterz@...radead.org,
tj@...nel.org, torvalds@...ux-foundation.org, vbabka@...e.cz,
Cong Wang <xiyou.wangcong@...il.com>,
Peter Feiner <pfeiner@...gle.com>
Subject: Re: 4.14 backport request for dbdda842fe96f: "printk: Add console
owner and waiter logic to load balance console writes"
Just got back from vacation. Thanks for the continued discussion. Just so
I understand the current state: it looks like we have a pretty good explanation
of what's going on (though we're not completely sure), and backporting Steven's
patches is still the way to go? I see that Sergey has sent an RFC series
for something similar. Is that series trying to solve the deadlock problem in a
different way?
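
For my own notes while re-reading the thread: the way I understand Steven's
owner/waiter handoff, a CPU that wants the console while another CPU owns it
marks itself as a waiter, and the owner hands printing over between records,
so no single CPU is stuck doing all the output. Here is a toy userspace
paraphrase of that idea - all names and structure below are mine, not the
actual printk code:

/*
 * Toy model of the console owner/waiter handoff. Two "CPUs" (threads)
 * compete for a console lock; the owner checks for a waiter after every
 * record and bails out so the waiter can take over.
 * Build with: gcc -pthread handoff.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_RECORDS 16

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool console_waiter;
static atomic_int next_record;

/* Print records until we run out or until someone else wants the console. */
static void print_records(long id)
{
    int r;

    while ((r = atomic_fetch_add(&next_record, 1)) < NR_RECORDS) {
        printf("cpu%ld: record %d\n", id, r);
        if (atomic_load(&console_waiter))
            return;        /* hand over: the waiter takes the lock next */
    }
}

static void *cpu(void *arg)
{
    long id = (long)arg;

    while (atomic_load(&next_record) < NR_RECORDS) {
        if (pthread_mutex_trylock(&console_lock) == 0) {
            print_records(id);
            pthread_mutex_unlock(&console_lock);
        } else {
            /* Only one thread becomes the waiter at a time. */
            bool expected = false;

            if (atomic_compare_exchange_strong(&console_waiter,
                                               &expected, true)) {
                pthread_mutex_lock(&console_lock);
                atomic_store(&console_waiter, false);
                print_records(id);
                pthread_mutex_unlock(&console_lock);
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];

    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, cpu, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

(The real patch hands over console_sem ownership directly, with the waiter
spinning with interrupts disabled; the sketch only shows the "stop printing
when someone is waiting" part.)
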
On Thu, Oct 4, 2018 at 1:55 AM Sergey Senozhatsky
<sergey.senozhatsky.work@...il.com> wrote:
>
> On (10/04/18 10:36), Petr Mladek wrote:
> >
> > This looks like a reasonable explanation of what is happening here.
> > It also explains why the console owner logic helped.
>
> Well, I'm still a bit puzzled, frankly speaking. I've two theories.
>
> Theory #1 [most likely]
>
> Steven is a wizard and his code cures whatever problem we throw at it.
>
> Theory #2
>
> console_sem handover actually spreads the printing out, so we don't have one
> CPU doing all the printing work. Instead, every CPU prints its own backtrace,
> while the CPU which issued all_cpus_backtrace() waits for them. So
> all_cpus_backtrace() still has to wait for NR_CPUS * strlen(backtrace), which
> still probably triggers an NMI panic on it at some point. The panic CPU sends
> out the stop IPI, then it waits for foreign CPUs to ACK the stop IPI request -
> for 10 seconds. So each CPU prints its backtrace, then ACKs the stop IPI. So
> when the panic CPU proceeds with flush_on_panic() and emergency_reboot(),
> uart_port->lock is unlocked. Without the patch we probably declare NMI panic
> on the CPU which does all the printing work, and panic sometimes jumps in when
> that CPU is busy in serial8250_console_write(), holding the uart_port->lock.
> So we can't re-enter the 8250 driver from the panic CPU and we can't reboot
> the system. In other words... Steven is a wizard.
>
> > > serial8250_console_write()
> > > {
> > > if (port->sysrq)
> > > locked = 0;
> > > else if (oops_in_progress)
> > > locked = spin_trylock_irqsave(&port->lock, flags);
> > > else
> > > spin_lock_irqsave(&port->lock, flags);
> > >
> > > ...
> > > uart_console_write(port, s, count, serial8250_console_putchar);
> > > ...
> > >
> > > if (locked)
> > > spin_unlock_irqrestore(&port->lock, flags);
> > > }
> > >
> > > Now... the problem. A theory, in fact.
> > > panic() sets oops_in_progress back to zero - bust_spinlocks(0) - too soon.
> >
> > I see your point. I am just a bit scared of going this way. Ignoring locks
> > is a dangerous and painful approach in general.
>
> Well, I agree. But 8250 is not the only console which sometimes ignores the
> uart_port lock state. Otherwise sysrq would be totally unreliable, including
> emergency reboot. So it's sort of how things have been for quite some time,
> I guess. We are in panic(), it's over, so we can probably ignore
> uart_port->lock at this point.
>
> -ss
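
One thing that made the trylock-under-oops pattern quoted above click for me
was mocking it up in userspace. This is my own sketch, not the 8250 driver:
sysrq handling is dropped and a pthread mutex stands in for uart_port->lock:

/*
 * Sketch of the oops_in_progress trylock pattern. The "panic path"
 * re-enters console_write() while the lock is already held; a plain
 * lock would deadlock, but the trylock skips locking and still prints.
 * Build with: gcc -pthread trylock.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t port_lock = PTHREAD_MUTEX_INITIALIZER;
static bool oops_in_progress;

static void console_write(const char *s)
{
    bool locked = true;

    if (oops_in_progress)
        locked = (pthread_mutex_trylock(&port_lock) == 0);
    else
        pthread_mutex_lock(&port_lock);

    puts(s);                /* stand-in for uart_console_write() */

    if (locked)
        pthread_mutex_unlock(&port_lock);
}

int main(void)
{
    console_write("normal path takes the lock");

    /* Simulate panicking while port_lock is already held. */
    pthread_mutex_lock(&port_lock);
    oops_in_progress = true;
    console_write("panic path: trylock fails, we print anyway");
    oops_in_progress = false;
    pthread_mutex_unlock(&port_lock);
    return 0;
}

Which also shows why bust_spinlocks(0) being called too soon matters: once
oops_in_progress goes back to zero, the next console_write() from the panic
CPU takes the blocking path again.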
--
Best,
Daniel