linux-kernel - Re: [PATCH] serial: 8250: Avoid "too much work" from bogus rx timeout interrupt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1482168788.9552.100.camel@linux.intel.com>
Date:   Mon, 19 Dec 2016 19:33:08 +0200
From:   Andy Shevchenko <andriy.shevchenko@...ux.intel.com>
To:     Doug Anderson <dianders@...omium.org>,
        Peter Hurley <peter@...leysoftware.com>,
        california.l.sullivan@...el.com,
        "Liakhovetski, Guennadi" <guennadi.liakhovetski@...el.com>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>, jslaby@...e.com,
        Brian Norris <briannorris@...omium.org>,
        "open list:ARM/Rockchip SoC..." <linux-rockchip@...ts.infradead.org>,
        陈渐飞 <jeffy.chen@...k-chips.com>,
        高美才 <eric.gao@...k-chips.com>,
        phillip.raffeck@....de, anton.wuerfel@....de,
        yegorslists@...glemail.com, matwey@....msu.ru,
        tthayer@...nsource.altera.com, linux-serial@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] serial: 8250: Avoid "too much work" from bogus rx
 timeout interrupt

On Mon, 2016-12-19 at 09:12 -0800, Doug Anderson wrote:
> Hi,
> 
> On Mon, Dec 19, 2016 at 4:59 AM, Andy Shevchenko
> <andriy.shevchenko@...ux.intel.com> wrote:
> > On Sun, 2016-12-18 at 17:14 -0800, Douglas Anderson wrote:
> > > On a Rockchip rk3399-based board during suspend/resume testing, we
> > > found that we could get the console UART into a state where it
> > > would
> > > print this to the console a lot:
> > >   serial8250: too much work for irq42
> > 
> > Have you read the following discussion
> > https://www.spinics.net/lists/kernel/msg2059543.html
> 
> No, I wasn't aware of that discussion.  Yup, basically the exact same
> thing is happening here.  Good to know I'm not alone.  Any idea if the
> Baytrail UART is also based on DesignWare IP?

Yes. Almost all Intel HW is using DesignWare IP for HS UARTs.

> In that thread, Peter said:
> 
> > I think there is every likelihood of spurious RX timeout interrupts
> > tripping this patch, sorry.
> > 
> > Unfortunately, I think UART_BUG_ is the only viable possibility.
> > Or perhaps fixing the port type as PORT_8250 (thus disabling the
> > fifos).
> 
> My change is slightly different than California's in that I'm actually
> throwing away the bogus byte and his patch was treating it as a valid
> byte.  I don't know if that makes the patch more or less palatable.

We need to test, especially in DMA case.

> I would hate to lose access to the FIFOs just due to this weird corner
> case.
> 
> Do we really think there's a case where there's an RX Timeout
> interrupt w/ no "data ready" but that later the data ready will show
> up?  Can you quantify how much later you think it will show up?  If we
> can quantify how much longer the data will show up in then we should
> probably just do a timeout loop right where I added my patch.
> 
> Specifically, here's what's happening today with RX Timeout interrupt
> without "data ready":
> 
> 1. We'll get the interrupt
> 2. We won't do _anything_ to service the interrupt.
> 3. We'll return back to serial8250_interrupt(), where we'll keep
> looping until we get "too much work"
> 4. We'll break out, but the interrupt will still be active.
> 5. Go back to #1
> 
> ...and since this interrupt will keep firing and firing and firing
> with no delay in-between, we'll effectively lock the CPU up.

And the root cause of that is... ?

> If there are some UARTs that eventually get themselves out of this
> state by asserting "data ready" then the above won't be an "infinite"
> loop but it will effectively be a tight loop where we won't let
> userspace run and won't service other interrupts until we actually get
> the data ready.  Since we're already blocking everything else, it
> seems like it might be better to directly loop in
> serial8250_handle_irq() with a timeout of some sort (how long?  100
> us?  1 ms?).  Then we if we get the timeout then we can do the read
> and safely work ourselves free.

What I think is that the root cause of this is still unknown and either
above looks like a hack.

-- 
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>
Intel Finland Oy