lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 23 Sep 2015 09:14:09 +0100
From:	David Woodhouse <dwmw2@...radead.org>
To:	David Miller <davem@...emloft.net>
Cc:	netdev@...r.kernel.org, romieu@...zoreil.com
Subject: Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return,
 to survive IRQ storms

On Tue, 2015-09-22 at 16:45 -0700, David Miller wrote:
> And if we are getting Rx* interrupts with napi_schedule_prep()
> returning false, that's a serious problem.  It can mean that the TX
> timeout handler's resetting of the chip is either miscoded or is
> racing with either NAPI polling or this interrupt handler.

Such bugs happen. That's why we have IRQ-storm detection, to help
protect us.

> And if that's the case your patch is making the chip's IRQ line get
> disabled when this race triggers.
> 
> This change is even worse, in my opinion, if patch #2 indeed makes
> the problem go away.

Even worse than what?

When a bug like the one I've fixed in #2 happens, let's look at the
options:

  - With this patch, the IRQ storm is detected. We disable the IRQ 
    line and the driver continues to work, albeit with a higher
    latency.

  - Without this patch, the box just dies completely. First the 
    hardware watchdog was triggering an apparently spontaneous reset. 
    Then when I disabled the hardware watchdog, the machine locked up 
    and doesn't even respond to serial sysrq. Eventually I got a
    little bit of sense out of it using the NMI watchdog.

Trust me, the disabled IRQ you get with my patch really *isn't* worse
than what happened without it :)

If cp_interrupt() had already worked the way I suggest, I would
literally have two days of my life back right now. I spent a *lot* of
time working out that it was dying in an interrupt storm. And trying to
find the underlying problem while having to physically visit it under
the stairs and reset it all the time.

In fact, I've been seeing undebuggable spontaneous resets of this
router for a while now. If I'd had an 'IRQ disabled' backtrace and a
smoking gun, I might have been able to fix it a long time ago.

I even discounted the IRQ-storm hypothesis, early on, on the basis that
"Linux handles those quite gracefully now, and disables the interrupt".
Except of course *that* relies on the IRQ handler returning an
appropriate value. So it doesn't work with 8139cp as things stand.

That's why I'm really quite keen on fixing cp_interrupt() to be more
defensive in when it claims to have handled the interrupt.

Surely that's *why* we have the IRQ-storm detection in the first place
— to help us survive precisely this situation. IRQ storms are either
caused by software bugs like the one fixed in patch #2, or hardware
misbehaviour. And if IRQ handlers blindly return 'handled' just because
the hardware actually *did* assert an interrupt, regardless of whether
they've made any *progress*, then we don't get that protection in a
*lot* of the situations where we'd want it.

But sure, for now I'll drop this from the series and I can try to
convince you separately. In fact I think it only affects the last patch
in the series which reduces the banging on TxPoll. And I'm going to
drop that one too for now anyway. I'll repost shortly. And I think I'll
add a new one at the end, enabling TX checksums, SG and TSO by default.

-- 
dwmw2

Download attachment "smime.p7s" of type "application/x-pkcs7-signature" (5691 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ