netdev - Re: [PATCH 0/1] NIU: fix spurious interrupts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <4A1D6A72.5070801@windriver.com>
Date:	Wed, 27 May 2009 12:29:38 -0400
From:	"Hong H. Pham" <hong.pham@...driver.com>
To:	David Miller <davem@...emloft.net>
CC:	netdev@...r.kernel.org, matheos.worku@....com
Subject: Re: [PATCH 0/1] NIU: fix spurious interrupts

David Miller wrote:
> I wonder if it's the niu_interrupt path, and all the v0 bits are
> clear.  Yeah, I bet that's it.  We're taking some slowpath interrupt
> for RX or TX counter overflows or errors, and then we try to rearm the
> LDG even though we're already handling normal RX/TX via NAPI.
> 
> But that shouldn't happen, the thing that went into RX/TX NAPI work
> should have turned those interrupt off.  We handle RX normal work and
> error interrupts in the same LDG, and similar for TX, and thus using
> the same interrupt.
> 
> Can you check to see who calls niu_ldg_rearm() when we see it trigger
> the interrupt with NAPI already scheduled?  That will help narrow this
> down even further.  Probably the best thing to do is to get a full
> stack trace using show_stack() or dump_stack().
> 
> This is looking more and more like a driver bug at this point.

I've added a check for v0 being zero in niu_interrupt().  I have not seen
this check succeed, so niu_ldg_rearm() is never invoked prior to or
during a spurious interrupt in niu_interrupt().  So it seems the only
other place where the LDG is rearmed during network activity is in
niu_poll(), which is expected.

Here's something else that doesn't make sense.  I've added a check
in __niu_fastpath_interrupt() and niu_rx_work() to see if the RCR qlen
is zero, i.e. we're scheduling NAPI with no work.  This could happen if
there's a spurious interrupt, but the interrupt is not a run away that
would hang the CPU.  This check happens quite frequently, much more than
the spurious interrupt check.  Also, it happens with both the PCI-E and
XAUI cards (I could not reproduce run away spurious interrupts with a
PCI-E card).

This is a log from a test with 1 TCP stream.  If there's no work to do,
one would expect to see for each __niu_fastpath_interrupt(), there is only
one corresponding niu_rx_work().  i.e. work_done in niu_poll is 0, so NAPI
is done and should not have been rescheduled.  The check for spurious
interrupts in niu_schedule_napi() is not reached.

[  474.387984] niu: __niu_fastpath_interrupt() CPU=58 rx_channel=12 qlen is 0!
[  474.388009] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  474.663008] niu: __niu_fastpath_interrupt() CPU=58 rx_channel=12 qlen is 0!
[  474.663034] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  474.805663] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  475.657501] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  476.139072] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  476.264352] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  476.474596] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!
[  476.698176] niu: __niu_fastpath_interrupt() CPU=58 rx_channel=12 qlen is 0!
[  476.698201] niu: niu_rx_work() CPU=58 rx_channel=12 qlen is zero!

Here's another log excerpt with 1 TCP stream.  In this case NAPI is being
rescheduled multiple times with no work to do.  Again, run away spurious
interrupts did not happen.

[ 1098.870170] niu: niu_rx_work() CPU=50 rx_channel=11 qlen is zero!
[ 1098.879759] niu: __niu_fastpath_interrupt() CPU=50 rx_channel=11 qlen is 0!
[ 1098.904478] niu: __niu_fastpath_interrupt() CPU=50 rx_channel=11 qlen is 0!
[ 1098.908674] niu: __niu_fastpath_interrupt() CPU=50 rx_channel=11 qlen is 0!
[ 1098.908703] niu: niu_rx_work() CPU=50 rx_channel=11 qlen is zero!

In any case, interrupts with qlen being 0 always happen before we see the
run away spurious interrupts (but that could be due to the latter happening
a lot less frequently).

Regards,
Hong

View attachment "niu-debug-spurious-interrupts.patch" of type "text/plain" (2870 bytes)