Date:	Mon, 22 Jun 2009 00:27:22 -0700 (PDT)
From:	David Miller <davem@...emloft.net>
To:	linux-kernel@...r.kernel.org
CC:	sparclinux@...r.kernel.org
Subject: NMI watchdog + NOHZ question


If some expert in this area can help, I'd appreciate it.
I'll note up front that I've only investigated this issue
thoroughly on vanilla 2.6.29.

In 2.6.29 we added an NMI watchdog timer to sparc64. It
operates identically to the x86 one, except that
it's on by default :-)

When the qla2xxx driver is built into the kernel statically,
the firmware load causes an NMI watchdog timeout.

The qla2xxx driver is fine. It only disables interrupts for very
short periods to program the chip registers, telling the chip to
load a few blocks of the firmware via DMA or similar.

Then it waits for the interrupt that signals the partial firmware
load is done, using wait_for_completion_timeout() (see
qla2x00_mailbox_command() in drivers/scsi/qla2xxx/qla_mbx.c).

Assuming NOHZ is enabled, what if qla2xxx driver init is the only
running task on a cpu, no timers (at least for 5 seconds, the NMI
timeout) are due to fire, and the qla2xxx code loops in this manner
for more than 5 seconds loading the firmware?

As far as I can see, the NOHZ code has no reason to start the timer
firing again in this situation.

So we'll just loop continuously into the scheduler (to wait for
the qla2xxx driver completion).  I believe the events trigger quickly
enough that need_resched() is not true if the scheduler even makes
it to the idle thread.

So the sequence seems to be scheduling in and out of a pure kernel
thread, with no pending non-scheduler timers for a long time, and all
this happening for longer than the NMI watchdog timeout, with NOHZ
enabled.
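
The suspected sequence can be modelled in user space. The sketch
below is a toy, not the sparc64 or x86 watchdog code: it assumes the
watchdog judges liveness only by whether a per-cpu timer interrupt
count has advanced between NMIs, and that one NMI arrives per
simulated second. All names (cpu_model, timer_tick, nmi_check,
simulate) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a tick-based NMI watchdog; all names are hypothetical. */
struct cpu_model {
    unsigned long timer_irqs;   /* bumped by every timer tick on this cpu */
    unsigned long last_seen;    /* count the watchdog saw at the last NMI */
    unsigned int stalled_nmis;  /* consecutive NMIs with no new tick */
};

/* Timer interrupt: the only event this model counts as progress. */
static void timer_tick(struct cpu_model *cpu)
{
    cpu->timer_irqs++;
}

/* NMI handler: returns true when the watchdog would declare a lockup. */
static bool nmi_check(struct cpu_model *cpu, unsigned int timeout_nmis)
{
    assert(timeout_nmis > 0);
    if (cpu->timer_irqs != cpu->last_seen) {
        /* A tick arrived since the last NMI, so the cpu looks alive. */
        cpu->last_seen = cpu->timer_irqs;
        cpu->stalled_nmis = 0;
        return false;
    }
    return ++cpu->stalled_nmis >= timeout_nmis;
}

/*
 * Simulate `seconds` of time with one NMI per second.
 * tick_stopped models NOHZ having stopped the tick with no timer due.
 * Returns the second at which the watchdog fires, or -1 if it never does.
 */
static int simulate(unsigned int seconds, bool tick_stopped,
                    unsigned int timeout_nmis)
{
    struct cpu_model cpu = {0, 0, 0};

    for (unsigned int s = 1; s <= seconds; s++) {
        /* The cpu may be sleeping, waking and scheduling here, but
         * none of that work is visible to the watchdog. */
        if (!tick_stopped)
            timer_tick(&cpu);
        if (nmi_check(&cpu, timeout_nmis))
            return (int)s;
    }
    return -1;
}
```

Under those assumptions, simulate(10, true, 5) reports a lockup at
second 5 even though the modelled cpu is never actually stuck, while
simulate(10, false, 5) never fires. Whether the real NOHZ code can
leave the tick stopped in this situation is exactly the open question.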

I'll note that adding printk's (this is a serial console) to the
qla2xxx mailbox command code makes the NMI watchdog problem go away :)
But if I only put printk's around the entire firmware loading
sequence, the NMI watchdog does trigger.

Is there something fundamental that should be preventing this?