netdev - Re: watchdog timeout panic in e1000 driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4574C14E.2060108@intel.com>
Date:	Mon, 04 Dec 2006 16:46:06 -0800
From:	Auke Kok <auke-jan.h.kok@...el.com>
To:	Kenzo Iwami <k-iwami@...jp.nec.com>
CC:	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
	Shaw Vrana <shaw@...nix.com>, netdev@...r.kernel.org,
	"Ronciak, John" <john.ronciak@...el.com>
Subject: Re: watchdog timeout panic in e1000 driver

Kenzo Iwami wrote:
> Hi,
> 
>>> Doesn't this just mean that we need a spinlock or some other kind of
>>> semaphore around acquiring, using, and releasing this resource?  We keep
>>> going around and around about this but I'm pretty sure spinlocks are
>>> meant to be able to solve exactly this issue.
>>>
>>> The problem is going to get considerably more nasty if we need to hold a
>>> spinlock with interrupts disabled for a significant amount of time, at
>>> which point a semaphore of some kind with a spinlock around it would
>>> seem to be more useful.
>> Even if spin_lock() was used to protect this resource, it is still possible
>> for an interrupt to kick in and call e1000_watchdog. In this case,
>> e1000_get_software_semaphore() will be called from within the interrupt
>> handler and the problem will still occur.
>>
>> In order to solve this problem, interrupt should be disabled (for example,
>> spin_lock_irqsave).
>> The interrupt handler can't run while the process is holding this resource,
>> and this problem doesn't occur.
>>
>>> I'll work with Auke to see if we can come up with another try.
>> Do you have any updates about your test code?
> 
> Does the fix I previously proposed have problems?
> If it does, I'd like to help find investigate another fix to solve
> this problem.

There are several issues that are conflicting and mixing that make it less than 
intuitive to decide what the better fix is.

Most of all, we discussed that adding a spinlock is not going to fix the underlying 
problem of contention, as the code that would need to be spinlocked can sleep. Not a 
good thing.

Adding state tracking code in the form of atomics might solve the issue too, but then we 
need to do this in quite a few locations. And it comes down to the fact that we really 
want all users of the semaphore to halt in case it is in use.

Reducing the swfw semaphore time is a usefull exercise, but requires an amazing amount 
of changes to all of the phy code to make sure we're not locking it too long, and even 
then I doubt that we will reduce the maximum lock time to acceptable levels.

The watchdog then, appears to needlessly lock the semaphore every two seconds. this is 
because even though the link is up and we're already setup, we go through the trouble of 
doing all the PHY reads, which are protected by the semaphores.

I'm currently testing a watchdog version which completely bypasses these checks in case 
the MAC didn't detect a link change, and we already are setup completely. In that case, 
all we need to do is update stats and reschedule the timer.

I'll keep you posted on progress.

Cheers,

Auke
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html