[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <4B3BC2DC.7030201@majjas.com>
Date: Wed, 30 Dec 2009 16:15:08 -0500
From: Michael Breuer <mbreuer@...jas.com>
To: Stephen Hemminger <shemminger@...tta.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Berck E. Nash" <flyboy@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
netdev@...r.kernel.org
Subject: Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1
under load (new oops)
On 12/30/2009 3:44 PM, Michael Breuer wrote:
> A couple more observations:
>
> 1) enabling auditd for runlevel 3 mitigates the issue
> 2) starting a remote x session (XDMCP) while under load and while
> auditd is already running also triggers the sky2 interrupt status
> messages - so maybe not tty1 - but some sort of X & auditd
> interaction. Even in this case, the frequency of the error messages is
> much less than when auditd is started in runlevel 5 for the first time.
And just for some certainty - running with audit=0, the errors do not
manifest.
>
> On 12/30/2009 2:15 PM, Michael Breuer wrote:
>> And now looking at audit.c it seems reasonable that there is a race
>> condition when auditd is started at roughly the same time as X. I'm
>> guessing that the kaudit thread is fired up; the tty connected; and
>> at the same time X grabs the tty. Somewhere in there an skb gets
>> hosed and is then reused by whatever comes along - in my case sky2 as
>> that's where the subsequent demand is. If the demand happens first,
>> the contaminated skb (dk in what way yet) is probably waiting to
>> manifest as some other bug that's been frustrating people.
>> On 12/30/2009 12:49 PM, Michael Breuer wrote:
>>> On 12/30/2009 2:58 AM, Stephen Hemminger wrote:
>>>> On Wed, 30 Dec 2009 02:23:20 -0500
>>>> Michael Breuer<mbreuer@...jas.com> wrote:
>>>>
>>>>> Ok - I called dump_txring from sky2_net_intr:
>>>>> --- a/drivers/net/sky2.c
>>>>> +++ b/drivers/net/sky2.c
>>>>> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg)
>>>>> /* Hardware/software error handling */
>>>>> static void sky2_err_intr(struct sky2_hw *hw, u32 status)
>>>>> {
>>>>> - if (net_ratelimit())
>>>>> + if (net_ratelimit()) {
>>>>> dev_warn(&hw->pdev->dev, "error interrupt
>>>>> status=%#x\n", status);
>>>>> + dump_txring(hw, 0);
>>>>> + }
>>>>>
>>>>> if (status& Y2_IS_HW_ERR)
>>>>> sky2_hw_intr(hw);
>>>>>
>>>>> And got this:
>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt
>>>>> status=0x40000008
>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt
>>>>> status=0x40000008
>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30
>>>>> report=29 done=29
>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30
>>>>> report=29 done=29
>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt
>>>>> status=0x8
>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt
>>>>> status=0x8
>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32
>>>>> report=30 done=31
>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32
>>>>> report=30 done=31
>>>>>
>>>> I notice that you have NOUVEAU Nvidia drivers loaded? The one
>>>> difference in HW
>>>> between your board and mine is that I have ATI video card.
>>>>
>>> Seems the problem is linked to auditd and X11 (but not nouveau).
>>>
>>> Today, I ran a bunch of scenarios. I first determined that the
>>> problem only manifest in runlevel 5. Next, this occurred with or
>>> without KMS and with or without nouveau. This happened whether or
>>> not I was logged in (local or remote), and regardless of window
>>> manager (xdm, gdm, kdm). I then checked to see what else was
>>> different between runlevel 3 and 5 - only thing was auditd. I
>>> disabled auditd and reran - no errors.
>>>
>>> Now for the odd stuff:
>>>
>>> The errors only manifest if the high throughput data transfer is
>>> initiated when the system is in runlevel 5 and auditd was started by
>>> init when transitioning from runlevel 3 to 5. For example, the
>>> following scenarios do not cause the errors to manifest:
>>>
>>> runlevel3; start auditd runlevel 5; start transfer
>>> runlevel3; chkconfig auditd off; runlevel5; start auditd; start
>>> transfer
>>> runlevel3; start transfer (note: errors do not occur if I transition
>>> to runlevel 5 after the high bandwidth transfer has started)
>>> runlevel3; startx; start transfer
>>>
>>> The only way I get the problem to manifest is transition to runlevel
>>> 5 with chkconfig auditd on (level 5 only) and then initate the
>>> windows backup.
>>>
>>> I'm guessing that there is some sort of race condition happening
>>> between X (xdm/gdm/kdm/greeter?) and auditd that is somehow
>>> corrupting something. I'd hazard a more or less obvious guess that
>>> whatever's being corrupted differs when there is already a high
>>> throughput transfer under way.
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-kernel" in
>>> the body of a message to majordomo@...r.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists