netdev - Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1 under load (new oops)

Date:	Wed, 30 Dec 2009 15:44:20 -0500
From:	Michael Breuer <mbreuer@...jas.com>
To:	Stephen Hemminger <shemminger@...tta.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	"Berck E. Nash" <flyboy@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	netdev@...r.kernel.org
Subject: Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1
 under load (new oops)

A couple more observations:

1) Enabling auditd for runlevel 3 mitigates the issue.
2) Starting a remote X session (XDMCP) while under load and while auditd 
is already running also triggers the sky2 interrupt status messages, so 
maybe it is not tty1 but some sort of X and auditd interaction. Even in 
this case, the error messages are much less frequent than when auditd 
is started in runlevel 5 for the first time.
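
The two status words in the quoted log further down (0x40000008 and 0x8) 
are easier to compare as bit positions. The following is only a 
user-space sketch, not driver code: set_bits() and print_status() are 
hypothetical helper names, and the mapping from bit positions to the 
Y2_IS_* names is deliberately left to the driver's sky2.h rather than 
reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: store the positions of the set bits of `status`
 * in `out`, lowest bit first. Returns how many bits were set.
 * `out` must have room for 32 entries. */
int set_bits(uint32_t status, int out[32])
{
    int n = 0;

    assert(out != NULL);
    for (int bit = 0; bit < 32; bit++) {
        if (status & (UINT32_C(1) << bit))
            out[n++] = bit;
    }
    return n;
}

/* Hypothetical helper: print one status word as a list of bit positions. */
void print_status(uint32_t status)
{
    int bits[32];
    int n = set_bits(status, bits);

    printf("status=%#x:", (unsigned int)status);
    for (int i = 0; i < n; i++)
        printf(" bit %d", bits[i]);
    printf("\n");
}
```

Calling print_status() on the two logged values shows that 0x40000008 has 
bits 3 and 30 set and 0x8 has only bit 3 set, so the two messages differ 
in bit 30 alone.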

On 12/30/2009 2:15 PM, Michael Breuer wrote:
> And now looking at audit.c it seems reasonable that there is a race 
> condition when auditd is started at roughly the same time as X. I'm 
> guessing that the kaudit thread is fired up; the tty connected; and at 
> the same time X grabs the tty. Somewhere in there an skb gets hosed 
> and is then reused by whatever comes along - in my case sky2 as that's 
> where the subsequent demand is. If the demand happens first, the 
> contaminated skb (I don't know in what way yet) is probably waiting 
> to manifest as some other bug that's been frustrating people.
> On 12/30/2009 12:49 PM, Michael Breuer wrote:
>> On 12/30/2009 2:58 AM, Stephen Hemminger wrote:
>>> On Wed, 30 Dec 2009 02:23:20 -0500
>>> Michael Breuer<mbreuer@...jas.com>  wrote:
>>>
>>>> Ok - I called dump_txring from sky2_err_intr:
>>>> --- a/drivers/net/sky2.c
>>>> +++ b/drivers/net/sky2.c
>>>> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg)
>>>>    /* Hardware/software error handling */
>>>>    static void sky2_err_intr(struct sky2_hw *hw, u32 status)
>>>>    {
>>>> -       if (net_ratelimit())
>>>> +       if (net_ratelimit()) {
>>>>                   dev_warn(&hw->pdev->dev, "error interrupt status=%#x\n", status);
>>>> +               dump_txring(hw, 0);
>>>> +       }
>>>>
>>>>           if (status & Y2_IS_HW_ERR)
>>>>                   sky2_hw_intr(hw);
>>>>
>>>> And got this:
>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29
>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29
>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8
>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8
>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31
>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31
>>>>
>>> I notice that you have NOUVEAU Nvidia drivers loaded? The one 
>>> difference in HW
>>> between your board and mine is that I have ATI video card.
>>>
>> Seems the problem is linked to auditd and X11 (but not nouveau).
>>
>> Today, I ran a bunch of scenarios. I first determined that the 
>> problem only manifests in runlevel 5. Next, this occurred with or 
>> without KMS and with or without nouveau. This happened whether or not 
>> I was logged in (local or remote), and regardless of display manager 
>> (xdm, gdm, kdm). I then checked to see what else was different 
>> between runlevel 3 and 5 - the only thing was auditd. I disabled 
>> auditd and reran - no errors.
>>
>> Now for the odd stuff:
>>
>> The errors only manifest if the high throughput data transfer is 
>> initiated when the system is in runlevel 5 and auditd was started by 
>> init when transitioning from runlevel 3 to 5. For example, the 
>> following scenarios do not cause the errors to manifest:
>>
>> runlevel3; start auditd; runlevel5; start transfer
>> runlevel3; chkconfig auditd off; runlevel5; start auditd; start transfer
>> runlevel3; start transfer (note: errors do not occur if I transition 
>> to runlevel 5 after the high bandwidth transfer has started)
>> runlevel3; startx; start transfer
>>
>> The only way I get the problem to manifest is to transition to 
>> runlevel 5 with chkconfig auditd on (level 5 only) and then initiate 
>> the Windows backup.
>>
>> I'm guessing that there is some sort of race condition happening 
>> between X (xdm/gdm/kdm/greeter?) and auditd that is somehow 
>> corrupting something. I'd hazard a more or less obvious guess that 
>> whatever's being corrupted differs when there is already a high 
>> throughput transfer under way.
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe 
>> linux-kernel" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html