netdev - Re: BUG in netxen_release_tx_buffers when TSO enabled on kernels >= 3.3 and <= 3.6


Message-ID: <1358861524.3464.3768.camel@edumazet-glaptop>
Date:	Tue, 22 Jan 2013 05:32:04 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	christoph.paasch@...ouvain.be
Cc:	Ian Campbell <Ian.Campbell@...rix.com>,
	Sony Chacko <sony.chacko@...gic.com>,
	Rajesh Borundia <rajesh.borundia@...gic.com>,
	David Miller <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: BUG in netxen_release_tx_buffers when TSO enabled on kernels
 >= 3.3 and <= 3.6

On Tue, 2013-01-22 at 11:15 +0100, Christoph Paasch wrote:
> Hello,
> 

Hi Christoph

> I have a scenario where I can trigger a bug on kernels >= 3.3 and <= 3.6. 
> Thus, I can produce it with the latest longterm-stable v3.4.26.
> 
> The crashdumps/warning can be seen below. Sometimes it is only the warning, 
> sometimes it also produces the crash. But, it happens each time I try out my 
> scenario.
> 
> 
> How to reproduce the bug (I have HP Proliant DL165 machines with HP NC375T 1Gb 
> interface):
> 
>  * Launch an iperf-session ( -t 10 ) to a server over a 1Gbps interface.
> 
>  * After 5 seconds on the client, remove the IP-address from the interface
> with ip addr del dev [itf] [ip]
> 
>  * Wait 10 more seconds and kill the iperf on the client and the server.
> 
>  * Then do: ifconfig [itf] down
> 
> Now the crash happens.
> 
> What I observe in netxen_release_tx_buffers is that upon the 18th iteration (j 
> == 17), buffrag->length == 0. buffrag->frag_count is 18.
> Sometimes (much more rarely), buffrag->length instead looks like garbage
> (e.g., larger than 2^32).
> 
> 
> I bisected this, and it was introduced by commit 9d4dde521577 (net: only use a 
> single page of slop in MAX_SKB_FRAGS). 
> It was fixed by Eric in commit 5640f7685831 (net: use a per task frag 
> allocator), which is in kernels > 3.6.
> 

It's a side effect of this patch, as it permits building a full TSO
packet using 2 or 3 frags, instead of 16 to 17 frags.

But you could theoretically still hit the bug if the application uses
several sockets and does short write() calls on them, because each short
write() would use a small frag.
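
To make the frag arithmetic concrete, here is a minimal sketch. It assumes
a 64 KiB TSO payload, 4 KiB page-sized frags for the old behaviour and
32 KiB chunks for the per task frag allocator; those sizes and the helper
name frags_needed() are assumptions for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of frags needed to hold `payload` bytes
 * when each frag holds at most `frag_size` bytes and the data starts
 * `start_off` bytes into the first chunk.
 */
size_t frags_needed(size_t payload, size_t frag_size, size_t start_off)
{
	assert(frag_size > 0 && start_off < frag_size);

	/* Bytes spanned, counted from the start of the first chunk. */
	size_t spanned = payload + start_off;

	/* Round up to a whole number of chunks. */
	return (spanned + frag_size - 1) / frag_size;
}
```

With these assumed sizes, frags_needed(65536, 4096, 0) is 16 and an
unaligned start gives 17, while 32 KiB chunks give 2 (aligned) or 3
(unaligned). Many short write() calls defeat this, since each one can
add its own small frag regardless of the chunk size.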

> As this bug is present in the longterm-stable 3.4, should Eric's patch be 
> backported?

I don't think so. It had some side effects that we are still sorting
out. (see recent splice() fixes for example)

> If not, could somebody (with more knowledge than I have of this part of the 
> code) have a look at it, or maybe give me a pointer on how I could solve 
> this properly?
> 
> Reverting commit 9d4dde521577 (net: only use a single page of slop in 
> MAX_SKB_FRAGS) fixes it for me on 3.4.26.
> 

Something doesn't properly test MAX_SKB_FRAGS; we should track it down
and fix it.
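
As a rough illustration of the kind of test that might be missing, the
sketch below validates a frag count against the array size before
walking the array. Every name here (sketch_frag, sketch_tx_buf,
sketch_release, SKETCH_MAX_SKB_FRAGS) is hypothetical, and the limit of
17 is an assumed value for 4 KiB pages; this is not the netxen code.

```c
#include <stddef.h>

/* Assumption: illustrative frag limit for 4 KiB pages. */
#define SKETCH_MAX_SKB_FRAGS 17

struct sketch_frag {
	unsigned long long dma;   /* non-zero while the frag is mapped */
	unsigned int length;
};

struct sketch_tx_buf {
	/* One slot for the linear head plus one per page frag. */
	struct sketch_frag frag_array[SKETCH_MAX_SKB_FRAGS + 1];
	unsigned int frag_count;
};

/*
 * Release every mapped entry. Returns the number of entries released,
 * or -1 if frag_count would index past the end of frag_array.
 */
int sketch_release(struct sketch_tx_buf *buf)
{
	unsigned int j;
	int released = 0;

	if (buf->frag_count > SKETCH_MAX_SKB_FRAGS + 1)
		return -1;

	for (j = 0; j < buf->frag_count; j++) {
		struct sketch_frag *f = &buf->frag_array[j];

		if (f->dma) {
			/* A real driver would unmap the DMA address here. */
			f->dma = 0;
			f->length = 0;
			released++;
		}
	}
	buf->frag_count = 0;
	return released;
}
```

The point of the sketch is only that the loop bound is checked against
the array size before any entry is touched, so a count that exceeds the
array is rejected instead of being read as a zero or garbage length.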
