linux-kernel - Re: [PATCH] af_packet: Don't use skb after dev_queue

Message-id: <4B54E4EF.8080602@majjas.com>
Date:	Mon, 18 Jan 2010 17:47:11 -0500
From:	Michael Breuer <mbreuer@...jas.com>
To:	Jarek Poplawski <jarkao2@...il.com>
Cc:	Stephen Hemminger <shemminger@...ux-foundation.org>,
	David Miller <davem@...emloft.net>, akpm@...ux-foundation.org,
	flyboy@...il.com, linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org
Subject: Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit()

On 1/18/2010 5:17 PM, Jarek Poplawski wrote:
> On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote:
>    
>> Btw, I wonder if you could test it skipping the (HP?) switch?
>>      
> If so, then of course don't forget to try tcpdump on the router.
>
> Jarek P.
>    
Well, no... but I'm not sure that would show anything.

Setup diagram:

Server->gb switch-> (100mb) wifi router -> devices
                     |
               Win7 PC (gb)

The problem does not occur (at least I haven't been able to recreate it)
at 100 Mb, and the wifi router doesn't do 1 Gb. I drive the traffic from
the Win7 PC to the server. I've seen the loss when the only traffic
going through the wifi router was ping and DHCP. I've also never seen any
loss on a device directly attached to the 1 Gb switch. I can drive load
through the wifi router while driving load from the Win7 box, but I don't
see TX packet loss at all unless a DHCP RELEASE/RENEW is going on.

As there is no packet loss to devices not involved in the DHCP sequence
through the same path, I'm not really sure that the Gb switch is implicated.

As I don't have a standalone sniffer, I'm thinking that it might be 
easier to instrument places where the TX packet could be dropped and see 
at least whether it's getting to the card.
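
As a rough starting point before instrumenting the kernel itself, the
per-interface TX counters the kernel already exports can be snapshotted
around a DHCP release/renew. This is only a sketch: the interface name
"eth0" and the 10 second window are assumptions, and these counters show
only what the stack and driver account for, not what happens on the wire.

```python
import os
import time

# Standard per-interface counters exported under
# /sys/class/net/<iface>/statistics/
TX_COUNTERS = ("tx_packets", "tx_dropped", "tx_errors", "tx_carrier_errors")


def read_tx_counters(iface, base="/sys/class/net"):
    """Return the current TX counters for one interface as a dict of ints."""
    stats_dir = os.path.join(base, iface, "statistics")
    counters = {}
    for name in TX_COUNTERS:
        with open(os.path.join(stats_dir, name)) as f:
            counters[name] = int(f.read().strip())
    return counters


def diff_counters(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in before}


if __name__ == "__main__":
    iface = "eth0"  # assumption: replace with the server's actual interface
    before = read_tx_counters(iface)
    time.sleep(10)  # assumption: trigger the DHCP release/renew in this window
    after = read_tx_counters(iface)
    print(diff_counters(before, after))
```

If tx_packets grows by the expected amount and tx_dropped stays flat, the
packet most likely reached the driver, which would point further down
(card or wire) rather than at the stack.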

Given the circumstances of the TX drop, and that it was DHCP traffic
under load that caused the oops fixed by the two patches,
I'm thinking that the packet loss is the current manifestation of
whatever the underlying problem is. Given the extra hop required to
break things, and given that a DHCP release/renew seems to trigger
it, I keep coming back to the ARP logic as being somehow implicated.
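
One cheap way to check the ARP suspicion from userspace is to look at the
neighbour table while the release/renew runs and see whether the affected
device's entry goes incomplete. The sketch below parses text in the
/proc/net/arp layout; it is illustrative only, and an incomplete entry at
the moment of a drop would be a hint, not proof, that ARP is involved.

```python
ATF_COM = 0x02  # "completed entry" flag bit as reported in /proc/net/arp


def parse_arp_table(text):
    """Parse /proc/net/arp-style text into a list of dicts, skipping the header."""
    entries = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed lines
        ip, _hw_type, flags, mac, _mask, dev = fields[:6]
        entries.append(
            {"ip": ip, "flags": int(flags, 16), "mac": mac, "device": dev}
        )
    return entries


def incomplete_entries(entries):
    """Return entries whose hardware address has not been resolved."""
    return [e for e in entries if not e["flags"] & ATF_COM]


if __name__ == "__main__":
    with open("/proc/net/arp") as f:
        for entry in incomplete_entries(parse_arp_table(f.read())):
            print("unresolved:", entry["ip"], "on", entry["device"])
```

Running this in a loop during the DHCP sequence and lining the output up
against the TX drops would show whether the two coincide.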

If ARP is somehow involved, then I'd expect to see manifestations under
similar circumstances with other drivers. As the pskb_may_pull patch
stopped the crash, perhaps other drivers do suffer packet loss and it
just hasn't been widely noticed or attributed to the kernel, especially if
the network topology is a factor. I do know people at large enterprises
who have been complaining of what *could* be this same issue; however,
they're currently blaming their switch vendors. As most traffic is TCP,
this is really only noticed by those few places deeply concerned with
latency. It's likely something altogether different, but then again,
maybe not.