netdev - Re: r8169 driver fails to see IGMPv2 SAP announcements

Message-ID: <857EF17F-C708-4656-A7AF-D64A63223854@lincor.com>
Date:	Fri, 11 Apr 2008 08:43:55 +0100
From:	Glen Gray <glen.gray@...cor.com>
To:	David Stevens <dlstevens@...ibm.com>, netdev@...r.kernel.org
CC:	David Stevens <dlstevens@...ibm.com>,
	Francois Romieu <romieu@...zoreil.com>
Subject: Re: r8169 driver fails to see IGMPv2 SAP announcements

Ok, further to this, I've managed to do some testing at last, but I'm
looking for advice on how to debug this further.

I'm still sure this is tied to the IGMP version, as that's the only
difference I can see. So if I knew why the net device is getting set
up as IGMP v3 and not v2, I might be closer to solving this. From
looking at the code, it seems the IGMP version is decided by watching
for IGMP packets: if a version 1 or version 2 packet is seen within a
particular time frame, the version is set to v1 or v2 accordingly;
otherwise it defaults to v3. Is this correct?
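
As a rough illustration of the selection logic described above, here is
a minimal Python sketch. It is not the kernel code: the function name
and the two "seen until" arguments are hypothetical stand-ins for
per-interface timers that are armed when an older-version query is
received, and the fallback to v3 is the behaviour assumed in the
paragraph above.

```python
def igmp_version(v1_seen_until, v2_seen_until, now):
    """Pick the IGMP version a host interface would run (sketch only).

    v1_seen_until / v2_seen_until are hypothetical expiry times of the
    "older querier present" timers; 0 means no such query was ever seen.
    """
    if v1_seen_until and now < v1_seen_until:
        return 1  # a v1 query was seen recently
    if v2_seen_until and now < v2_seen_until:
        return 2  # a v2 query was seen recently
    return 3      # nothing older seen in the window: default to v3


if __name__ == "__main__":
    # No older query seen at all, so the interface stays at v3.
    print(igmp_version(0, 0, now=100))
    # A v2 query was seen and its timer has not yet expired.
    print(igmp_version(0, 150, now=100))
```

Under this model, an interface that never receives a v1 or v2 query
would stay at V3 in the Querier column of /proc/net/igmp.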

I've built the latest r8169 driver from Realtek on my current kernel,
and the latest Fedora 8 2.6.24.4 kernel rpm with some debug output
added to the rtl_set_rx_mode function. I can see that under both
kernels the mc_filter[0/1] elements are getting set to the same
values, as is the rx_mode.
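
For anyone wanting to cross-check the mc_filter values against the
addresses in /proc/net/dev_mcast, the sketch below reproduces the kind
of 64-bit hash filter that rtl_set_rx_mode is understood to build. It
rests on assumptions that should be verified against the driver source
in use: that the bit index is the top six bits of the kernel's
ether_crc() of the MAC address, and that ether_crc() is the
bit-reversed, non-inverted little-endian CRC-32. The helper names are
illustrative only.

```python
import zlib


def ether_crc(mac_bytes):
    """Assumed equivalent of the kernel's ether_crc(): bit-reversed CRC-32."""
    # zlib.crc32 applies a final inversion; undo it to get the raw CRC.
    crc = zlib.crc32(mac_bytes) ^ 0xFFFFFFFF
    # Reverse the 32 bits (what bitrev32() does in the kernel).
    return int(format(crc, "032b")[::-1], 2)


def mc_filter_for(macs):
    """Build the two 32-bit filter words from hex MAC strings (sketch)."""
    mc_filter = [0, 0]
    for mac in macs:
        bit_nr = ether_crc(bytes.fromhex(mac)) >> 26  # top 6 bits: 0..63
        mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31)
    return mc_filter


if __name__ == "__main__":
    # Addresses taken from the /proc/net/dev_mcast listing in this mail.
    macs = ["01005e7fffff", "01005e027ffe", "01005e43ffff", "01005e0000ff"]
    print(["%08x" % word for word in mc_filter_for(macs)])
```

If the printed words differ from the values the debug output shows, the
hash assumption above is the first thing to re-check.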

I performed a somewhat crude test by doing the following:

1) ethtool -S eth0, noting the multicast and broadcast counts
2) tcpdump -i eth0 ether multicast, dumping to a file for a period
   of time
3) ethtool -S eth0 again, to get the new counts, work out the
   differences and compare them to a wc -l of the tcpdump output
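
The arithmetic in this test can be scripted. The following is a minimal
Python sketch, assuming the "name: value" layout that ethtool -S prints
and the counter names "multicast" and "broadcast"; the sample text is
made up for illustration and is not output from the machine in
question. Since the tcpdump "ether multicast" filter also matches
broadcast frames, the sum of both deltas is the figure to compare with
the captured line count.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the "NIC statistics:" header
            stats[name.strip()] = int(value)
    return stats


def counter_delta(before, after, keys=("multicast", "broadcast")):
    """Return how much each selected counter grew between two snapshots."""
    return {key: after[key] - before[key] for key in keys}


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after the capture.
    before = "NIC statistics:\n     multicast: 10\n     broadcast: 4\n"
    after = "NIC statistics:\n     multicast: 25\n     broadcast: 9\n"
    delta = counter_delta(parse_ethtool_stats(before),
                          parse_ethtool_stats(after))
    print(delta, "total:", sum(delta.values()))
```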

I ran that test a couple of times, and in all cases there's a large
difference between what tcpdump reports (in promiscuous mode) and what
the net device stats are showing, namely that there are more packets
reported by ethtool -S eth0 than I can see with tcpdump.

Unfortunately, I can't do a similar test against the working Realtek  
driver on the older kernel as it doesn't support the ethtool -S  
command. And I can't compile the Realtek driver against the current  
2.6.24 kernel due to API changes.

A tcpdump on a working Realtek driver/older kernel shows the following
packets (same device, same net/switch):
16:05:54.147532 IP exterity1.labs.lincor.com.sapv1 > 239.255.255.255.sapv1: UDP, length 296
16:05:54.148298 IP exterity1.labs.lincor.com.sapv1 > 239.255.255.255.sapv1: UDP, length 287
16:05:54.149046 IP exterity1.labs.lincor.com.sapv1 > 239.255.255.255.sapv1: UDP, length 292
16:05:54.149776 IP exterity1.labs.lincor.com.sapv1 > 239.255.255.255.sapv1: UDP, length 291
16:05:54.150490 IP exterity1.labs.lincor.com.sapv1 > 239.255.255.255.sapv1: UDP, length 279

A working /proc/net/igmp
Idx	Device    : Count Querier	Group    Users Timer	Reporter
1	lo        :     0      V3
				010000E0     1 0:00000000		0
2	eth0      :     8      V2
				FF0000E0     1 0:00000000		1
				FFFFFFEF     2 0:00000000		1
				FFFFC3EF     1 0:00000000		1
				FE7F02E0     1 0:00000000		1
				010000E0     1 0:00000000		0
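
The Group column in /proc/net/igmp is easier to compare with the
tcpdump output once it is converted to dotted-quad form. A minimal
Python sketch follows, assuming the file was read on a little-endian
host (such as x86), where the hex value shows the address bytes in
reverse order; the function name is illustrative.

```python
import socket
import struct


def decode_igmp_group(hex_group):
    """Convert a /proc/net/igmp Group value to a dotted-quad address.

    Assumes a little-endian host, so the printed hex word has the
    address bytes reversed.
    """
    return socket.inet_ntoa(struct.pack("<I", int(hex_group, 16)))


if __name__ == "__main__":
    for group in ("FF0000E0", "FFFFFFEF", "FFFFC3EF", "FE7F02E0", "010000E0"):
        print(group, "->", decode_igmp_group(group))
```

With that assumption, FFFFFFEF decodes to 239.255.255.255, the group
the SAP packets above are sent to.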

A working /proc/net/dev_mcast
2    eth0            12    0     333300027ffe
2    eth0            1     0     01005e0000ff
2    eth0            1     0     01005e7fffff
2    eth0            1     0     01005e43ffff
2    eth0            1     0     01005e027ffe
2    eth0            1     0     3333ff402cbe
2    eth0            1     0     333300000001
2    eth0            1     0     01005e000001
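
The 01005e... entries above are the Ethernet addresses derived from the
joined IPv4 groups: the fixed prefix 01:00:5e followed by the low 23
bits of the group address. A minimal Python sketch of that standard
mapping, with an illustrative function name, makes it easy to confirm
that a given group (for example 239.255.255.255, which the SAP packets
are sent to) has a matching hardware entry.

```python
def ipv4_mcast_to_mac(ip):
    """Map an IPv4 multicast address to its Ethernet multicast MAC."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or not 224 <= octets[0] <= 239:
        raise ValueError("not an IPv4 multicast address: %r" % ip)
    # Only the low 23 bits of the group are carried into the MAC.
    return "01005e%02x%02x%02x" % (octets[1] & 0x7F, octets[2], octets[3])


if __name__ == "__main__":
    for group in ("239.255.255.255", "224.2.127.254", "224.0.0.1"):
        print(group, "->", ipv4_mcast_to_mac(group))
```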


On the 2.6.24 kernel, tcpdump simply doesn't show the SAP packets.

Not working /proc/net/igmp
Idx	Device    : Count Querier	Group    Users Timer	Reporter
1	lo        :     0      V3
				010000E0     1 0:00000000		0
2	eth0      :     8      V3
				FF0000E0     1 0:00000000		0
				FFFFFFEF     2 0:00000000		0
				FFFFC3EF     1 0:00000000		0
				FE7F02E0     1 0:00000000		0
				010000E0     1 0:00000000		0

[root@...note root]# cat /proc/net/dev_mcast
2    eth0            12    0     333300027ffe
2    eth0            1     0     01005e0000ff
2    eth0            1     0     01005e7fffff
2    eth0            1     0     01005e43ffff
2    eth0            1     0     01005e027ffe
2    eth0            1     0     3333ff402cbe
2    eth0            1     0     333300000001
2    eth0            1     0     01005e000001

Pointers on where to look next are most welcome.

Kind Regards,
--
Glen Gray <glen.gray@...cor.com>         Digital Depot, Thomas Street
Senior Software Engineer                            Dublin 8, Ireland
Lincor Solutions Ltd.                          Ph: +353 (0) 1 4893682

On 20 Mar 2008, at 20:48, David Stevens wrote:
> Hi, Glen,
>        From your detailed description, and particularly the fact
> that the problem seems to be tied to the driver & device, I think
> I'd recommend looking at the multicast address filter code in the
> driver. IGMP is not device dependent, so I doubt that is the
> source of the problem.
>        If you can reproduce the problem, then while it's
> happening:
>
> 1) catch the group memberships by saving /proc/net/igmp
> 2) catch the hardware group memberships by saving
>        /proc/net/dev_mcast
> [I expect from the symptoms that 1) is ok, 2) may or may not be...]
> and...
> 3) run tcpdump or wireshark in promiscuous mode
>        - if the device address filter is the problem, when you
>        put the device in promiscuous mode, everything will
>        start working again, until you exit tcpdump. You will
>        also see the packets you aren't receiving are being
>        sent, if that's the problem.
>
> I understand you probably can't directly reproduce it, and
> the visual artifacts you mentioned in that one test may or
> may not be the same issue as the other one.
>
> Another possibility that comes to mind is a memory leak,
> if the response problems are related to a low memory
> condition. So, that might be something else to look for.
> Compare memory usage with ordinary usage, check
> for log messages of allocation failures and check "netstat -s"
> output for any indication of drops.
>
> If you can set a program or script to monitor the system
> and detect when you hit the problem, then you could use
> that to trigger running a script that captures the data you
> need when it happens.
>
>                                        +-DLS
>
