netdev - Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57291539.6080405@mentor.com>
Date:	Tue, 3 May 2016 22:16:41 +0100
From:	Dean Jenkins <Dean_Jenkins@...tor.com>
To:	"David B. Robins" <linux@...idrobins.net>,
	John Stultz <john.stultz@...aro.org>,
	Dean Jenkins <Dean_Jenkins@...tor.com>
CC:	lkml <linux-kernel@...r.kernel.org>,
	Mark Craske <Mark_Craske@...tor.com>,
	"David S. Miller" <davem@...emloft.net>,
	YongQin Liu <yongqin.liu@...aro.org>,
	Guodong Xu <guodong.xu@...aro.org>,
	<linux-usb@...r.kernel.org>, <netdev@...r.kernel.org>,
	Ivan Vecera <ivecera@...hat.com>
Subject: Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow
 transmissions

On 03/05/16 15:42, David B. Robins wrote:
>
> I don't think the first one is giving you problems (except as 
> triggered by the second) but I had concerns about the second myself 
> (and emailed the author off-list, but received no reply), and we did 
> not take that commit for our own product.
>
Sorry, I might have missed your original E-mail.

> Specifically, the second change, 3f30... (original patch: 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg80720.html) (1) 
> appears to do the exact opposite of what it claims, i.e., instead of 
> "resync if this looks like a header", it does "resync if this does NOT 
> look like a (packet) header", where "looks like a header" means "bits 
> 0-10 (size) are equal to the bitwise-NOT of bits 16-26", and (2) can 
> happen by coincidence for 1/2048 32-bit values starting a continuation 
> URB (easy to hit dealing with large volumes of video data as we were). 
> It appears to expect the header for every URB whereas the rest of the 
> code at least expects it only once per network packet (look at 
> following code that only reads it for remaining == 0).

David, I think that your interpretation is incorrect. Please see below.

Here is the code snippet from the patch with my annotations between # #, 
I will try to explain my intentions. Feel free to point out any flaws:

     if (rx->remaining && (rx->remaining + sizeof(u32) <= skb->len)) {
         # Only runs when rx->remaining !=0 and the end of the Ethernet 
frame + next 32-bit header word is within the URB buffer. #
         # Therefore, this code does not run when the end of an Ethernet 
frame has been reached in the previous URB #
         # or when the end of the Ethernet frame + next 32-bit header 
word will be in a later URB buffer #

         # offset is an index to the expected next 32-bit header word 
after the end of the Ethernet frame #
         offset = ((rx->remaining + 1) & 0xfffe) + sizeof(u32);

         # rx->header contains the expected 32-bit header value 
corrected for Endianness and alignment #
         rx->header = get_unaligned_le32(skb->data + offset);
         offset = 0;

         # check the data integrity of the size value from the header word #
         size = (u16)(rx->header & 0x7ff);
         # if the size value fails the integrity check then we are not 
looking at a valid header word so #
         # synchronisation has been lost #
         if (size != ((~rx->header >> 16) & 0x7ff)) {
             netdev_err(dev->net, "asix_rx_fixup() Data Header 
synchronisation was lost, remaining %d\n",
                    rx->remaining);
             if (rx->ax_skb) {
                 kfree_skb(rx->ax_skb);
                 rx->ax_skb = NULL;
                 /* Discard the incomplete netdev Ethernet frame
                  * and assume the Data header is at the start of
                  * the current URB socket buffer.
                  */
             }
             rx->remaining = 0;
         }
     }


>
> So that change made no sense to me, but I don't have significant 
> kernel dev experience. Effectively it will drop/truncate every 
> (2047/2048) split (longer than an URB) packet, and report an error for 
> the second URB and then again for treating said second URB as a first 
> URB for a packet. I would expect your problems will go away just 
> removing the second change. You could also change the != to == in "if 
> (size != ...)" but then you'd still have 1/2048 (depending on data 
> patterns) false positives.
The code only runs when the Ethernet frame spans across URBs and is 
checking that the next 32-bit header word is present and valid.

Upon loss of synchronisation, the strategy is to assume that the 32-bit 
header is at the start of the URB buffer. Obviously, that might not be 
true every time but it is the most likely location especially when 
Ethernet frames are not spanning URBs at that point at time.

Looking at the error messages:

> [  239.037310] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length
> 0x54ebb5ec, offset 4 

The offset 4 means that the 32-bit header word was invalid at the start 
of the URB buffer. This could be a consequence of data synchronisation 
being lost however, we would expect the timestamps between the error 
messages of "synchronisation was lost" and "Bad Header Length" to very 
close as they would be consecutive URBs. The evidence is showing 10ms 
gaps which does not suggest consecutive URBs. In other words, an 
Ethernet frame should not be spanned over a time gap of 10ms as that 
would be very inefficient. If that were true then there would be USB 
communications problem with the USB to Ethernet adaptor which I hope is 
not true.

[  239.027993] asix 1-1.1:1.0 eth0: asix_rx_fixup() Data Header
synchronisation was lost, remaining 988

This error message consistently shows the remaining value to be 988, at 
least for the 3 examples provided by John. This does not suggest a 
random failure unless there are other examples of a non 988 remaining 
value error message. 988 is well within a Ethernet frame length so seems 
to be valid.

I think a good step would be to add some debug to print the 
rx->remaining value at entry to asix_rx_fixup_internal(). This would 
generate a lot of debug but a pattern of the values might emerge.

A good test would be to run "ping -c 1 -s $packet_length $ip_address" 
inside a script which has a loop with an increasing payload length 
$packet_length with a small delay between ping calls. This will show 
whether particular packet sizes trigger the failures.

Then try with "ping -f -c 200 -s $packet_length $ip_address" to load up 
the USB link.

Seems that I need kernel v4.4 or later to get a kernel with my patch in. 
This will take me a few days to find time to rig something up to test...

Regards,
Dean

-- 
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.