netdev - Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <f15d8b59-7e89-225b-9c52-52a61713cc05@bitwise.fi>
Date:   Mon, 3 Dec 2018 12:31:52 +0200
From:   Anssi Hannula <anssi.hannula@...wise.fi>
To:     Harini Katakam <harinik@...inx.com>
Cc:     Nicolas Ferre <nicolas.ferre@...rochip.com>,
        David Miller <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race

Hi,

On 3.12.2018 6:52, Harini Katakam wrote:
> Hi Anssi,
> On Fri, Nov 30, 2018 at 11:53 PM Anssi Hannula <anssi.hannula@...wise.fi> wrote:
>> Bit RX_USED set to 0 in the address field allows the controller to write
>> data to the receive buffer descriptor.
>>
>> The driver does not ensure the ctrl field is ready (cleared) when the
>> controller sees the RX_USED=0 written by the driver. The ctrl field might
>> only be cleared after the controller has already updated it according to
>> a newly received frame, causing the frame to be discarded in gem_rx() due
>> to unexpected ctrl field contents.
>>
>> A message is logged when the above scenario occurs:
>>
>>   macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor
>>
>> Fix the issue by ensuring that when the controller sees RX_USED=0 the
>> ctrl field is already cleared.
>>
>> This issue was observed on a ZynqMP based system.
>>
> Thanks for the patch.
> Could you please describe the test in which this behavior was observed?

Sure. The testcase I used for the patches is:

- RT_FULL kernel,
- CPU-bound SCHED_FF RT priority 15 process (with
rcutree.kthread_prio=20 to avoid RCU starvation),
- Pyropus memtester running for 3GB (system has 4GB memory),
- "ping -f -l 5000 -s 100" running from a PC.

The "not whole frame pointed by descriptor" issue occurs within minutes
and the RX memory corruption within an hour. I did not try to reduce the
testcase to a minimum.

Both were also observed using real production loads (that of course do
not have CPU-bound RT tasks).

> Were you able to confirm that this was because of the ctrl field being
> cleared late? This error can also be observed under stress when RX UBR
> is observed.

I observed that the issue occurred without this patch, and didn't occur
after applying this patch (individually), but I didn't check it further
than that. If you have anything you'd like me to test, let me know.

> I understand it makes sense to clear ctrl field before setting RX used bit.
> But I'm trying to understand if a dmb is necessary in the receive data path.

A barrier is needed as otherwise writes to the ctrl field and addr
fields may be freely reordered as they are just regular writes to
memory, as far as I've understood it, potentially undoing the effect of
changing the order.

If you mean the third patch, same issue with reads - they may be freely
reordered (by e.g. the compiler) as they are just regular reads.

But I don't claim to be an expert on these so please correct me if I'm
wrong :)

-- 
Anssi Hannula / Bitwise Oy