Message-ID: <CAFcVECLhZAcQFxB7FxJyXYfyNdGZ3oJf0Sei8DFige5YSU1DWw@mail.gmail.com>
Date: Mon, 3 Dec 2018 16:06:38 +0530
From: Harini Katakam <harinik@...inx.com>
To: anssi.hannula@...wise.fi
Cc: Nicolas Ferre <nicolas.ferre@...rochip.com>,
David Miller <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race

Hi Anssi,

On Mon, Dec 3, 2018 at 4:02 PM Anssi Hannula <anssi.hannula@...wise.fi> wrote:
>
> Hi,
>
> On 3.12.2018 6:52, Harini Katakam wrote:
> > Hi Anssi,
> > On Fri, Nov 30, 2018 at 11:53 PM Anssi Hannula <anssi.hannula@...wise.fi> wrote:
> >> Bit RX_USED set to 0 in the address field allows the controller to write
> >> data to the receive buffer descriptor.
> >>
> >> The driver does not ensure the ctrl field is ready (cleared) when the
> >> controller sees the RX_USED=0 written by the driver. The ctrl field might
> >> only be cleared after the controller has already updated it according to
> >> a newly received frame, causing the frame to be discarded in gem_rx() due
> >> to unexpected ctrl field contents.
> >>
> >> A message is logged when the above scenario occurs:
> >>
> >> macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor
> >>
> >> Fix the issue by ensuring that when the controller sees RX_USED=0 the
> >> ctrl field is already cleared.
> >>
> >> This issue was observed on a ZynqMP based system.
> >>
> > Thanks for the patch.
> > Could you please describe the test in which this behavior was observed?
>
> Sure. The testcase I used for the patches is:
>
> - RT_FULL kernel,
> - CPU-bound SCHED_FIFO RT priority 15 process (with
> rcutree.kthread_prio=20 to avoid RCU starvation),
> - Pyropus memtester running for 3GB (system has 4GB memory),
> - "ping -f -l 5000 -s 100" running from a PC.
>
> The "not whole frame pointed by descriptor" issue occurs within minutes
> and the RX memory corruption within an hour. I did not try to reduce the
> testcase to a minimum.
>
> Both were also observed using real production loads (that of course do
> not have CPU-bound RT tasks).
>
> > Were you able to confirm that this was because of the ctrl field being
> > cleared late? This error can also be observed under stress when RX UBR
> > is observed.
>
> I observed that the issue occurred without this patch, and didn't occur
> after applying this patch (individually), but I didn't check it further
> than that. If you have anything you'd like me to test, let me know.

Thanks for the details.

Regards,
Harini
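
For reference, the ordering described in the patch (clear ctrl first, and only then let the controller see RX_USED=0) can be sketched roughly as below. This is a userspace illustration, not the driver code: struct rx_desc, rx_desc_release(), rx_desc_owned_by_hw() and the position of RX_USED in the address word are assumptions made for the example, and the C11 release fence only stands in for a kernel DMA write barrier such as dma_wmb().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption for this sketch: RX_USED is bit 0 of the address word. */
#define RX_USED 0x1u

/* Hypothetical two-word RX buffer descriptor shared with the controller. */
struct rx_desc {
    volatile uint32_t addr; /* buffer address; bit 0 doubles as RX_USED */
    volatile uint32_t ctrl; /* frame status, written by the controller */
};

/* Hand a descriptor back to the controller. */
static void rx_desc_release(struct rx_desc *desc, uint32_t buf_addr)
{
    /* 1. Clear ctrl while software still owns the descriptor. */
    desc->ctrl = 0;

    /*
     * 2. Order the ctrl store before the addr store.
     *    Userspace stand-in for a kernel DMA write barrier.
     */
    atomic_thread_fence(memory_order_release);

    /* 3. Only now publish RX_USED=0, giving ownership to the controller. */
    desc->addr = buf_addr & ~RX_USED;
}

/* Returns nonzero when the controller may write to this descriptor. */
static int rx_desc_owned_by_hw(const struct rx_desc *desc)
{
    return (desc->addr & RX_USED) == 0;
}
```

If the two stores were issued in the opposite order, the controller could fill in ctrl for a new frame before the driver's late ctrl = 0 lands, which matches the "not whole frame pointed by descriptor" symptom described above.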