netdev - Re: [Linuxptp-devel] strangeness

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD56B7cP_xEW_gtHm3AYdWrAig9a3zrX8o936b0dEU3ScdEB+w@mail.gmail.com>
Date:   Fri, 8 Mar 2019 16:41:26 -0500
From:   Paul Thomas <pthomas8589@...il.com>
To:     Harini Katakam <harinik@...inx.com>
Cc:     "linuxptp-devel@...ts.sourceforge.net" 
        <linuxptp-devel@...ts.sourceforge.net>, netdev@...r.kernel.org
Subject: Re: [Linuxptp-devel] strangeness

On Fri, Mar 8, 2019 at 1:07 PM Paul Thomas <pthomas8589@...il.com> wrote:
>
> Hi Harini,
>
> On Fri, Mar 8, 2019 at 1:08 AM Harini Katakam <harinik@...inx.com> wrote:
> >
> > Hi Paul,
> > On Fri, Mar 8, 2019 at 12:33 AM Paul Thomas <pthomas8589@...il.com> wrote:
> > >
> > > On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@...inx.com> wrote:
> > > >
> > > > Hi Paul,
> > > > On Thu, Mar 7, 2019 at 4:38 AM Paul Thomas <pthomas8589@...il.com> wrote:
> > > > >
> > > > > On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@...inx.com> wrote:
> > > > > >
> > > > > > +netdev
> > > > > >
> > > > > > Hi Paul,
> > > > > > On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> > > > > > <richardcochran@...il.com> wrote:
> > > > > > >
> > > > > > > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > > > > > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > > > > > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > > > > > > somewhere in the network stack that this is just masking.
> > > > > >
> > > > > > Ok thanks.
> > > > > > One place to check in the driver will be:
> > > > > > if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> > > > > > /* skb now belongs to timestamp buffer
> > > > > > * and will be removed later
> > > > > > */
> > > > > > tx_skb->skb = NULL;
> > > > > > }
> > > > > > When all TX packets are timestamped, the skb always belongs to the
> > > > > > timestamp buffer.
> > > > > >
> > > > > > >
> > > > > > > Or the HW isn't sending the frames in the first place.
> > > > > > >
> > > > > > > Check that first!
> > > > > >
> > > > > > To check this, the statistics registers in MAC will be one way.
> > > > > > But if there is no TX completion interrupt, then I wouldn't expect
> > > > > > these statistics to increase either. The used bit status in BD dump
> > > > > > might be of more use.
> > > > > >
> > > > > > I will also try to reproduce (with TX timestamp ALL) and see if any of
> > > > > > the above gives some clue.
> > > > > >
> > > > > > Regards,
> > > > > > Harini
> > > > >
> > > > > Hi Harini, any luck looking at this?
> > > >
> > > > I'm sorry, I was not able to debug this further.
> > > >
> > > > >
> > > > > I didn't get very far, even in the "broken" state I see plenty of tx_frames:
> > > > > root@xu5:/opt/linuxptp# ethtool -S eth0
> > > > > NIC statistics:
> > > > >      ...
> > > > >      tx_frames: 39763
> > > > >      ...
> > > > >
> > > > > When you said "registers in the MAC" is ethtool -S displaying that?
> > > >
> > > > Yes, ethtool does display these statistics.
> > > > I was referring to the registers starting offset 0xFF0B0108 (for GEM0) here:
> > > > https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
> > > > If you see this value increasing, then the MAC is transmitting successfully.
> > > > Although, I realize it could be other traffic. To see if specific
> > > > packets (for the
> > > > failed SSH connection) are not being queued, a BD dump might help.
> > > >
> > > > Regards,
> > > > Harini
> > >
> > > OK, I think things are becoming more clear. After just doing ioctl(fd,
> > > SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
> > > TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
> > > transmits do not make it to macb_start_xmit() until receive traffic on
> > > the nc connection comes in (one-to-one, one new rx packet means one
> > > old tx packet goes out).
> >
> > Could you please share any wireshark log or dump for what is being
> > received here?
>
> Here are two wireshark captures, the thing to note in the bad one is
> that packets No. 5, 7, 9 from .102 to .103 were actually sent just
> after packet No. 2 but they don't show up on the wire until the
> packets the other way (one for one).
>
> >
> > >
> > > Working setup:
> > > Before the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > Every time I hit "sN Enter" from nc I see a macb_start_xmit
> > > print_hex_dump() and I see the packet on the nc client side:
> > > # nc -l -u -p 9999
> > > ...
> > > s11
> > > [  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > s12
> > > [  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > ...
> > >
> > > Broken setup:
> > > After the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > Not the first nc packet, but many of the subsequent ones never make it
> > > to macb_start_xmit()
> > > # nc -l -u -p 9999
> > > ...
> > > s3
> > > s4
> > > s5
> > > ...
> > > Eventually after I send data from the client nc I do see the
> > > macb_start_xmit() lines.
> >
> > Thanks for this debug. If macb_start_xmit is never called, one of
> > the preceeding checks (such as if skb is present or if the TX queues
> > are off  etc)
> > should fail. I'm still tracing this but I'm not sure under what
> > circumstances only
> > some UDP packets will be prevented from being transmitted.
> In this specific test the first tx packets always goes through, and
> the subsequent ones don't until rx packets. So it's not random when
> they go through, I could have been clearer about that.
>
> > Just to be sure, could you please confirm you are not seeing any
> > "buffer exhausted" messaged from TX error tasks?
> Correct, I'm not seeing any "buffer exhausted" errors.
>
> thanks,
> Paul

And one more piece that may be helpful. I think I narrowed down what's
happening in the receive that finally flushes out a pending tx packet.
It seems to be the netif_receive_skb(skb); line in gem_rx() (line
1067). I tested with an mdelay before and after this call:
        mdelay(1000);//mdelay here is slow to flush the pending tx
packet (as seen by nc client)
        netif_receive_skb(skb);
        //mdelay(1000);//mdelay here is fast to flush the pending tx
packet (as seen by nc client)
This seems very strange to me, I quickly glanced at what
netif_receive_skb() is doing and didn't see anything connected with
the TX path, but those are the symptoms.

thanks,
Paul