netdev - Re: [Linuxptp-devel] strangeness


Message-ID: <CAFcVECL5+wes5gQ2R6zF=V_-8bWff73FoAV1FYRB1ZvNf1URiw@mail.gmail.com>
Date:   Tue, 12 Mar 2019 15:40:05 +0530
From:   Harini Katakam <harinik@...inx.com>
To:     Paul Thomas <pthomas8589@...il.com>
Cc:     "linuxptp-devel@...ts.sourceforge.net" 
        <linuxptp-devel@...ts.sourceforge.net>, netdev@...r.kernel.org
Subject: Re: [Linuxptp-devel] strangeness

Hi Paul,
On Tue, Mar 12, 2019 at 8:26 AM Paul Thomas <pthomas8589@...il.com> wrote:
>
> Hi All,
>
> Let me do a quick clean recap of this issue.
>
> On a Debian arm64 system with a 5.0-rc8 kernel using the macb driver on
> zynqmp, enabling tx timestamping (1) breaks networking! The first and
> most noticeable way is that you can no longer connect with ssh. This
> is a serious bug somewhere and merits some attention.
>
> Trying to debug ssh is a possibility, but I was trying to debug with
> something easier and thus the netcat testing. The specific issue can
> be seen in the following strace. In this setup nc just connects to a
> server and tries to send two packets (2). The first packet goes
> through fine, but the second doesn't because nc is stuck forever
> trying to read from the socket.
> pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0])  <-- waiting on stdin and UDP sock
> read(0, "c1\n", 8192) = 3  <-- read three chars from stdin
> write(3, "c1\n", 3) = 3  <-- write those out on the UDP sock
> pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3])  <-- waiting on stdin and UDP sock
> read(3,  <-- waits forever here as there is no data to read
>
> I've been reading more; an old patch and the timestamping.txt doc
> helped me understand a little more of what's going on:
> https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller-hub.jf.intel.com/
> https://www.kernel.org/doc/Documentation/networking/timestamping.txt
>
> So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then the
> select should in fact return, but it is not set in this case. I can
> see everything that is going on in datagram_poll() in datagram.c. The
> main difference is that in the broken case the mask is 0x30c and in
> the working case it is 0x304. The difference is EPOLLERR, which the
> code clearly sets if !skb_queue_empty(&sk->sk_error_queue).
>
> Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
> it's behaving as it should (except that things break). My first
> question is: should sk_error_queue be empty if there is a tx
> timestamp available (in datagram_poll() in datagram.c)? If it's not
> empty, I don't see what else the SO_SELECT_ERR_QUEUE flag is doing for
> select(), and I don't see what would be different about the macb/arm64
> setup.
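
For reference, the mask arithmetic in the recap above can be checked with a
small Python sketch. The bit values are transcribed from the Linux uapi
header include/uapi/linux/eventpoll.h, and POLLIN_SET is modelled on the
definition in fs/select.c; the helper names decode_mask() and
select_reports_readable() are made up for illustration and are not kernel
functions.

```python
# Poll bit values as found in include/uapi/linux/eventpoll.h (transcribed by hand).
EPOLL_BITS = {
    "EPOLLIN": 0x001,
    "EPOLLPRI": 0x002,
    "EPOLLOUT": 0x004,
    "EPOLLERR": 0x008,
    "EPOLLHUP": 0x010,
    "EPOLLRDNORM": 0x040,
    "EPOLLRDBAND": 0x080,
    "EPOLLWRNORM": 0x100,
    "EPOLLWRBAND": 0x200,
}

# Modelled on POLLIN_SET in fs/select.c: the bits that make select()
# report a descriptor in the read set. Note that EPOLLERR is included.
POLLIN_SET = sum(
    EPOLL_BITS[name]
    for name in ("EPOLLRDNORM", "EPOLLRDBAND", "EPOLLIN", "EPOLLHUP", "EPOLLERR")
)


def decode_mask(mask):
    """Return the names of the poll bits set in mask (hypothetical helper)."""
    return [name for name, bit in EPOLL_BITS.items() if mask & bit]


def select_reports_readable(mask):
    """True if select() would place the fd in the read set for this mask."""
    return bool(mask & POLLIN_SET)


for label, mask in (("working", 0x304), ("broken", 0x30C)):
    print(label, hex(mask), decode_mask(mask),
          "readable:", select_reports_readable(mask))
```

Under these assumptions the only bit that differs between the two masks is
EPOLLERR, and that bit alone is enough for pselect6() to report fd 3 in the
read set even though there is no datagram for the following read() to return.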

Thanks for the summary.
I think sk_error_queue should be empty, because packets are queued to
it via skb_complete_timestamp (sock_queue_err_skb), and that should
not be called in this flow. I'm sorry if I'm missing something - I'll let
others from netdev comment.
I'm not sure why EPOLLERR is being set in this case.
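
One way to check from userspace whether anything really is sitting on the
error queue is to read it with MSG_ERRQUEUE, as described in the
timestamping.txt document linked above. The following is only a minimal
Python sketch under that assumption: drain_error_queue() is a hypothetical
helper, the buffer sizes are arbitrary, and the 0x2000 fallback is the Linux
value of MSG_ERRQUEUE for Python builds that do not export the constant.

```python
import socket

# MSG_ERRQUEUE is Linux-specific; 0x2000 is its value in the Linux headers.
MSG_ERRQUEUE = getattr(socket, "MSG_ERRQUEUE", 0x2000)


def drain_error_queue(sock, bufsize=2048, ancbufsize=512):
    """Collect everything currently on sock's error queue without blocking.

    Each entry is (payload, ancillary_data, msg_flags) as returned by
    socket.recvmsg(). An empty list means the error queue was empty.
    """
    entries = []
    while True:
        try:
            data, ancdata, msg_flags, _addr = sock.recvmsg(
                bufsize, ancbufsize, MSG_ERRQUEUE | socket.MSG_DONTWAIT
            )
        except BlockingIOError:
            # EAGAIN: nothing (more) queued on the error queue.
            break
        entries.append((data, ancdata, msg_flags))
    return entries


# A freshly created UDP socket has nothing on its error queue.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as demo_sock:
    print("queued entries:", len(drain_error_queue(demo_sock)))
```

Calling such a helper on the UDP socket from the netcat test, right after the
first write, would show whether an skb has been queued there and what
ancillary data it carries.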

Regards,
Harini

>
> Any insight here would be very much appreciated.
>
> thanks,
> Paul
>
> (1) hwstamp_ctl -i eth0 -t 1
>
> (2) The actual script to be able to run nc and strace from a single
> serial console is slightly clever:
> (sleep 3; echo "c1"; sleep 1; echo "c2") | nc -u 10.1.155.100 9999 &
> strace -p $(ps -A | grep nc | awk '{print $1}')