netdev - Re: [Linuxptp-devel] strangeness

Message-ID: <CAD56B7fuQ9_UmMUwQ3PXm0hRA1YusS--q3_D76jehTtkCfZGNA@mail.gmail.com>
Date:   Mon, 11 Mar 2019 22:55:31 -0400
From:   Paul Thomas <pthomas8589@...il.com>
To:     Harini Katakam <harinik@...inx.com>
Cc:     "linuxptp-devel@...ts.sourceforge.net" 
        <linuxptp-devel@...ts.sourceforge.net>, netdev@...r.kernel.org
Subject: Re: [Linuxptp-devel] strangeness

Hi All,

Let me give a quick, clean recap of this issue.

On a Debian arm64 system with a 5.0-rc8 kernel using the macb driver on
zynqmp, enabling tx timestamping (1) breaks networking! The first and
most noticeable symptom is that you can no longer connect with ssh.
This is a serious bug somewhere and merits some attention.

Debugging ssh directly is a possibility, but I wanted something easier
to debug, hence the netcat testing. The specific issue can be seen in
the following strace. In this setup nc just connects to a server and
tries to send two packets (2). The first packet goes through fine, but
the second doesn't because nc is stuck forever trying to read from the
socket.

pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0])  <-- waiting on stdin and UDP sock
read(0, "c1\n", 8192) = 3  <-- read three chars from stdin
write(3, "c1\n", 3) = 3  <-- write those out on the UDP sock
pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3])  <-- waiting on stdin and UDP sock
read(3,  <-- waits forever here as there is no data to read

I've been reading more. An old patch and the timestamping.txt doc
helped me understand a little more of what's going on:
https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller-hub.jf.intel.com/
https://www.kernel.org/doc/Documentation/networking/timestamping.txt

So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then the
select should in fact return, but the flag is not set in this case. I
can see everything that is going on in datagram_poll() in datagram.c.
The main difference is that in the broken case the mask is 0x30c and in
the working case it is 0x304. The differing bit is EPOLLERR (0x8), which
the code clearly sets if !skb_queue_empty(&sk->sk_error_queue).
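
For anyone who wants to check the two masks by hand, here is a small
sketch that decodes them into EPOLL* names. The bit values are the ones
from include/uapi/linux/eventpoll.h; the helper name decode_mask is
made up for this illustration and is not anything in the kernel.

```python
# Bit values as defined in include/uapi/linux/eventpoll.h.
EPOLL_BITS = {
    0x001: "EPOLLIN",
    0x002: "EPOLLPRI",
    0x004: "EPOLLOUT",
    0x008: "EPOLLERR",
    0x010: "EPOLLHUP",
    0x020: "EPOLLNVAL",
    0x040: "EPOLLRDNORM",
    0x080: "EPOLLRDBAND",
    0x100: "EPOLLWRNORM",
    0x200: "EPOLLWRBAND",
}


def decode_mask(mask):
    """Return the names of the EPOLL* bits set in mask, lowest bit first.

    Hypothetical helper for illustration only.
    """
    return [name for bit, name in sorted(EPOLL_BITS.items()) if mask & bit]


working = decode_mask(0x304)
broken = decode_mask(0x30C)

print("working (0x304):", working)
print("broken  (0x30c):", broken)
print("only in broken: ", sorted(set(broken) - set(working)))
```

Both masks report the socket as writable; the only bit that differs
between them is 0x8.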

Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
it's behaving as it should (except that things break). My first
question is: should the sk_error_queue be empty if there is a tx
timestamp available (in datagram_poll() in datagram.c)? If it's not
empty, I don't see what else the SO_SELECT_ERR_QUEUE flag is doing for
select(), and I don't see what would be different about the macb/arm64
setup.
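
To make the select() side concrete, here is a rough model of how a poll
mask is turned into select()'s read/write/except sets. This is only a
sketch based on my reading of the POLLIN_SET, POLLOUT_SET and POLLEX_SET
definitions in fs/select.c and of the error-queue branch in
datagram_poll() around 5.0; please check it against your own tree. The
function names select_sets and with_error_queue are invented for the
example.

```python
# EPOLL* bit values from include/uapi/linux/eventpoll.h.
EPOLLIN, EPOLLPRI, EPOLLOUT, EPOLLERR, EPOLLHUP = 0x001, 0x002, 0x004, 0x008, 0x010
EPOLLRDNORM, EPOLLRDBAND, EPOLLWRNORM, EPOLLWRBAND = 0x040, 0x080, 0x100, 0x200

# Assumption: these mirror the definitions in fs/select.c (circa 5.0).
POLLIN_SET = EPOLLRDNORM | EPOLLRDBAND | EPOLLIN | EPOLLHUP | EPOLLERR
POLLOUT_SET = EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT | EPOLLERR
POLLEX_SET = EPOLLPRI


def select_sets(mask):
    """Report which select() fd sets a poll mask would mark as ready."""
    return {
        "read": bool(mask & POLLIN_SET),
        "write": bool(mask & POLLOUT_SET),
        "except": bool(mask & POLLEX_SET),
    }


def with_error_queue(mask, select_err_queue=False):
    """Model datagram_poll() when sk_error_queue is not empty.

    Assumption: EPOLLERR is always added, and EPOLLPRI is added only
    when SO_SELECT_ERR_QUEUE has been set on the socket.
    """
    mask |= EPOLLERR
    if select_err_queue:
        mask |= EPOLLPRI
    return mask


print("working 0x304:", select_sets(0x304))
print("broken  0x30c:", select_sets(0x30C))
print("with SO_SELECT_ERR_QUEUE:",
      select_sets(with_error_queue(0x304, select_err_queue=True)))
```

Under these assumptions, EPOLLERR alone is enough to mark the socket
readable for a caller that only passes a read set, which would match
the pselect6() return followed by the blocking read() in the strace.
The flag would then only add EPOLLPRI, i.e. readiness in the except set.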

Any insight here would be very much appreciated.

thanks,
Paul

(1) hwstamp_ctl -i eth0 -t 1

(2) The actual script to be able to run nc and strace from a single
serial console is slightly clever:
(sleep 3; echo "c1"; sleep 1; echo "c2") | nc -u 10.1.155.100 9999 &
strace -p $(ps -A | grep nc | awk '{print $1}')