Message-ID: <CANe27jK63DU-7frhE79TipVXZe9U3uegTNUJay9TrD0R1W+hBw@mail.gmail.com>
Date: Sat, 13 Aug 2016 12:12:26 +0300
From: Jouni Malinen <jkmalinen@...il.com>
To: rcochran@...utronix.de
Cc: Thomas Gleixner <tglx@...utronix.de>,
Rik van Riel <riel@...hat.com>, Len Brown <lenb@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Frederic Weisbecker <fweisbec@...il.com>,
LKML <linux-kernel@...r.kernel.org>,
George Spelvin <linux@...encehorizons.net>,
Josh Triplett <josh@...htriplett.org>,
Chris Mason <clm@...com>, Eric Dumazet <edumazet@...gle.com>,
rt@...utronix.de, Paul McKenney <paulmck@...ux.vnet.ibm.com>,
Ingo Molnar <mingo@...nel.org>,
Arjan van de Ven <arjan@...radead.org>, j <j@...fi>
Subject: Re: [PREEMPT-RT] [patch 4 14/22] timer: Switch to a non cascading wheel
On Thu, Aug 11, 2016 at 11:25 PM, <rcochran@...utronix.de> wrote:
> On Thu, Aug 11, 2016 at 06:21:26PM +0300, Jouni Malinen wrote:
>> The test code looked like this in python:
>>
>> addr = (url.hostname, url.port)
>> socks = {}
>> for i in range(20):
>>     socks[i] = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
>>                              socket.IPPROTO_TCP)
>>     socks[i].connect(addr)
>
> Are you getting a timeout on TCP connect()? Isn't that timeout really
> long, like 75 seconds or something?
Yes, it looks like a TCP connect() timeout. I use a significantly
reduced timeout in the test scripts since they are run unattended and
are supposed to terminate in a reasonable amount of time. That said,
this specific test case should not really have used as short a timeout
as it did, i.e., just one second. Interestingly enough, increasing
that to just 1.1 seconds was enough to make the test case pass.
Looking at the time it takes to execute connect(), it is 1.02 - 1.08
seconds for the timing-out case (with timeout=1 sec), which is now at
i=14, while all the other 19 calls take 0.0 seconds.
If I increase that 20 to 50, I get more of these roughly 1.03 second
results, at i=17, i=34, and i=48.
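As a minimal sketch of how this per-call connect() timing can be
measured (the address is a placeholder for what the test derives from
url.hostname/url.port, and a server needs to be listening there):

import socket
import time

addr = ("127.0.0.1", 49152)  # placeholder; the test uses url.hostname/url.port

for i in range(50):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
    s.settimeout(1.1)  # 1.0 sec times out here; 1.1 sec makes the test pass
    start = time.time()
    try:
        s.connect(addr)
        print("i=%d: connect() took %.2f s" % (i, time.time() - start))
    except socket.timeout:
        print("i=%d: connect() timed out after %.2f s"
              % (i, time.time() - start))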
Looking more at what exactly is happening at the TCP layer, this is
likely related to the server behavior: listen() backlog is set to 10,
and once there are 10 parallel connections, the last one is
immediately closed before reading anything. Somehow this combination
with this kernel patch applied makes one of the connect() calls take
that surprisingly long 1.02 or so seconds.
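A rough sketch of that server-side pattern (not the actual test code;
the address is a placeholder, and the backlog and close-after-10
behavior are as described above):

import select
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 49152))   # placeholder address/port
srv.listen(10)                   # backlog of 10, as in the test setup
srv.setblocking(False)           # the real server uses non-blocking operations

accepted = []
while True:
    select.select([srv], [], []) # wait until a connection is pending
    conn, _ = srv.accept()
    if len(accepted) >= 9:
        conn.close()             # 10th and following sockets closed immediately
    else:
        accepted.append(conn)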
Looking at a sniffer capture (*), the TCP three-way handshake goes
through fine for the first 14 connect() calls, but the 15th one does
not get a response to its SYN. This SYN is frame 47 in the capture
file, with srcport == 60802. There is no SYN,ACK for it. The roughly
one second of unexpected connect() time comes from this, i.e., the
connection attempt completes only after the client side retransmits
the SYN (frame #77) a second later and the server side replies with
RST,ACK (frame #78).
 2  0.039135 127.0.0.1 -> 127.0.0.1 TCP 74 60772 > 49152 [SYN] Seq=0
    Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937755 TSecr=0 WS=64
 3  0.039146 127.0.0.1 -> 127.0.0.1 TCP 74 49152 > 60772 [SYN, ACK] Seq=0
    Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937755
    TSecr=4294937755 WS=64
 4  0.039156 127.0.0.1 -> 127.0.0.1 TCP 66 60772 > 49152 [ACK] Seq=1
    Ack=1 Win=43712 Len=0 TSval=4294937755 TSecr=4294937755
...
47  0.042559 127.0.0.1 -> 127.0.0.1 TCP 74 60802 > 49152 [SYN] Seq=0
    Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937756 TSecr=0 WS=64
77  1.119943 127.0.0.1 -> 127.0.0.1 TCP 74 [TCP Retransmission] 60802 >
    49152 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1
    TSval=4294937864 TSecr=0 WS=64
78  1.119953 127.0.0.1 -> 127.0.0.1 TCP 54 49152 > 60802 [RST, ACK] Seq=1
    Ack=1 Win=0 Len=0
So it looks like the issue is one of the SYN,ACK frames getting
completely lost.
(*) http://w1.fi/p/tcp-lo.pcap
> Can you provide a simple test case or explain in more detail how you
> run your test? I would like to reproduce the issue here.
I'm not sure how to make a simple test case for this, taking into
account that this seems to have some unknown timing dependencies. What
the test case ends up doing is a quick loop of 20 TCP socket() +
connect() calls against a server side that does listen() with backlog
10, operates non-blocking, and close()s the 10th and following
incoming sockets immediately. Whether a simple program doing that,
without all of the python and wpa_supplicant processing providing
suitable timing, would reproduce this is unclear; a rough attempt is
sketched below.
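Something along these lines might be a starting point (an untested
sketch only; the port is arbitrary, and the real server is
non-blocking while this one blocks in accept() to keep things short):

import socket
import threading
import time

ADDR = ("127.0.0.1", 49152)  # arbitrary placeholder port

def server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(ADDR)
    srv.listen(10)              # backlog of 10, as in the real test
    accepted = []
    while True:
        conn, _ = srv.accept()
        if len(accepted) >= 9:
            conn.close()        # 10th and following sockets closed immediately
        else:
            accepted.append(conn)

t = threading.Thread(target=server)
t.daemon = True
t.start()
time.sleep(0.1)                 # let the server reach listen()

socks = {}
for i in range(20):
    socks[i] = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                             socket.IPPROTO_TCP)
    socks[i].settimeout(5)      # long enough to see the ~1 s SYN retransmit
    start = time.time()
    try:
        socks[i].connect(ADDR)
    except socket.error as e:
        print("i=%d: connect() failed after %.2f s: %s"
              % (i, time.time() - start, e))
        continue
    delta = time.time() - start
    if delta > 0.5:
        print("i=%d: connect() took %.2f s" % (i, delta))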
These files describe the test setup that I'm using to run this:
http://w1.fi/cgit/hostap/plain/tests/hwsim/README
http://w1.fi/cgit/hostap/plain/tests/hwsim/vm/README
with the actual test case being executed with
tests/hwsim/vm$ ./vm-run.sh ap_wps_er_http_proto
- Jouni