Date:   Mon, 20 Nov 2017 21:58:24 -0800
From:   Steve Ibanez <sibanez@...nford.edu>
To:     Neal Cardwell <ncardwell@...gle.com>
Cc:     Daniel Borkmann <daniel@...earbox.net>,
        Netdev <netdev@...r.kernel.org>, Florian Westphal <fw@...len.de>,
        Mohammad Alizadeh <alizadeh@...il.mit.edu>,
        Lawrence Brakmo <brakmo@...com>,
        Yuchung Cheng <ycheng@...gle.com>,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: Linux ECN Handling

Hi Neal,

I tried your suggestion to disable tcp_tso_should_defer() and it does
indeed look like it prevents the host from entering timeouts. I'll
have to do a bit more digging to find where the packets are being
dropped. I've verified that the bottleneck link queue occupancy is at
about the configured marking threshold when the timeout occurs, so the
drops may be happening at the NIC interfaces or perhaps somewhere
unexpected in the switch.

I wonder if you can explain why the TLP doesn't fire when in the CWR
state? It seems like that might be worth having for cases like this.

Btw, thank you very much for all the help! It is greatly appreciated :)

Best,
-Steve


On Mon, Nov 20, 2017 at 7:01 AM, Neal Cardwell <ncardwell@...gle.com> wrote:
> Going back to one of your Oct 19 trace snapshots (attached), AFAICT at the
> time of the timeout there is actually almost 64KBytes  (352553398 + 1448 -
> 352489686 = 65160) of unacknowledged data. So there really does seem to be a
> significant chunk of packets that were in-flight that were then declared
> lost.
>
> So here is a possibility: perhaps the combination of CWR+PRR plus
> tcp_tso_should_defer() means that PRR can keep the cwnd so tight that
> tcp_tso_should_defer() thinks we should wait for another ACK to send, and
> that ACK doesn't come. Breaking it down, the potential sequence would be:
>
> (1) tcp_write_xmit() does not send, because the CWR behavior, using PRR,
> does not leave enough cwnd for tcp_tso_should_defer() to think we should
> send (PRR was originally designed for recovery, which did not have TSO
> deferral)
>
> (2) TLP does not fire, because we are in state CWR, not Open
>
> (3) The only remaining option is an RTO, which fires.
>
> In other words, the possibility is that, at the time of the stall, the cwnd
> is reasonably high, but tcp_packets_in_flight() is also quite high, so
> either there is (a) literally no unused cwnd left ( tcp_packets_in_flight()
> == cwnd), or (b) some mechanism like tcp_tso_should_defer() is deciding that
> there is not enough available cwnd for it to make sense to chop off a
> fraction of a TSO skb to send now.
>
> One way to test that conjecture would be to disable tcp_tso_should_defer()
> by adding a:
>
>    goto send_now;
>
> at the top of tcp_tso_should_defer().
>
> If that doesn't prevent the freezes then I would recommend adding printks or
> other instrumentation to  tcp_write_xmit() to log:
>
> - time
> - ca_state
> - cwnd
> - ssthresh
> - tcp_packets_in_flight()
> - the reason for breaking out of the tcp_write_xmit() loop (tso deferral, no
> packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
>
> cheers,
> neal
>
>
>
> On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez <sibanez@...nford.edu> wrote:
>>
>> Hi Folks,
>>
>> I wanted to check back in on this for another update and to solicit
>> some more suggestions. I did a bit more digging to try and isolate the
>> problem.
>>
>> As I explained earlier, the log generated by tcp_probe indicates that
>> the snd_cwnd is set to 1 just before the end host receives an ECN
>> marked ACK and unexpectedly enters a timeout (
>> https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
>> I was trying to track down where this is happening, but the only place
>> I could find that might be setting the snd_cwnd to 1 is in the
>> tcp_enter_loss() function. I inserted a printk() call in this function
>> to see when it is being invoked and it looks like it is only called by
>> the tcp_retransmit_timer() function after the timer expires.
>>
>> I decided to try recording the snd_cwnd, ss-thresh, and icsk_ca_state
>> inside the tcp_fastretrans_alert() function whenever it processes an
>> ECN marked ACK (
>> https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
>> This plot also shows when the tcp_retransmit_timer() and
>> tcp_enter_loss() functions are invoked (red and purple dots
>> respectively). And I see that the ACK state machine is always either
>> in the TCP_CA_Open or TCP_CA_CWR state whenever the
>> tcp_fastretrans_alert() function processes ECN marked ACKs (
>> https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
>> So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
>> Neal suggested) just before entering a timeout. Any suggestions here?
>>
>> To do a bit of profiling of the tcp_dctcp code, I added support to
>> tcp_probe for recording the DCTCP alpha parameter. I see that alpha
>> oscillates around 0.1 once the flow rates have converged, it goes to
>> zero when the other host enters a timeout, and I don't see any
>> unexpected behavior just before the timeout (
>> https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
>>
>> So I haven't had much luck yet trying to track down where the problem
>> is. If you have any suggestions that would help me to focus my search
>> efforts, I would appreciate the comments.
>>
>> Thanks!
>> -Steve
>>
>>
>> On Mon, Nov 6, 2017 at 3:31 PM, Steve Ibanez <sibanez@...nford.edu> wrote:
>> > Hi Daniel,
>> >
>> > Apologies for the delay. I tried out Neal's suggestion to printk the
>> > cwnd and ss-thresh in the tcp_init_cwnd_reduction(),
>> > tcp_cwnd_reduction(), and tcp_end_cwnd_reduction() functions in
>> > tcp_input.c. From what I can tell, none of these functions are setting
>> > the cwnd to 0.
>> >
>> > Here is the kernel log with the cwnd and ss-thresh print statements:
>> > https://drive.google.com/open?id=1LEWIkz64NuZN3yuDpBOAXbUfJfiju55O
>> > And here is the corresponding packet trace at this end host:
>> > https://drive.google.com/open?id=1qf4cSW3wzsiwPngcYpZY-AoBspuqONLH
>> > (The kernel log buffer was not large enough to capture the full
>> > 3-second experiment, so there's only about a second of data for the
>> > log buffer and about 3 seconds for the packet trace.)
>> >
>> > Here is a plot of the cwnd and ss-thresh from each of the three
>> > functions:
>> >     - tcp_init_cwnd_reduction:
>> > https://drive.google.com/open?id=1KOEXG2ISJQMi9c6KyPOQ6rpVUVsQwtWU
>> >     - tcp_cwnd_reduction:
>> > https://drive.google.com/open?id=1awoPWC3hi4CGZt7HyuI4aAaLG1LPLwJE
>> >     - tcp_end_cwnd_reduction:
>> > https://drive.google.com/open?id=1G7XUSnkX8tP7Z5XdY2O97OWj6jguQHO5
>> >
>> > Here is a plot of the measured flow rates:
>> > https://drive.google.com/open?id=1XwmGve10J4qa1nPE3LustK8NbvhZscac
>> >
>> > The kernel log and packet trace data were collected on the 10.0.0.3
>> > host. The cwnd and ss-thresh plots are from the final second or so of
>> > the experiment and they show two timeout events. In the first event,
>> > the 10.0.0.1 host times out, allowing 10.0.0.3 to increase its cwnd.
>> > And in the second event, the 10.0.0.3 host times out, causing the cwnd
>> > to decrease from ~100 to ~10. The cwnd samples from tcp_probe (
>> > https://drive.google.com/open?id=1QCuPspLqbGoA68MKTaAh7rx2wCv3Cr_e )
>> > indicate that the cwnd is 1 MSS just before the timeout event, but I
>> > don't see that in the data collected from the tcp_*_cwnd_reduction
>> > functions.
>> >
>> > Here is a diff of the changes that I applied to the tcp_input.c file:
>> > https://drive.google.com/open?id=1k5x3AkfTr3tJhohSIcmQp-3g2yTVNMWm
>> >
>> > Are there other places in the code that you would suggest I check for
>> > how the cwnd and ss-thresh are changing?
>> >
>> > Thanks,
>> > -Steve
>> >
>> >
>> > On Mon, Nov 6, 2017 at 6:08 AM, Daniel Borkmann <daniel@...earbox.net>
>> > wrote:
>> >> On 10/24/2017 03:11 AM, Neal Cardwell wrote:
>> >>>
>> >>> On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@...nford.edu>
>> >>> wrote:
>> >>>>
>> >>>> Hi All,
>> >>>>
>> >>>> I upgraded the kernel on all of our machines to Linux
>> >>>> 4.13.8-041308-lowlatency. However, I'm still observing the same
>> >>>> behavior where the source enters a timeout when the CWND=1MSS and it
>> >>>> receives ECN marks.
>> >>>>
>> >>>> Here are the measured flow rates:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>> >>>>
>> >>>> Here are snapshots of the packet traces at the sources when they both
>> >>>> enter a timeout at t=1.6sec:
>> >>>>
>> >>>> 10.0.0.1 timeout event:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>> >>>>
>> >>>> 10.0.0.3 timeout event:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>> >>>>
>> >>>> Both still essentially follow the same sequence of events that I
>> >>>> mentioned earlier:
>> >>>> (1) receives an ACK for byte XYZ with the ECN flag set
>> >>>> (2) stops sending for RTO_min=300ms
>> >>>> (3) sends a retransmission for byte XYZ
>> >>>>
>> >>>> The cwnd samples reported by tcp_probe still indicate that the
>> >>>> sources
>> >>>> are reacting to the ECN marks more than once per window. Here are the
>> >>>> cwnd samples at the same timeout event mentioned above:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>> >>>>
>> >>>> Let me know if there is anything else you think I should try.
>> >>>
>> >>>
>> >>> Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
>> >>> scenario. Would you be able to add printk statements in
>> >>> tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
>> >>> tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
>> >>> tp->snd_ssthresh?
>> >>>
>> >>> Based on the output you may be able to figure out where cwnd is being
>> >>> set to zero. If not, could you please post the printk output and
>> >>> tcpdump traces (.pcap, headers-only is fine) from your tests?
>> >>
>> >>
>> >> Hi Steve, do you have any updates on your debugging?
>> >>
>> >>> thanks,
>> >>> neal
>> >>>
>> >>
>>
>
