netdev - Re: Linux ECN Handling

Message-ID: <CADVnQymmMeo=cemUPQCzHpNvFPKCaqY1+VPchmui+HWuCL7CuA@mail.gmail.com>
Date:   Mon, 23 Oct 2017 21:11:38 -0400
From:   Neal Cardwell <ncardwell@...gle.com>
To:     Steve Ibanez <sibanez@...nford.edu>
Cc:     Netdev <netdev@...r.kernel.org>, Florian Westphal <fw@...len.de>,
        Mohammad Alizadeh <alizadeh@...il.mit.edu>
Subject: Re: Linux ECN Handling

On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@...nford.edu> wrote:
> Hi All,
>
> I upgraded the kernel on all of our machines to Linux
> 4.13.8-041308-lowlatency. However, I'm still observing the same
> behavior where the source enters a timeout when the CWND=1MSS and it
> receives ECN marks.
>
> Here are the measured flow rates:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>
> Here are snapshots of the packet traces at the sources when they both
> enter a timeout at t=1.6sec:
>
> 10.0.0.1 timeout event:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>
> 10.0.0.3 timeout event:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>
> Both still essentially follow the same sequence of events that I
> mentioned earlier:
> (1) receives an ACK for byte XYZ with the ECN flag set
> (2) stops sending for RTO_min=300ms
> (3) sends a retransmission for byte XYZ
>
> The cwnd samples reported by tcp_probe still indicate that the sources
> are reacting to the ECN marks more than once per window. Here are the
> cwnd samples at the same timeout event mentioned above:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>
> Let me know if there is anything else you think I should try.

Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
scenario. Would you be able to add printk statements in
tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
tp->snd_ssthresh?
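
For illustration, here is a rough userspace sketch of the kind of log line
I have in mind. The struct flow_snapshot and format_cwnd_debug() names are
hypothetical and exist only for this example; in the kernel you would
instead call printk() inside each of the three functions and read the
address and port from inet_sk(sk) and the cwnd/ssthresh values from
tcp_sk(sk). Treat this as a sketch of the output format under those
assumptions, not as a patch.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <arpa/inet.h>

/* Hypothetical userspace stand-in for the per-socket fields of interest.
 * saddr and sport are kept in network byte order, as they are in the
 * kernel's inet_sock. */
struct flow_snapshot {
	uint32_t saddr;        /* local IPv4 address, network byte order */
	uint16_t sport;        /* local port, network byte order */
	uint32_t snd_cwnd;     /* congestion window, in packets */
	uint32_t snd_ssthresh; /* slow start threshold, in packets */
};

/* Format one debug line: "<function>: <ip>:<port> cwnd=<n> ssthresh=<n>".
 * Returns the number of characters written (as snprintf does), or -1 if
 * the address could not be converted. */
int format_cwnd_debug(char *buf, size_t len, const char *where,
		      const struct flow_snapshot *f)
{
	char ip[INET_ADDRSTRLEN];
	struct in_addr addr = { .s_addr = f->saddr };

	if (!inet_ntop(AF_INET, &addr, ip, sizeof(ip)))
		return -1;

	return snprintf(buf, len, "%s: %s:%u cwnd=%u ssthresh=%u",
			where, ip, (unsigned int)ntohs(f->sport),
			(unsigned int)f->snd_cwnd,
			(unsigned int)f->snd_ssthresh);
}
```

Tagging each line with the name of the calling function should make it
easier to see which of the three functions last touched cwnd before it
reached zero.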

Based on the output you may be able to figure out where cwnd is being
set to zero. If not, could you please post the printk output and
tcpdump traces (.pcap, headers-only is fine) from your tests?

thanks,
neal