netdev - Re: Linux ECN Handling

Message-ID: <CACJspmKjAr+q9cFVssXVxWQMCUWe3TNYO77m0nQwzQK4hTCOzA@mail.gmail.com>
Date:   Mon, 23 Oct 2017 15:15:36 -0700
From:   Steve Ibanez <sibanez@...nford.edu>
To:     netdev@...r.kernel.org
Cc:     Florian Westphal <fw@...len.de>,
        Mohammad Alizadeh <alizadeh@...il.mit.edu>
Subject: Re: Linux ECN Handling

Hi All,

I upgraded the kernel on all of our machines to Linux
4.13.8-041308-lowlatency. However, I'm still observing the same
behavior: the source enters a timeout when its CWND is 1 MSS and it
receives ECN marks.

Here are the measured flow rates:
<https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>

Here are snapshots of the packet traces at the sources when they both
enter a timeout at t=1.6sec:

10.0.0.1 timeout event:
<https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>

10.0.0.3 timeout event:
<https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>

Both still essentially follow the same sequence of events that I
mentioned earlier:
(1) receives an ACK for byte XYZ with the ECN flag set
(2) stops sending for RTO_min=300ms
(3) sends a retransmission for byte XYZ
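
For anyone who wants to check their own traces for the three-step
pattern above, here is a minimal sketch. It assumes the trace has
already been parsed into simple per-packet records; the Pkt record and
the find_ecn_stalls name are hypothetical and not part of tcpdump or
tcp_probe. The 300 ms threshold is the RTO_min value mentioned above.

```python
from typing import List, NamedTuple, Tuple

RTO_MIN = 0.3  # seconds; the RTO_min value reported in this thread


class Pkt(NamedTuple):
    """Hypothetical, already-parsed packet record as seen at the source host."""
    t: float        # capture timestamp in seconds
    outgoing: bool  # True if the source host sent this packet
    seq: int        # sequence number (meaningful for outgoing data)
    ack: int        # acknowledgment number (meaningful for incoming ACKs)
    ece: bool       # TCP ECE flag set


def find_ecn_stalls(pkts: List[Pkt],
                    min_gap: float = RTO_MIN) -> List[Tuple[float, float, int]]:
    """Return (ack_time, retransmit_time, byte) for each stall that
    directly follows an ECE-marked ACK and ends with a retransmission
    of the byte that ACK asked for."""
    stalls = []
    sent = set()  # sequence numbers the source has already transmitted
    for i, cur in enumerate(pkts):
        if i > 0:
            prev = pkts[i - 1]
            marked_ack = (not prev.outgoing) and prev.ece       # step (1)
            silent = (cur.t - prev.t) >= min_gap                # step (2)
            retransmit = (cur.outgoing and cur.seq == prev.ack  # step (3)
                          and cur.seq in sent)
            if marked_ack and silent and retransmit:
                stalls.append((prev.t, cur.t, prev.ack))
        if cur.outgoing:
            sent.add(cur.seq)
    return stalls
```

The sketch only compares consecutive packets, so it reports a stall
only when nothing at all is captured between the marked ACK and the
retransmission.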

The cwnd samples reported by tcp_probe still indicate that the sources
are reacting to the ECN marks more than once per window. Here are the
cwnd samples at the same timeout event mentioned above:
<https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
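
One rough way to quantify "more than once per window" from cwnd
samples is sketched below. It counts how many cwnd decreases fall
within any interval of one SRTT. Two assumptions to flag: one SRTT is
used as a stand-in for one window of data (RFC 3168 defines the limit
in terms of data in flight, not time), and every decrease between
consecutive samples is counted, whether or not ECN caused it. The
function name and the (time, cwnd) sample format are hypothetical, not
the tcp_probe output format.

```python
from typing import List, Tuple


def max_reductions_per_window(samples: List[Tuple[float, int]],
                              srtt: float) -> int:
    """samples: (time, cwnd) pairs sorted by time, with time in the
    same unit as srtt. Returns the largest number of cwnd decreases
    observed within any interval of length srtt."""
    # Timestamps at which cwnd is lower than in the previous sample.
    drops = [t for (_, prev_cwnd), (t, cwnd) in zip(samples, samples[1:])
             if cwnd < prev_cwnd]
    best = 0
    start = 0
    # Sliding window over the drop timestamps.
    for end in range(len(drops)):
        while drops[end] - drops[start] > srtt:
            start += 1
        best = max(best, end - start + 1)
    return best
```

A result greater than 1 would be consistent with the sender reducing
cwnd more than once per round trip, though it does not by itself show
which code path caused the reductions.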

Let me know if there is anything else you think I should try.

Thanks,
-Steve

On Thu, Oct 19, 2017 at 5:43 AM, Florian Westphal <fw@...len.de> wrote:
>
> [ full-quoting due to Cc fixups, adding netdev ]
>
> Steve Ibanez <sibanez@...nford.edu> wrote:
> > Hi Florian, Neal, and Daniel,
> >
> > I hope this email finds you well. My name is Stephen Ibanez and I'm a PhD
> > Student at Stanford currently working on a project with Mohammad Alizadeh,
> > Nick McKeown, and Lavanya Jose. We have been doing some experiments using
> > the linux DCTCP implementation and are trying to understand some strange
> > behavior that we are encountering. I'm contacting you three because I have
> > seen your names on some of the source files and recent commits in the linux
> > source tree. Hopefully you can help us out or put us in contact with the
> > right people?
> >
> > Here are some details about our servers:
> >
> >    - Distribution: Ubuntu 14.04 LTS
> >    - Kernel release: 4.4.0-75-generic
>
> Can you re-test with a more recent kernel such as 4.13.8?
>
> > *The experiment:*
> >
> > We use iperf3 to generate two DCTCP flows from different servers to a
> > common server, as shown in the diagram below. We measure the sending rate
> > of each flow, record the tcp_probe output, and run tcpdump on the
> > source host interfaces.
> >
> > [image: Inline image 6]
> >
> > *The problem:*
> >
> > Our rate measurements look like the one shown below; the flows often enter
> > timeouts. In this case, both flows hit a timeout at t=0.3.
> > [image: Inline image 2]
> >
> > When looking at the sequence of packets seen at the source host interfaces
> > around this timeout event this is what we see:
> >
> > *10.0.0.1 timeout event:*
> > [image: Inline image 3]
> >
> > *10.0.0.3 timeout event:*
> > [image: Inline image 4]
> >
> > In both cases, the source:
> > (1) receives an ACK for byte XYZ with the ECN flag set
> > (2) stops sending anything for RTO_min=300ms
> > (3) sends a retransmission for byte XYZ
> >
> > I have verified that this behavior is consistent across multiple experiment
> > runs. Here are the CWND samples for the 10.0.0.1 flow provided by tcp_probe
> > at the time of the timeout event:
> >
> > [image: Inline image 5]
> >
> > From what I can tell, tcp_probe logs a sample whenever a packet is
> > received. If this is true, then the CWND is 1 MSS when the source
> > receives the final ECN-marked ACK just before the timeout.
> >
> > *The conclusion:*
> >
> > We believe that there may be an issue with how the Linux kernel handles
> > ECN echoes. For DCTCP, if the CWND is 1 MSS and the end host is still
> > receiving ECN marks, then the CWND should remain at 1 MSS and the source
> > should *not* enter a timeout. This is because the switch can perform ECN
> > marking very aggressively, causing the source end host to receive many
> > redundant ECN echoes over a short period of time.
> >
> > Another potential issue is that, from the CWND plot above, it looks like
> > the end host may be reacting to congestion signals more than once per
> > window, which should not happen (section 5 of RFC 3168
> > <https://tools.ietf.org/html/rfc3168>). tcp_probe reports SRTT measurements
> > of about 400-500 us, and in the plot above the CWND is reduced 6 times
> > within this amount of time.
> >
> > We have not yet tracked down the code path in the kernel code that is
> > causing the behavior described above. Perhaps this is something that you
> > can help us with? We would love to hear your thoughts on this matter and
> > are happy to try other experiments that you suggest.
> >
> > Here is a link
> > <https://drive.google.com/file/d/0Bw-GEX7h5ufiYmpCV2VpOGEtQWs/view?usp=sharing>
> > to download the packet traces if you would like to take a look.
> > han-1_host.pcap is the trace from 10.0.0.1 and han-3_host.pcap is the trace
> > from 10.0.0.3.
> >
> > Looking forward to hearing from you!
> >
> > Best,
> > -Steve