netdev - Re: [PATCH net-next 0/4] net: allow setting congctl via routing table

Date:	Fri, 05 Dec 2014 22:03:33 +0100
From:	Hannes Frederic Sowa <hannes@...essinduktion.org>
To:	Dave Taht <dave.taht@...il.com>
Cc:	Daniel Borkmann <dborkman@...hat.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	Florian Westphal <fw@...len.de>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next 0/4] net: allow setting congctl via routing table

Hi Dave,

On Fr, 2014-12-05 at 11:05 -0800, Dave Taht wrote:
> On Fri, Dec 5, 2014 at 10:35 AM, Hannes Frederic Sowa
> <hannes@...essinduktion.org> wrote:
> > On Fr, 2014-12-05 at 08:35 -0800, Dave Taht wrote:
> >> On Fri, Dec 5, 2014 at 7:24 AM, Daniel Borkmann <dborkman@...hat.com> wrote:
> >> > This is the second part of our work and allows for setting the congestion
> >> > control algorithm via routing table. For details, please see individual
> >> > patches.
> >> >
> >> > Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
> >> >
> >> > Thanks!
> >> >
> >> > Daniel Borkmann (4):
> >> >   net: tcp: refactor reinitialization of congestion control
> >> >   net: tcp: add key management to congestion control
> >> >   net: tcp: add RTAX_CC_ALGO fib handling
> >> >   net: tcp: add per route congestion control
> >>
> >>
> >> Very interesting. Have you tried something other than dctcp here
> >> (e.g. westwood or lp?)
> >>
> >> Have you considered the case where the route changes underneath
> >> you from one device to another?
> >
> > Notice, there is no way the state of a tcp congestion control algorithm
> > can be converted to be used by a different one, so this would only
> > affect new tcp connections via this interface.
> 
> You are missing the point. If the route changes from a path that
> is DCTCP capable to one that is not, (say you fail over to a backup link)

I don't think today's datacenters are designed so that the backup path
performs worse than the primary link (e.g. different AQM settings). It
is much more important to allow, for example, connections to a database
server to select dctcp as CC while all connections going to the
internet use some "ordinary" tcp congestion algorithm.

> and flows persist, bad things will happen. DCTCP, in particular, depends
> upon a very specific AQM configuration on all the hops in the path, without that
> it can be very aggressive.

That's for sure.

> I do think it is feasible to convert from at least some of the
> core state from one tcp congestion control algorithm to another.

Hmm, I haven't looked into whether that is possible. It might be.

> >> Example, here I am routing everything through eth0, where I
> >> would want cubic, probably...
> >>
> >> root@...esha:~/git/tinc# ip route
> >> default via 172.26.16.1 dev eth0  proto babel onlink
> >> 69.181.216.0/22 via 172.26.16.1 dev eth0  proto babel onlink
> >> 169.254.0.0/16 dev eth0  scope link  metric 1000
> >> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
> >> 172.26.16.1 via 172.26.16.1 dev eth0  proto babel onlink
> >> 172.26.16.112 via 172.26.16.112 dev eth0  proto babel onlink
> >> 172.26.17.0/24 via 172.26.16.1 dev eth0  proto babel onlink
> >> 172.26.17.3 via 172.26.16.1 dev eth0  proto babel onlink
> >> 172.26.17.227 via 172.26.16.1 dev eth0  proto babel onlink
> >> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
> >> 192.168.7.2 via 172.26.16.112 dev eth0  proto babel onlink
> >>
> >> And I pull the plug, and everything flips over to wlan0,
> >> where I might want westwood (or something saner than
> >> that. It might be nice to have a per-device cc default
> >> algorithm...)
> >
> > Something like that might be possible with metrics and "via ... dev if0
> > metric xxx" routes, which will be cleaned up as soon as the interface
> > goes down and the fallback will be to a route with a different
> > congestion algorithm.
> 
> mmm... I do dynamic routing via various routing protocols, which
> generally don't bother with inserting more than one metric.

I totally understand, they might even remove the routes and re-add them,
thus losing the tcp cc property.
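
The metric-based fallback suggested earlier in this thread can be sketched
as a toy model. This is only an illustration under stated assumptions, not
the kernel's FIB logic: the Route class, the select_route() helper, the
metric values and the per-route "congctl" field are all hypothetical names
made up for this example. It shows that once the routes of a downed
interface are gone, new connections pick up the algorithm of the remaining
route, and that a daemon re-adding a route without the attribute would
lose it.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified route entry (not a kernel structure)."""
    prefix: str
    dev: str
    metric: int
    congctl: Optional[str] = None  # None: fall back to the system default


def select_route(routes: Iterable[Route], dst: str,
                 up_devs: Set[str]) -> Optional[Route]:
    """Pick the route for dst: longest prefix first, then lowest metric.

    Routes whose device is down are skipped, mimicking their removal
    when the interface goes away.
    """
    addr = ipaddress.ip_address(dst)
    candidates = [
        r for r in routes
        if r.dev in up_devs and addr in ipaddress.ip_network(r.prefix)
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.metric),
    )


# Assumed example table: wired default preferred, wireless as fallback.
ROUTES = [
    Route("0.0.0.0/0", "eth0", 100, "cubic"),
    Route("0.0.0.0/0", "wlan0", 200, "westwood"),
]

if __name__ == "__main__":
    for up in ({"eth0", "wlan0"}, {"wlan0"}):
        chosen = select_route(ROUTES, "69.181.216.10", up)
        print(sorted(up), "->", chosen.dev, chosen.congctl)
```

Only connections opened after the switch would see the new algorithm;
established flows keep whatever they started with.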

> While we are thinking through this, what happens with tunnels?

Tunnels should behave just like ordinary interfaces, but depending on how
they get routed, this might cause problems with DCTCP.

> This route in my network switches between interfaces and routes
> depending on which is best.
> 
> fde5:dfb9:df90:fff0::/64 dev vpn6  proto kernel  metric 256
> fde5:dfb9:df90:fff0::/60 via fde5:dfb9:df90:fff0::1 dev vpn6  metric 1024
> 
> 
> >> root@...esha:~/git/tinc# ip route
> >> default via 172.26.17.224 dev wlan0  proto babel onlink
> >> 69.181.216.0/22 via 172.26.17.224 dev wlan0  proto babel onlink
> >> 169.254.0.0/16 dev eth0  scope link  metric 1000
> >> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
> >> 172.26.16.1 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 172.26.16.112 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 172.26.17.0/24 via 172.26.17.224 dev wlan0  proto babel onlink
> >> 172.26.17.3 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 172.26.17.227 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
> >> 192.168.7.2 via 172.26.17.227 dev wlan0  proto babel onlink

Please note that this is an end-node-only feature. Normally, routers
don't do heavy tcp processing, so we did not consider using this
feature on a router. That's the same problem as with e.g.
tcp_quick_ack.

As soon as you have control over the application and it allows you to
bind to an interface via SO_BINDTODEVICE, you are able to select the
congestion control algorithm by using ip rule oif matching. But the
application could also choose the CC by itself, using the
'TCP_CONGESTION' setsockopt on a per-socket basis, if you have source
access.
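
For illustration, the per-socket variant might look roughly like the
following sketch. It assumes Linux and a Python build that exposes
socket.TCP_CONGESTION; the two helper names are made up for this example,
and "reno" is used only because it is normally available to unprivileged
processes.

```python
import socket

# Kernel limit for congestion control names (TCP_CA_NAME_MAX).
CA_NAME_MAX = 16


def get_congestion_control(sock: socket.socket) -> str:
    """Return the name of the congestion control algorithm of sock."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION,
                          CA_NAME_MAX)
    # The kernel pads the returned name with NUL bytes.
    return raw.split(b"\0", 1)[0].decode("ascii")


def set_congestion_control(sock: socket.socket, name: str) -> None:
    """Select a congestion control algorithm for this socket only.

    Raises OSError if the algorithm is unknown or not permitted for
    the calling process.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION,
                    name.encode("ascii"))


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("default:", get_congestion_control(s))
        set_congestion_control(s, "reno")
        print("now:", get_congestion_control(s))
```

This affects only the one socket, so it works independently of any
routing table entries.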

Bye,
Hannes

