Message-Id: <1480477952.3702850.803295033.367FD66D@webmail.messagingengine.com>
Date: Wed, 30 Nov 2016 04:52:32 +0100
From: Hannes Frederic Sowa <hannes@...essinduktion.org>
To: David Lebrun <david.lebrun@...ouvain.be>, netdev@...r.kernel.org
Subject: Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for
equal-cost multipath routing
Hi,
On Tue, Nov 29, 2016, at 18:15, David Lebrun wrote:
> When multiple nexthops are available for a given route, the routing
> engine chooses a nexthop by computing the flow hash through
> get_hash_from_flowi6 and by taking that value modulo the number of
> nexthops. The resulting value indexes the nexthop to select. This
> method causes issues when a new nexthop is added or one is removed
> (e.g. link failure). In that case, the number of nexthops changes and
> potentially all the flows get re-routed to another nexthop.
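
Restating that behaviour as a small userspace sketch, just to make sure
I read it correctly (function and variable names below are mine, not
the actual kernel code path):

#include <stdio.h>

static unsigned int pick_nexthop_modulo(unsigned int flow_hash,
					unsigned int nr_nexthops)
{
	return flow_hash % nr_nexthops;
}

int main(void)
{
	unsigned int hash = 0xdeadbeefu;

	/* Same flow hash, different nexthop count: the selected index
	 * changes, so the flow is re-routed even though its own
	 * nexthop never failed. */
	printf("N=3 -> nexthop %u\n", pick_nexthop_modulo(hash, 3));
	printf("N=4 -> nexthop %u\n", pick_nexthop_modulo(hash, 4));
	return 0;
}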
>
> This patch implements a consistent hash method to select the nexthop
> in case of ECMP. The idea is to generate K slices (or intervals) for
> each route with multiple nexthops. The nexthops are randomly assigned
> to those slices, in a uniform manner. The number K is configurable
> through a sysctl net.ipv6.route.ecmp_slices and is always a power of
> 2. To select the nexthop, the algorithm takes the flow hash and
> computes an index which is the flow hash modulo K. As K = 2^x, the
> modulo can be computed using a simple binary AND operation
> (idx = hash & (K - 1)). The resulting index references the selected
> nexthop. The lookup time complexity is thus O(1).
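
If I understand the proposed lookup correctly, it boils down to
something like the following userspace sketch (structure and function
names are made up by me, not taken from the patch):

#include <stdint.h>

struct ecmp_slices {
	unsigned int k;		/* number of slices, always a power of 2 */
	uint8_t *nh_idx;	/* k entries, each holding a nexthop index */
};

/* k is a power of 2, so hash % k == hash & (k - 1): a single masked
 * array read, i.e. O(1) regardless of the number of nexthops. */
unsigned int ecmp_select(const struct ecmp_slices *s, uint32_t flow_hash)
{
	return s->nh_idx[flow_hash & (s->k - 1)];
}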
>
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly and
> uniformly from the other nexthops. When a nexthop is removed, the orphan
> slices are randomly reassigned to the other nexthops.
>
> The number of slices for a route also fixes the maximum number of
> nexthops possible for that route.
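
And my rough model of the rebalancing, again with made-up names and
plain rand() standing in for whatever source of randomness the patch
actually uses:

#include <stdint.h>
#include <stdlib.h>

/* Add nexthop number new_nh (0-based, so the table already maps to
 * new_nh >= 1 existing nexthops): steal roughly k / (new_nh + 1), i.e.
 * K/N, randomly chosen slices from the existing nexthops. */
void ecmp_slices_add(uint8_t *nh_idx, unsigned int k, unsigned int new_nh)
{
	unsigned int steal = k / (new_nh + 1);

	while (steal) {
		unsigned int i = (unsigned int)rand() & (k - 1);

		if (nh_idx[i] != new_nh) {
			nh_idx[i] = (uint8_t)new_nh;
			steal--;
		}
	}
}

/* Remove nexthop dead_nh: its orphaned slices are scattered uniformly
 * at random over the nr_left surviving nexthops (assumed renumbered
 * 0 .. nr_left - 1; renumbering of the surviving entries is elided). */
void ecmp_slices_del(uint8_t *nh_idx, unsigned int k,
		     unsigned int dead_nh, unsigned int nr_left)
{
	unsigned int i;

	for (i = 0; i < k; i++)
		if (nh_idx[i] == dead_nh)
			nh_idx[i] = (uint8_t)(rand() % nr_left);
}

With that model only about K/N slices change owner per add or remove,
so most flows keep their nexthop, which is the point of the exercise.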
In the worst case (x == 31) this causes 2GB (order-19) allocations to
happen in GFP_ATOMIC context (due to the write lock), which could make
routing table updates fail because of memory fragmentation. Are you
sure an upper limit of 31 is reasonable? I would very much prefer an
upper limit of 25 or less for x, so the allocation stays within the
bounds of the slab allocators (which is still a lot and will probably
still cause allocation failures!).
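
To spell out where those numbers come from, assuming one byte per slice
entry and 4 KiB pages (both are my assumptions, not something the patch
states):

#include <stdio.h>

int main(void)
{
	unsigned int x = 31;			/* proposed sysctl maximum */
	unsigned long long bytes = 1ULL << x;	/* one byte per slice entry */
	unsigned int order = 0;

	/* smallest page order whose block of 4 KiB pages covers the array */
	while ((4096ULL << order) < bytes)
		order++;

	printf("x=%u -> %llu bytes (2 GiB), page order %u\n",
	       x, bytes, order);
	return 0;
}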
Unfortunately, because of the nature of the sysctl, you can't really
give these allocations a dedicated cache. :/
Also, by design this should all be RCU one day, and having that much
data outstanding during routing table mutation worries me a bit.
I am a fan of consistent hashing, but I am not so sure whether it
belongs in a generic ECMP implementation or in its own ipvs or
netfilter module, where you know specifically how much memory to burn
for it.
Also, please convert the sysctl to a netlink attribute if you pursue
this: if I change the sysctl while my quagga is hammering the routing
table, I would like to know which nodes allocate what amount of
memory.
Bye,
Hannes