linux-kernel - Re: [RFC 2/4] net/fib: Provide fib_balance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Thu, 4 Apr 2019 19:31:50 +0100
From:   Dmitry Safonov <dima@...sta.com>
To:     Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        linux-kernel@...r.kernel.org
Cc:     Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        David Ahern <dsahern@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Ido Schimmel <idosch@...lanox.com>, netdev@...r.kernel.org,
        linux-doc@...r.kernel.org, Jonathan Corbet <corbet@....net>
Subject: Re: [RFC 2/4] net/fib: Provide fib_balance_budget sysctl

On 4/1/19 7:09 PM, Alexander Duyck wrote:
> On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
>> Unfortunately, MAX_WORK at this moment is broken: during
>> trie_rebalance() climbing upwards resize() is being called for each
>> tnode. Which makes the limit useless the bigger trie gets: at each level
>> there are 10 attempts to balance tnode and childs, resulting in
>> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
>> the level where we changed the trie originally (adding/removing
>> alias/route).
>>
>> Which results in the following price of removing one route under big
>> trie (similar for some other single routes that results in reallocating
>> tnodes by hitting the threshold limits for halve/inflate resize):
> 
> The info below doesn't really tell us the price. It might be useful if
> you could just time the removal of that one route and provide that
> information.

Yes, well, that was an attempt to point that MAX_WORK in the current
state doesn't limit much complexity to balance trie.

So, most of the time (up to a couple of seconds) is spent on
synchronise_rcu() (which was addressed by 4/4 and a patch from David
Ahern in -next).

But I thought fixing max_work is also worth attention, regardless the
fix for synchronising rcu. And playing with the sysctl limit that is
provided by the patch, I've found out that even with 4/4 applied one can
tune this sysctl knob the way that results in dividing the work between
consecutive route manipulations. So that on a 4-core switch the time
spent to add a route in maximum was more than a second and adjusting
work limit can bring it down to 300msec in max. Though, the total time
spent to add the whole route table was the same.

I will provide more details for v2.

> 
>> Before:
>>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>>    Main:
>>            Aver depth:     1.99
>>            Max depth:      2
>>            Leaves:         77825
>>            Prefixes:       77825
>>            Internal nodes: 77
>>              9: 1  10: 76
>>            Pointers: 78336
>>    Null ptrs: 435
>>    Total size: 7912  kB
>>    Local:
>>    [omitted]
>>
>> After:
>>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>>    Main:
>>            Aver depth:     3.00
>>            Max depth:      3
>>            Leaves:         77824
>>            Prefixes:       77824
>>            Internal nodes: 20491
>>              1: 2048  2: 18432  6: 1  11: 10
>>            Pointers: 98368
>>    Null ptrs: 54
>>    Total size: 8865  kB
>>    Local:
>>    [omitted]
>>
> 
> Is this before/after the patch, or just before/after the removal of the
> one route? I assume you are are just talking about the one route since
> the number of prefixes has dropped by 1.

Yes, just for removing one route. Should have mentioned that explicitly,
sorry.

> 
>> Provide a sysctl to control amount of pending balancing work.
>> (by default unlimited as it was)
>>
>> Cc: linux-doc@...r.kernel.org
>> Cc: Jonathan Corbet <corbet@....net>
>> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
>> into inflate/halve")
>> Signed-off-by: Dmitry Safonov <dima@...sta.com>
>> ---
>>  Documentation/networking/ip-sysctl.txt |  6 +++
>>  include/net/ip.h                       |  1 +
>>  net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
>>  net/ipv4/sysctl_net_ipv4.c             |  7 +++
>>  4 files changed, 49 insertions(+), 25 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index acdfb5d2bcaa..fb71dacff4dd 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
>>  	0 - Layer 3
>>  	1 - Layer 4
>>  
>> +fib_balance_budget - UNSIGNED INTEGER
>> +	Limits the number of resize attempts during balancing fib trie
>> +	on adding/removing new routes.
>> +	Possible values:
>> +	Default: UINT_MAX (0xFFFFFFFF)
>> +
> 
> So I am not really a fan of the use of UNSIGNED_INTEGER here. I would
> much rather see this be a signed integer and the test be for the value
> going negative instead of having to test in so many places if the value
> is 0.

Ok, sounds good to me - will change the type for v2.

[..]
>> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>>  			break;
>>  		}
>>  
>> -		max_work--;
>> +		(*budget)--;
> 
> One thing we might explore here would be to look at pulling out the
> check of budget at the start of the loop and instead combine the
> decrement with the check for < 0 here so we have something like:
> 		if (--*budget < 0)
> 			break;
> 
> That way what should happen is that we will achieve maximum depth in
> any resize and make at least one pass optimizing from the bottom up
> instead of the top down.

Yeah, sounds reasonable - will do this way and retest.

> 
>> +		inflated = true;
>>  		tn = get_child(tp, cindex);
>>  	}
>>  
>>  	/* update parent in case inflate failed */
>>  	tp = node_parent(tn);
>>  
>> -	/* Return if at least one inflate is run */
>> -	if (max_work != MAX_WORK)
>> +	/* Return if at least one inflate is run:
>> +	 * microoptimization to not recalculate thresholds
>> +	 */
>> +	if (inflated)
>>  		return tp;
> 
> An alternative to having to add another variable would be to look at
> switching the loop to a combination of an if and a do/while like so:
> 	if (should_inflate(tp, tn)) {
> 		do {
> 			...
> 		} while (should_inflate(tp, tn));
> 	
> 		/* update parent in case inflate failed */
> 		return node_parent(tn);
> 	}

That's neat! Will drop the variable.

> 		
> 
>>  	/* Halve as long as the number of empty children in this
>>  	 * node is above threshold.
>>  	 */
>> -	while (should_halve(tp, tn) && max_work) {
>> -		tp = halve(t, tn);
>> +	while (should_halve(tp, tn) && *budget) {
>> +		tp = halve(t, tn, budget);
>>  		if (!tp) {
>>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>>  			this_cpu_inc(stats->resize_node_skipped);
>> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>>  			break;
>>  		}
>>  
>> -		max_work--;
>> +		(*budget)--;
> 
> Same here. It might make sense to pull the budget check out of the
> start of the loop and move it here so that it becomes a depth first
> optimization instead of being a shallow one.

Makes sense, will rework.

> 
>>  		tn = get_child(tp, cindex);
>>  	}
>>  
>> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
>>  
>>  static void trie_rebalance(struct trie *t, struct key_vector *tn)
>>  {
>> -	while (!IS_TRIE(tn))
>> -		tn = resize(t, tn);
>> +	unsigned int budget = fib_balance_budget;
>> +
>> +	while (budget && !IS_TRIE(tn))
>> +		tn = resize(t, tn, &budget);
> 
> I would not have the budget check here. At a minimum we should be
> replacing trie nodes with leaves in the case that we are down to one
> leaf node in the trie.

Yeah, I see.

> 
>>  }
>>  
>>  static int fib_insert_node(struct trie *t, struct key_vector *tp,
>> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
>>  void fib_table_flush_external(struct fib_table *tb)
>>  {
>>  	struct trie *t = (struct trie *)tb->tb_data;
>> +	unsigned int budget = fib_balance_budget;
>>  	struct key_vector *pn = t->kv;
>>  	unsigned long cindex = 1;
>>  	struct hlist_node *tmp;
>> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
>>  				update_suffix(pn);
>>  
>>  			/* resize completed node */
>> -			pn = resize(t, pn);
>> +			pn = resize(t, pn, &budget);
>>  			cindex = get_index(pkey, pn);
>>  
>>  			continue;
> 
> So we would probably want a completely different budget here, or at
> least we would want the budget to be multiplied by the leaves we are
> removing. The problem is we aren't pulling single addresses, but are
> literally splitting the tree into two separate tries. The same budget
> isn't going to work for both cases.

Which will multiply the work (and the time spent).
Hmm, well, I guess we can multiply here and relax rtnl_mutex pressure if
needed by introducing fib_lock. Does that sounds good?

> 
>> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
>>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>>  {
>>  	struct trie *t = (struct trie *)tb->tb_data;
>> +	unsigned int budget = fib_balance_budget;
>>  	struct key_vector *pn = t->kv;
>>  	unsigned long cindex = 1;
>>  	struct hlist_node *tmp;
>> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>>  				update_suffix(pn);
>>  
>>  			/* resize completed node */
>> -			pn = resize(t, pn);
>> +			pn = resize(t, pn, &budget);
>>  			cindex = get_index(pkey, pn);
>>  
>>  			continue;
> 
> Similar issues here, though I don't know what the typical amount of
> change is we expect to see from a fib_table_flush call versus something
> like the fib_table_flush_external call.

Yes. Well, I was also curious if it's possible to avoid resize() call
under (flush_all == true)?
So, that exiting a namespace wouldn't result in rebalancing a possibly
huge routing trie.

Thanks for your notes and review,
          Dmitry