Date:	Mon, 6 Oct 2008 16:54:22 -0400
From:	Neil Horman <nhorman@...driver.com>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Bill Fink <billfink@...dspring.com>,
	David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
	kuznet@....inr.ac.ru, pekkas@...core.fi, jmorris@...ei.org,
	yoshfuji@...ux-ipv6.org, kaber@...sh.net,
	Evgeniy Polyakov <johnpol@....mipt.ru>
Subject: Re: [PATCH] net: implement emergency route cache rebuilds when
	gc_elasticity is exceeded

On Mon, Oct 06, 2008 at 12:49:33PM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
>> Hey all-
>> 	So, I've been doing some testing here with this patch, and am
>> comfortable that the sd estimation is working reasonably well.  For a hash
>> table with an average chain length of 1, it computes the standard deviation
>> to be 2, which gives us a max chain length of 9 (4*sd + avg), and it manages
>> to do that in about 7 jiffies over about 524000 hash buckets.  I'm
>> reasonably pleased with that speed, and after thinking about it, I like this
>> implementation somewhat better, as it doesn't create a window in which
>> chains can be artificially overrun (until the next gc round) (although I'm
>> happy to hear arguments against my implementation).  Anyway, here it is;
>> comments and thoughts welcome!
>>
>> Thanks & Regards
>> Neil
>>
>> Signed-off-by: Neil Horman <nhorman@...driver.com>
>>
>>
>>  route.c |  121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 118 insertions(+), 3 deletions(-)
>>
>>
>> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> index 6ee5354..4f8c5b5 100644
>> --- a/net/ipv4/route.c
>> +++ b/net/ipv4/route.c
>> @@ -145,6 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
>>  static void		 ipv4_link_failure(struct sk_buff *skb);
>>  static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
>>  static int rt_garbage_collect(struct dst_ops *ops);
>> +static void rt_emergency_hash_rebuild(struct net *net);
>>  
>>  static struct dst_ops ipv4_dst_ops = {
>> @@ -200,7 +201,14 @@ const __u8 ip_tos2prio[16] = {
>>   struct rt_hash_bucket {
>>  	struct rtable	*chain;
>> +	atomic_t	chain_length;
>>  };
>> +
>> +atomic_t rt_hash_total_count;
>> +atomic_t rt_hash_nz_count;
>> +
>> +static int rt_chain_length_max;
>> +
>>  #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
>>  	defined(CONFIG_PROVE_LOCKING)
>>  /*
>> @@ -601,6 +609,68 @@ static inline int ip_rt_proc_init(void)
>>  }
>>  #endif /* CONFIG_PROC_FS */
>>  
>> +static void rt_hash_sd_update(void)
>> +{
>> +	int temp, i;
>> +	unsigned long long sd;
>> +	int average = atomic_read(&rt_hash_total_count);
>> +	int nzcount = atomic_read(&rt_hash_nz_count);
>> +
>> +	/*
>> + 	 * Don't divide by zero
>> + 	 */
>> +	if (!nzcount)
>> +		return;
>> +
>> +	average = DIV_ROUND_UP(average, nzcount);
>> +
>> +	sd = 0;
>> +	for (i = 0; i < (1 << rt_hash_log); i++) {
>> +		temp = atomic_read(&rt_hash_table[i].chain_length);
>> +		/*
>> + 		 * Don't count zero entries, as most of the table
>> + 		 * will likely be empty.  We don't want to unfairly
>> + 		 * bias our average chain length down so far
>> + 		 */
>
> Empty chains should be accounted for, or average and standard
> deviation are not correct.
>
>> +		if (unlikely(temp))
>> +			sd += (temp-average)^2;
>
> Out of curiosity, what do you expect to do here?
>
> (temp - average) XOR 2,
> or (temp - average) * (temp - average)?
>
> Also, your computations use integer arithmetic.
>
> If avg = 2.5 and sd = 1.9, you'll find avg + 4*sd = 6 instead of 10.
>
> Anyway, we won't add so many atomic operations and double the
> hash table size just to be able to compute sd.
>
> If we really want to be smart, we can have a pretty good
> estimate of average and sd for free in rt_check_expire()
>
> Something like this untested patch.  (We should make sure
> we don't overflow sum2, for example.)
>

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..85182d9 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -125,6 +125,7 @@ static int ip_rt_redirect_silence __read_mostly	= ((HZ / 50) << (9 + 1));
>  static int ip_rt_error_cost __read_mostly	= HZ;
>  static int ip_rt_error_burst __read_mostly	= 5 * HZ;
>  static int ip_rt_gc_elasticity __read_mostly	= 8;
> +static int rt_chain_length_max __read_mostly    = 32;
>  static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
>  static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
>  static int ip_rt_min_advmss __read_mostly	= 256;
> @@ -748,11 +749,24 @@ static void rt_do_flush(int process_context)
>  	}
>  }
>  
> +/*
> + * While freeing expired entries, we compute average chain length
> + * and standard deviation, using fixed-point arithmetic.
> + * This gives an estimation of rt_chain_length_max:
> + *  rt_chain_length_max = max(elasticity, AVG + 4*SD)
> + * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
> + */
> +
> +#define FRACT_BITS 3
> +#define ONE (1UL << FRACT_BITS)
> +
>  static void rt_check_expire(void)
>  {
>  	static unsigned int rover;
>  	unsigned int i = rover, goal;
>  	struct rtable *rth, **rthp;
> +	unsigned long sum = 0, sum2 = 0;
> +	unsigned long length, samples = 0;
>  	u64 mult;
>  
>  	mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
> @@ -770,8 +784,10 @@ static void rt_check_expire(void)
>  		if (need_resched())
>  			cond_resched();
>  
> +		samples++;
>  		if (*rthp == NULL)
>  			continue;
> +		length = 0;
>  		spin_lock_bh(rt_hash_lock_addr(i));
>  		while ((rth = *rthp) != NULL) {
>  			if (rt_is_expired(rth)) {
> @@ -784,11 +800,13 @@ static void rt_check_expire(void)
>  				if (time_before_eq(jiffies, rth->u.dst.expires)) {
>  					tmo >>= 1;
>  					rthp = &rth->u.dst.rt_next;
> +					length += ONE;
>  					continue;
>  				}
>  			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
>  				tmo >>= 1;
>  				rthp = &rth->u.dst.rt_next;
> +				length += ONE;
>  				continue;
>  			}
>  
> @@ -797,6 +815,15 @@ static void rt_check_expire(void)
>  			rt_free(rth);
>  		}
>  		spin_unlock_bh(rt_hash_lock_addr(i));
> +		sum += length;
> +		sum2 += length*length;
> +	}
> +	if (samples) {
> +		unsigned long avg = sum / samples;
> +		unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
> +		rt_chain_length_max = max_t(unsigned long,
> +					    ip_rt_gc_elasticity,
> +					    (avg + 4*sd) >> FRACT_BITS);

So, I've been playing with this patch, and I've not figured out exactly what's
bothering me yet, since the math seems right, but something doesn't seem right
about the outcome of this algorithm.  I've tested with my local system, and all
works well, because the route cache is well behaved, and the sd value always
works out to be very small, so ip_rt_gc_elasticity is used.  So I've been
working through some scenarios by hand to see what this looks like with larger
numbers.  If I assume ip_rt_gc_interval is 60 and rt_hash_log is 17, my sample
count here is 7864320 samples per run.  If within that sample 393216 (about 5%)
of the buckets have one entry on the chain, and all the rest are zeros, my hand
calculations result in a standard deviation of approximately 140 and an average
of 0.4.  That implies that in that sample set any one chain could be almost 500
entries long before it triggered a cache rebuild.  Does that seem reasonable?

Best
Neil



-- 
/****************************************************
 * Neil Horman <nhorman@...driver.com>
 * Software Engineer, Red Hat
 ****************************************************/