netdev - Re: Scalability of interface creation and deletion

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1304793174.3207.22.camel@edumazet-laptop>
Date:	Sat, 07 May 2011 20:32:54 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Alex Bligh <alex@...x.org.uk>
Cc:	netdev@...r.kernel.org
Subject: Re: Scalability of interface creation and deletion

Le samedi 07 mai 2011 à 19:24 +0100, Alex Bligh a écrit :
> Eric,
> 
> --On 7 May 2011 18:26:29 +0200 Eric Dumazet <eric.dumazet@...il.com> wrote:
> 
> > Here, on 2.6.38 kernel (Ubuntu 11.04 provided, on my 2 core laptop)
> ># time rmmod dummy
> > real	0m0.111s
> ...
> > On another machine with a very recent kernel :
> > $ modprobe dummy numdummies=1
> > $ ifconfig dummy0 192.168.46.46 up
> > $ time rmmod dummy
> >
> > real	0m0.032s
> 
> I know it's different machines, but that's a pretty significant
> difference. So I compiled from 2.6.39-rc6 head (i.e. a kernel
> less than an hour old), with only your suggested change in,
> so that (a) I could eliminate old kernels, and (b) I could
> instrument it.
> 
> > synchronize_rcu() calls are not consuming cpu, they just _wait_
> > rcu grace period.
> >
> > I suggest you read Documentation/RCU files if you really want to :)
> 
> I understand the basic point: it needs to wait for all readers
> to drop their references. It's sort of hard to understand why
> on a machine with an idle network there would be reader(s) holding
> references for 250ms. And indeed the analysis below shows that
> isn't the case (it's more like 44 ms).
> 
> > If you want to check how expensive it is, its quite easy:
> > add a trace in synchronize_net()
> 
> At least for veth devices, I see the same on 2.6.39-rc6 - if anything
> it's worse:
> 
> # ./ifseq -n 100
> Sat May 7 17:50:53 UTC 2011 creating 100 interfaces
> Sat May 7 17:50:54 UTC 2011 done
> 
> real	0m1.549s
> user	0m0.060s
> sys	0m0.990s
> Sat May 7 17:50:54 UTC 2011 deleting 100 interfaces
> Sat May 7 17:51:22 UTC 2011 done
> 
> real	0m27.917s
> user	0m0.420s
> sys	0m0.060s
> 
> Performing that operation produced exactly 200 calls to synchronize net.
> The timestamps indicate that's 2 per veth pair deletion, and zero
> per veth pair creation.
> 
> Analysing the resultant logs shows only 31% of the problem is
> time spent within synchronize_net() (perl script below).
> 
> $ ./analyse.pl < syncnet | tail -2
> Total 18.98515 Usage 199 Average 0.09540 elsewhere
> Total 8.77581 Usage 200 Average 0.04388 synchronizing
> 
> So *something* is spending more than twice as much time as
> synchronize_net().
> 
> I've attached the log below as well.
> 
> -- 
> Alex Bligh
> 
> 
> $ cat analyse.pl
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> 
> my $lastuptime;
> my $uptime;
> my $diff;
> my $area;
> my %time;
> my %usage;
> 
> while (<>)
> {
>     chomp;
>     if (m/\[\s*([0-9.]+)\].*synchronize_net/)
>     {
> 	$uptime = $1;
> 	if (defined($lastuptime))
> 	{
> 	    $area = (m/end/)?"synchronizing":"elsewhere";
> 	    $diff = $uptime - $lastuptime;
> 	    printf "%5.5f $area\n", $diff;
> 	    $time{$area}+=$diff;
> 	    $usage{$area}++;
> 	}
> 	$lastuptime = $uptime;
>     }
> }
> 
> print "\n";
> 
> my $k;
> foreach $k (sort keys %time)
> {
>     printf "Total %5.5f Usage %d Average %5.5f %s\n", $time{$k}, 
> $usage{$k}, $time{$k}/$usage{$k}, $k;
> }
> 
> 
> 
> May  7 17:50:55 nattytest kernel: [  127.490142] begin synchronize_net()
> May  7 17:50:55 nattytest kernel: [  127.560084] end synchronize_net()
> May  7 17:50:55 nattytest kernel: [  127.610350] begin synchronize_net()
> May  7 17:50:55 nattytest kernel: [  127.610932] end synchronize_net()
> May  7 17:50:55 nattytest kernel: [  127.740078] begin synchronize_net()
> May  7 17:50:55 nattytest kernel: [  127.820071] end synchronize_net()

Well, there is also one rcu_barrier() call that is expensive.
(It was changed from one synchronize_rcu() to one rcu_barrier() lately
in commit ef885afb , in 2.6.36 kernel)

net/core/dev.c line 5167


http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ef885afbf8a37689afc1d9d545e2f3e7a8276c17

netdev_wait_allrefs() waits that all references to a device vanishes.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users reported that no more than 4 devices can be dismantled per
second, this is a pretty serious problem for some setups.

Most of the time, a refcount is about to be released by an RCU callback,
that is still in flight because rollback_registered_many() uses a
synchronize_rcu() call instead of rcu_barrier(). Problem is visible if
number of online cpus is one, because synchronize_rcu() is then a no op.

time to remove 50 ipip tunnels on a UP machine :

before patch : real 11.910s
after patch : real 1.250s

Reported-by: Nicolas Dichtel <nicolas.dichtel@...nd.com>
Reported-by: Octavian Purdila <opurdila@...acom.com>
Reported-by: Benjamin LaHaise <bcrl@...ck.org>
Signed-off-by: Eric Dumazet <eric.dumazet@...il.com>
Signed-off-by: David S. Miller <davem@...emloft.net>



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html