Date: Tue, 29 Sep 2009 00:20:07 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Yakov Lerner <iler.ml@...il.com>
CC: netdev@...r.kernel.org, Eric Dumazet <eric.dumazet@...il.com>,
    David Miller <davem@...emloft.net>
Subject: Re: [PATCH] /proc/net/tcp, overhead removed

Yakov Lerner wrote:
> On Sun, Sep 27, 2009 at 12:53, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> Yakov Lerner wrote:
>>> /proc/net/tcp does 20,000 sockets in 60-80 milliseconds, with this patch.
>>>
>>> The overhead was in tcp_seq_start(). See analysis (3) below.
>>> The patch is against the Linus git tree (1). The patch is small.
>>>
>>> ------------ ----------- ------------------------------------
>>> Before patch After patch 20,000 sockets (10,000 tw + 10,000 estab) (2)
>>> ------------ ----------- ------------------------------------
>>> 6 sec        0.06 sec    dd bs=1k if=/proc/net/tcp >/dev/null
>>> 1.5 sec      0.06 sec    dd bs=4k if=/proc/net/tcp >/dev/null
>>> 1.9 sec      0.16 sec    netstat -4ant >/dev/null
>>> ------------ ----------- ------------------------------------
>>>
>>> This is a ~25x improvement.
>>> The new time does not depend on the read blocksize.
>>> Speed of netstat naturally improves too, for both -4 and -6.
>>> /proc/net/tcp6 does 20,000 sockets in 100 millisec.
>>>
>>> (1) Against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>>
>>> (2) Used the 'manysock' utility to stress the system with a large number
>>>     of sockets:
>>>       "manysock 10000 10000"    - 10,000 tw + 10,000 estab ip4 sockets.
>>>       "manysock -6 10000 10000" - 10,000 tw + 10,000 estab ip6 sockets.
>>>     Found at http://ilerner.3b1.org/manysock/manysock.c
>>>
>>> (3) Algorithmic analysis.
>>>
>>>     Old algorithm: during 'cat </proc/net/tcp', tcp_seq_start() is called
>>>     O(numsockets) times (see (4)). On average, every call to tcp_seq_start()
>>>     scans half the whole hashtable. Ouch. This is O(numsockets * hashsize).
>>>     95-99% of 'cat </proc/net/tcp' is spent in tcp_seq_start()->tcp_get_idx.
>>>     This overhead is eliminated by the new algorithm, which is
>>>     O(numsockets + hashsize).
>>>
>>>     New algorithm: O(numsockets + hashsize). We jump straight to the hash
>>>     bucket corresponding to *pos in tcp_seq_start(), without scanning half
>>>     the hash. To do this we reuse three pieces of state (st->num,
>>>     st->bucket, st->sbucket) as follows:
>>>     - we check that the requested pos >= last seen pos (st->num), the
>>>       typical case;
>>>     - if so, we jump to bucket st->bucket;
>>>     - to arrive at the right item within st->bucket, we keep in st->sbucket
>>>       the position corresponding to the beginning of that bucket.
>>>
>>> (4) Explanation of the O(numsockets * hashsize) cost of the old algorithm.
>>>
>>>     tcp_seq_start() is called once for every ~7 lines of netstat output if
>>>     the readsize is 1kb, or once for every ~28 lines if the readsize is
>>>     >= 4kb. Since each /proc/net/tcp record is 150 bytes, the number of
>>>     calls to tcp_seq_start() is
>>>         numsockets * 150 / min(4096, readsize).
>>>     Netstat uses a 4kb readsize (newer versions) or 1kb (older versions).
>>>     Note that the speed of the old algorithm does not improve above a 4kb
>>>     blocksize.
>>>
>>>     The speed of the new algorithm does not depend on blocksize, and does
>>>     not perceptibly depend on hashsize (which depends on ramsize). The
>>>     speed of the old algorithm drops with a bigger hashsize.
>>>
>>> (5) Reporting order.
>>>
>>>     Reporting order is exactly the same as before if the hash does not
>>>     change underfoot. When hash elements come and go during the report,
>>>     the reporting order will be the same as that of tcpdiag.
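(A back-of-the-envelope check of formula (4), assuming for illustration an
ehash of 2^20 buckets: 20,000 sockets at 150 bytes each is ~3 MB of text, so
the old code makes ~2,930 tcp_seq_start() calls at a 1 KB readsize and ~730
at 4 KB. Each call that rescans half of 2^20 buckets touches ~5*10^5 buckets,
roughly 1.5*10^9 bucket visits in total at 1 KB; that is consistent with the
seconds-long timings in the table above, and with why the jump-to-bucket
scheme is O(numsockets + hashsize) instead.)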
>>>
>>> Signed-off-by: Yakov Lerner <iler.ml@...il.com>
>>> ---
>>>  net/ipv4/tcp_ipv4.c |   26 ++++++++++++++++++++++++--
>>>  1 files changed, 24 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>> index 7cda24b..7d9421a 100644
>>> --- a/net/ipv4/tcp_ipv4.c
>>> +++ b/net/ipv4/tcp_ipv4.c
>>> @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
>>>  		hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
>>>  }
>>>
>>> -static void *established_get_first(struct seq_file *seq)
>>> +static void *established_get_first_after(struct seq_file *seq, int bucket)
>>>  {
>>>  	struct tcp_iter_state *st = seq->private;
>>>  	struct net *net = seq_file_net(seq);
>>>  	void *rc = NULL;
>>>
>>> -	for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
>>> +	for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
>>> +	     ++st->bucket) {
>>>  		struct sock *sk;
>>>  		struct hlist_nulls_node *node;
>>>  		struct inet_timewait_sock *tw;
>>> @@ -2036,6 +2037,11 @@ out:
>>>  	return rc;
>>>  }
>>>
>>> +static void *established_get_first(struct seq_file *seq)
>>> +{
>>> +	return established_get_first_after(seq, 0);
>>> +}
>>> +
>>>  static void *established_get_next(struct seq_file *seq, void *cur)
>>>  {
>>>  	struct sock *sk = cur;
>>> @@ -2045,6 +2051,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
>>>  	struct net *net = seq_file_net(seq);
>>>
>>>  	++st->num;
>>> +	st->sbucket = st->num;
>>
>> Hello Yakov
>>
>> The intention of your patch is very good, but it is not currently working.
>>
>> It seems you believe there is at most one entry per hash slot, or
>> something like that.
>>
>> Please reboot your test machine with "thash_entries=4096" so that the tcp
>> hash size is 4096, and try to fill 20000 tcp sockets with a test program.
>>
>> Then:
>>
>> # ss | wc -l
>> 20001
>> (ok)
>>
>> # cat /proc/net/tcp | wc -l
>> 22160
>> (not quite correct ...)
>>
>> # netstat -tn | wc -l
>> <never ends>
>>
>> # dd if=/proc/net/tcp ibs=1024 | wc -l
>> <never ends>
>>
>> Please send your next patch to netdev@...r.kernel.org and DaveM only,
>> which is where netdev people review netdev patches; there is no need to
>> include other people for first submissions.
>>
>> Thank you
>>
>> #include <sys/types.h>
>> #include <sys/socket.h>
>> #include <netinet/in.h>
>> #include <string.h>
>> #include <stdio.h>	/* perror */
>> #include <unistd.h>	/* fork, pause */
>>
>> int fdlisten;
>>
>> int main(void)
>> {
>> 	int i;
>> 	struct sockaddr_in sockaddr;
>>
>> 	fdlisten = socket(AF_INET, SOCK_STREAM, 0);
>> 	memset(&sockaddr, 0, sizeof(sockaddr));
>> 	sockaddr.sin_family = AF_INET;
>> 	sockaddr.sin_port = htons(2222);
>> 	if (bind(fdlisten, (struct sockaddr *)&sockaddr, sizeof(sockaddr)) == -1) {
>> 		perror("bind");
>> 		return 1;
>> 	}
>> 	if (listen(fdlisten, 10) == -1) {
>> 		perror("listen");
>> 		return 1;
>> 	}
>> 	if (fork() == 0) {
>> 		/* child: accept (and keep) every connection */
>> 		while (1) {
>> 			socklen_t len = sizeof(sockaddr);
>> 			int newfd = accept(fdlisten, (struct sockaddr *)&sockaddr, &len);
>> 		}
>> 	}
>> 	/* parent: open 10,000 connections to the listener */
>> 	for (i = 0; i < 10000; i++) {
>> 		int fd = socket(AF_INET, SOCK_STREAM, 0);
>> 		if (fd == -1) {
>> 			perror("socket");
>> 			break;
>> 		}
>> 		connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>> 	}
>> 	pause();
>> 	return 0;
>> }
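(To run Eric's reproducer: one plausible invocation, not spelled out in the
mail, is to build it with "gcc -O2 -o manyconn manyconn.c" and raise the
per-process fd limit first, e.g. "ulimit -n 20000", since the parent and the
accepting child each end up holding ~10,000 descriptors, then boot with
thash_entries=4096 as suggested above. The 10,000 loopback connections then
account for the ~20,000 established sockets that /proc/net/tcp has to walk.)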
> Hello Eric,
>
> I found the problem, thanks. I'll re-send after testing.

OK, good!

> In the meantime, I'd like to ask you whether it makes sense to add a
> /proc/net entry to switch between the "old way" and the "new way".
> The switch would allow a quick compare/test between the new way and the
> old way, not only by line count but by full contents, without a reboot.

Well, this switch won't be needed for patch validation, but it might help
you to test your patch, of course. Actually, I found the error by reading
your patch, and I made a quick test to confirm my understanding :)

See you tomorrow, it's rather late here :)
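To make the failure mode concrete, below is a standalone user-space sketch of
the resumable walk the patch is trying to implement, with invented names
(struct state, walk_start) and a toy four-bucket table. It is illustrative
only, neither the kernel code nor whatever fix Yakov later sent. The key
point: the saved start-of-bucket position must advance only when the walk
crosses into a new bucket. Updating it on every item, as the posted
"st->sbucket = st->num" line does, makes the resume logic skip the wrong
number of entries whenever a bucket holds more than one socket, which is
exactly the situation thash_entries=4096 forces.

#include <stdio.h>

#define NBUCKETS 4

struct node { int val; struct node *next; };

struct state {
	int num;     /* absolute position of the last item returned        */
	int bucket;  /* bucket that item lives in                          */
	int sbucket; /* absolute position of the first item in that bucket */
};

/* Toy table: bucket 1 holds three entries, the multi-entry case Eric's
 * thash_entries=4096 test forces in the real ehash. */
static struct node n2 = { 102, NULL };
static struct node n1 = { 101, &n2 };
static struct node n0 = { 100, &n1 };
static struct node m0 = { 200, NULL };
static struct node *table[NBUCKETS] = { NULL, &n0, NULL, &m0 };

/* Return the item at absolute position pos, resuming from st when
 * possible.  st->sbucket is only ever set to the base position of the
 * bucket the returned item lives in, never advanced per item. */
static struct node *walk_start(struct state *st, int pos)
{
	int b = 0, base = 0;

	if (pos >= st->num) {   /* typical case: continue forward */
		b = st->bucket;
		base = st->sbucket;
	}
	for (; b < NBUCKETS; b++) {
		int i = base;
		struct node *n;

		for (n = table[b]; n; n = n->next, i++) {
			if (i == pos) {
				st->num = pos;
				st->bucket = b;
				st->sbucket = base;
				return n;
			}
		}
		base = i;       /* position of the next bucket's first item */
	}
	return NULL;
}

int main(void)
{
	struct state st = { 0, 0, 0 };
	int pos;

	/* Restart the walk at every position, mimicking the way short
	 * reads make seq_file call tcp_seq_start() over and over. */
	for (pos = 0; ; pos++) {
		struct node *n = walk_start(&st, pos);
		if (!n)
			break;
		printf("pos %d -> %d\n", pos, n->val);
	}
	return 0;
}

Compiled with "gcc -Wall sketch.c", this prints each entry exactly once
(100, 101, 102, 200) despite the restarts. Change "st->sbucket = base" to the
per-item "st->sbucket = st->num" update and bucket 1's entries are re-emitted
and the walk need never terminate, matching both the inflated "wc -l" count
and the never-ending netstat that Eric observed.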