Message-ID: <1431577041.27831.39.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Wed, 13 May 2015 21:17:21 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Herbert Xu <herbert@...dor.apana.org.au>
Cc: David Miller <davem@...emloft.net>, Thomas Graf <tgraf@...g.ch>,
netdev <netdev@...r.kernel.org>
Subject: Re: netlink & rhashtable status
On Wed, 2015-05-13 at 21:13 -0700, Eric Dumazet wrote:
> On Wed, 2015-05-13 at 20:58 -0700, Eric Dumazet wrote:
> > On Thu, 2015-05-14 at 11:34 +0800, Herbert Xu wrote:
> > > On Wed, May 13, 2015 at 08:17:43PM -0700, Eric Dumazet wrote:
> > > >
> > > > The initial bug report was on 3.18 for sure.
> > > >
> > > > (Tester had to leave the program running ~8 hours to hit the problem,
> > > > on an 8 vCPU VM)
> > > >
> > > > I can reproduce the bug quite easily (in a few seconds) on 4.0.3. I did
> > > > not spend a lot of time trying 3.18, but it seems a bit harder.
> > >
> > > No, what I'm asking is: on 3.18, was it permanent? I can imagine
> > > there being a lookup bug in 3.18 that triggers during a rehash
> > > but I cannot find any permanent corruption issues.
> >
> > Let me try to reproduce this on 3.18.13. I'll give you an update.
>
> OK, I reproduced a hang after a few minutes:
>
> Out of my 200 processes, one of them is stuck in the recvmsg() system
> call:
>
> lpaa23:~# ps aux|grep addrinfo
> root 33416 0.0 0.0 3692 376 pts/0 S+ 21:09 0:00 /bin/bash ./getaddrinfo_many.sh
> root 33417 0.0 0.0 3692 376 pts/0 S+ 21:09 0:00 /bin/bash ./getaddrinfo_many.sh
> root 33418 0.0 0.0 3744 2108 pts/0 S+ 21:09 0:00 /bin/bash ./getaddrinfo_many.sh
> root 33428 0.0 0.0 3696 1752 pts/0 S+ 21:09 0:00 /bin/bash ./getaddrinfo_many.sh
> root 33431 0.0 0.0 1172 4 pts/0 S+ 21:09 0:00 ./getaddrinfo 500
> root 34102 0.0 0.0 2600 1312 pts/1 S+ 21:11 0:00 grep addrinfo
> root 40236 0.0 0.0 3692 2920 pts/0 S+ 21:09 0:00 /bin/bash ./getaddrinfo_many.sh
> lpaa23:~# strace -p 33431
> Process 33431 attached
> recvmsg(3, ^CProcess 33431 detached
> <detached ...>
>
> lpaa23:~# lsof -p 33431
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> getaddrin 33431 root cwd DIR 8,1 12288 16394 /root
> getaddrin 33431 root rtd DIR 8,1 4096 2 /
> getaddrin 33431 root txt REG 8,1 978477 87 /root/getaddrinfo
> getaddrin 33431 root 0r CHR 1,3 0t0 2521 /dev/null
> getaddrin 33431 root 1w REG 8,1 0 6919 /root/5.out
> getaddrin 33431 root 2w REG 8,1 0 6919 /root/5.out
> getaddrin 33431 root 3u netlink 0t0 57052903 ROUTE
>
> lpaa23:~# cat /proc/net/netlink
> sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
> ffff881f6d8b8000 0 33431 00000000 0 0 0 2 0 57052903
> ffff881fe1d98400 0 0 00000000 0 0 0 2 0 3
> ffff881f6d8b8000 0 33431 00000000 0 0 0 2 0 57052903
> ffff881fe1066400 8 0 00000000 0 0 0 2 0 13355
> ffff881fe1066400 8 0 00000000 0 0 0 2 0 13355
> ffff883fe1204800 9 0 00000000 0 0 0 2 0 2056
> ffff883fe1204800 9 0 00000000 0 0 0 2 0 2056
> ffff883feecf6400 10 0 00000000 0 0 0 2 0 9602
> ffff883fe1208000 11 0 00000000 0 0 0 2 0 2051
> ffff883fe1208000 11 0 00000000 0 0 0 2 0 2051
> ffff881fe0f4ac00 16 0 00000000 0 0 0 2 0 2054
> ffff881fe0f4ac00 16 0 00000000 0 0 0 2 0 2054
>
> So it looks like we lost an skb or something....
>
Ah, the user socket is listed twice in /proc/net/netlink!

It is permanent until I kill the task:
lpaa23:~# grep ffff881f6d8b8000 /proc/net/netlink
ffff881f6d8b8000 0 33431 00000000 0 0 0 2 0 57052903
ffff881f6d8b8000 0 33431 00000000 0 0 0 2 0 57052903

After the kill, I got another hang after 20 seconds.
And we can again see a socket listed twice in /proc/net/netlink:
lpaa23:~# cat /proc/net/netlink
sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
ffff881fcac36c00 0 47169 00000000 0 0 0 2 0 59270869
ffff881fe1d98400 0 0 00000000 0 0 0 2 0 3
ffff881fcac36c00 0 47169 00000000 0 0 0 2 0 59270869
ffff881fe1066400 8 0 00000000 0 0 0 2 0 13355
ffff881fe1066400 8 0 00000000 0 0 0 2 0 13355
ffff883fe1204800 9 0 00000000 0 0 0 2 0 2056
ffff883fe1204800 9 0 00000000 0 0 0 2 0 2056
ffff883feecf6400 10 0 00000000 0 0 0 2 0 9602
ffff883fe1208000 11 0 00000000 0 0 0 2 0 2051
ffff883fe1208000 11 0 00000000 0 0 0 2 0 2051
ffff881fe0f4ac00 16 0 00000000 0 0 0 2 0 2054
ffff881fe0f4ac00 16 0 00000000 0 0 0 2 0 2054
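
For anyone who wants to watch for this condition while the reproducer runs, here is a minimal sketch that scans a /proc/net/netlink dump for socket addresses listed more than once. It assumes the column layout shown above (first field is the sk address, third field is the Pid); the function name find_duplicate_sockets is made up for illustration and is not part of any existing tool. Note that it reports every repeated address, including the Pid 0 entries that also appear twice in the dumps above, and makes no attempt to interpret them.

```python
from collections import Counter


def find_duplicate_sockets(text):
    """Return {sk_address: (pid, count)} for addresses listed more than once.

    `text` is the full content of /proc/net/netlink, assumed to start
    with a one-line header followed by whitespace-separated columns.
    """
    counts = Counter()
    pids = {}
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        sk, pid = fields[0], int(fields[2])
        counts[sk] += 1
        pids[sk] = pid
    return {sk: (pids[sk], n) for sk, n in counts.items() if n > 1}
```

On a live system the input would come from reading /proc/net/netlink; an entry with a non-zero Pid in the result corresponds to the kind of doubled user socket seen in the dumps above.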