Date:	Mon, 6 Jul 2009 17:54:50 -0700
From:	Emil S Tantilov <emils.tantilov@...il.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	"Tantilov, Emil S" <emil.s.tantilov@...el.com>,
	NetDev <netdev@...r.kernel.org>,
	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
	"Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>,
	Jiri Olsa <jolsa@...hat.com>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: WARNING: at include/net/sock.h:417 udp_lib_unhash

On Wed, Jul 1, 2009 at 11:10 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> Tantilov, Emil S wrote:
>> I see the following trace during netperf stress testing with mixed UDP/TCP IPv4/6 traffic. This is on recent pulls from net-2.6 and net-next.
>>
>> [45197.989163] ------------[ cut here ]------------
>> [45197.994309] WARNING: at include/net/sock.h:417 udp_lib_unhash+0x81/0xab()
>> [45197.994311] Hardware name: X7DA8
>> [45197.994314] Modules linked in: e1000 [last unloaded: e1000]
>> [45197.994326] Pid: 7110, comm: netserver Tainted: G        W  2.6.31-rc1-net-next-e1000-06250902 #8
>> [45197.994331] Call Trace:
>> [45197.994336]  [<ffffffff8135e0dc>] ? udp_lib_unhash+0x81/0xab
>> [45197.994344]  [<ffffffff8103cac9>] warn_slowpath_common+0x77/0x8f
>> [45197.994349]  [<ffffffff8103caf0>] warn_slowpath_null+0xf/0x11
>> [45197.994352]  [<ffffffff8135e0dc>] udp_lib_unhash+0x81/0xab
>> [45197.994357]  [<ffffffff81301acb>] sk_common_release+0x2f/0xb4
>> [45197.994364]  [<ffffffff813a0256>] udp_lib_close+0x9/0xb
>> [45197.994369]  [<ffffffff81364259>] inet_release+0x58/0x5f
>> [45197.994374]  [<ffffffff8138c8bd>] inet6_release+0x30/0x35
>> [45197.994383]  [<ffffffff812ff273>] sock_release+0x1a/0x6c
>> [45197.994386]  [<ffffffff812ff763>] sock_close+0x22/0x26
>> [45197.994392]  [<ffffffff810c69a0>] __fput+0xf0/0x18c
>> [45197.994395]  [<ffffffff810c6d00>] fput+0x15/0x19
>> [45197.994399]  [<ffffffff810c3c3e>] filp_close+0x5c/0x67
>> [45197.994404]  [<ffffffff810c3cc4>] sys_close+0x7b/0xb6
>> [45197.994412]  [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
>> [45197.994418] ---[ end trace 5acab6fc0afdaaa3 ]---
>>
>> Emil--
>
> Thanks for this report, Emil.
>
> I could not find a recent change in this area in the latest kernels.
> If the struct sock is hashed (sk_hashed() returns true), then sk_refcnt was
> incremented in sk_nulls_add_node_rcu(), so its value should be >= 2.
>
> Maybe we have a missing memory barrier somewhere, or a list corruption.
>
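For context, my reading of the invariant being checked -- paraphrased
from the 2.6.31 sock.h helpers rather than copied verbatim, so treat it
as a sketch:

        /* hash side: putting the socket on the hash list takes a reference */
        static __inline__ void sk_nulls_add_node_rcu(struct sock *sk,
                                                     struct hlist_nulls_head *list)
        {
                sock_hold(sk);          /* atomic_inc(&sk->sk_refcnt) */
                __sk_nulls_add_node_rcu(sk, list);
        }

        /* unhash side: a hashed socket still holds that reference, so
         * observing sk_refcnt == 1 here should be impossible -- this is
         * the WARN_ON that fires at include/net/sock.h:417 */
        static __inline__ int sk_nulls_del_node_init_rcu(struct sock *sk)
        {
                int rc = __sk_nulls_del_node_init_rcu(sk);

                if (rc) {
                        WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
                        __sock_put(sk);
                }
                return rc;
        }

So the warning means the unhashing CPU saw a refcount of 1 on a socket
that was still on the hash list.
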
> 1) Could you try CONFIG_DEBUG_LIST=y?
I am running a test with this option now. Sorry for the delayed
response; I was out last week.
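
For anyone else trying to reproduce this: the option lives under the
"Kernel hacking" menu, and a minimal .config fragment (assuming a
standard kernel build) would be:

        # Kernel hacking ---> Debug linked list manipulation
        CONFIG_DEBUG_LIST=y

DEBUG_LIST adds sanity checks to list add/delete, so a corruption
should show up closer to where it actually happens.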

> 2) Could you give the model of the CPU, since this reminds me of the ongoing discussion raised by Jiri Olsa.

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping	: 6
cpu MHz		: 2999.790
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow
vnmi flexpriority
bogomips	: 5999.58
clflush size	: 64

Two quad-core Xeons; I only included the output from the first CPU to reduce clutter.

> CPU1 does an atomic_inc(&sk->sk_refcnt): refcnt changes from 1 to 2,
> then CPU2 does an atomic_read(&sk->sk_refcnt) and reads 1 instead of 2.
>
> David, maybe this test is not safe, and if we really want to do a check
> we need to use a stronger atomic function.
>
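Sketching the interleaving you describe, plus one guess at what a
"stronger" primitive could look like (purely illustrative on my side,
not a proposed patch):

        /*
         * CPU1 (hash side)                    CPU2 (unhash side)
         * ----------------                    ------------------
         * atomic_inc(&sk->sk_refcnt);
         *    ... refcnt is now 2 ...
         *                                     atomic_read(&sk->sk_refcnt);
         *                                        ... still observes 1 ...
         *                                     WARN_ON fires
         *
         * As I read Documentation/atomic_ops.txt, atomic_inc() and
         * atomic_read() carry no memory-barrier semantics on their own;
         * only the value-returning variants imply barriers.
         */

        /* e.g. a fully ordered read of the counter (my guess at what a
         * "stronger atomic function" might mean here): */
        WARN_ON(atomic_add_return(0, &sk->sk_refcnt) == 1);
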
> If you can reproduce this problem easily, could you try the following patch?

It varies from a few minutes to hours, but it does reproduce
consistently in my tests so far. I will try your patch once I manage
to get a trace with CONFIG_DEBUG_LIST enabled.

Thanks,
Emil
