netdev - Re: [BUG] Kernel recieves DNS reply, but doesn't deliver it to a waiting application


Message-Id: <20121021032543.09d1844f.bircoph@gmail.com>
Date:	Sun, 21 Oct 2012 03:25:43 +0400
From:	Andrew Savchenko <bircoph@...il.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org
Subject: Re: [BUG] Kernel recieves DNS reply, but doesn't deliver it to a
 waiting application

Hello,

On Sun, 14 Oct 2012 03:11:19 +0400 Andrew Savchenko wrote:
> On Sat, 13 Oct 2012 15:44:20 +0200 Eric Dumazet wrote:
> > On Sat, 2012-10-13 at 16:36 +0400, Andrew Savchenko wrote:
> > > On Wed, 3 Oct 2012 23:25:48 +0400 Andrew Savchenko wrote:
> > > > I encountered a very weird bug: after a while of uptime kernel stops to deliver
> > > > DNS reply to applications. Tcpdump shows that correct reply is received, but
> > > > strace shows inquiring application never receives it and ends with timeout,
> > > > epoll_wait() always returns 0:
> > > > a slice from: $ host kernel.org 8.8.8.8:
> [...]
> > > > In a few days I'll try 3.4.12 (I need to rebuild kernel anyway due to unrelated
> > > > issue) and will report if this bug will occur again. But please note it may
> > > > take several weeks to check this.
> > > 
> > > I got this problem again with 3.4.12 kernel. System lasted less than
> > > a week and reboot was the only option...
> > 
> > You should investigate and check where the incoming packet is lost
> > 
> > Tools :
> > 
> > netstat -s
> > 
> > drop_monitor module and dropwatch command
> > 
> > cat /proc/net/udp
> 
> Thank you for your reply; I updated my kernel to 3.4.14, enabled
> CONFIG_NET_DROP_MONITOR, and installed dropwatch utility.
> 
> I will report back when the bug strikes again.
> This may take a week or two, however.

This bug is back again on kernel 3.4.14, but this time I was able to
get debug data and to recover the running kernel without a reboot.

Dropwatch showed that DNS UDP replies are always dropped here:
1 drops at __udp_queue_rcv_skb+61 (0xffffffff813bd670)

Other observations:
- only UDP replies are lost, TCP works fine;
- if network load is reduced dramatically (ip_forward disabled, most
network daemons stopped), UDP DNS queries work again; but as the load
gradually increases, replies first become slow and then cease
altogether;
- CPU load is very low (load average reported by uptime is below
0.05), so this shouldn't be a matter of insufficient computing power.

I found the __udp_queue_rcv_skb function in net/ipv4/udp.c. From the
code and the observations above it follows that this is likely an
ENOMEM condition leading to packet loss.
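
One way to cross-check the ENOMEM hypothesis is to watch the UDP counters
that "netstat -s" reads from /proc/net/snmp. The following is a minimal
sketch, assuming (from my reading of net/ipv4/udp.c, not verified on this
machine) that an ENOMEM failure in this path is counted under both
RcvbufErrors and InErrors. The function names and the sample numbers are
hypothetical and only illustrate the parsing.

```python
def parse_udp_counters(snmp_text):
    """Return the 'Udp:' counters from /proc/net/snmp-style text as a dict.

    The file holds a header line and a value line per protocol, both
    starting with the same 'Udp:' prefix.
    """
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no 'Udp:' header/value pair found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def counter_delta(before, after):
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


# Made-up snapshots, taken before and after a failing DNS query.
BEFORE = ("Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
          "Udp: 1000 2 40 900 40 0\n")
AFTER = ("Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
         "Udp: 1000 2 41 901 41 0\n")

delta = counter_delta(parse_udp_counters(BEFORE), parse_udp_counters(AFTER))
print(delta)
```

On a live system the two snapshots would come from reading /proc/net/snmp
before and after running "host kernel.org 8.8.8.8". If RcvbufErrors grows
while InDatagrams does not, that would be consistent with the drop location
reported by dropwatch.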

This is the memory data after the bug happened:
# cat /proc/meminfo
MemTotal:        1021576 kB
MemFree:           32056 kB
Buffers:          105204 kB
Cached:           646716 kB
SwapCached:          236 kB
Active:           205932 kB
Inactive:         587156 kB
Active(anon):      20636 kB
Inactive(anon):    22488 kB
Active(file):     185296 kB
Inactive(file):   564668 kB
Unevictable:        2152 kB
Mlocked:            2152 kB
SwapTotal:        995992 kB
SwapFree:         995020 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         43120 kB
Mapped:             7504 kB
Shmem:               148 kB
Slab:             176004 kB
SReclaimable:     118636 kB
SUnreclaim:        57368 kB
KernelStack:         688 kB
PageTables:         2948 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1506780 kB
Committed_AS:      62708 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      262732 kB
VmallocChunk:   34359474615 kB
AnonHugePages:         0 kB
DirectMap4k:       33536 kB
DirectMap2M:     1013760 kB

# sysctl -a | grep mem
net.core.optmem_max = 20480
net.core.rmem_default = 229376
net.core.rmem_max = 131071
net.core.wmem_default = 229376
net.core.wmem_max = 131071
net.ipv4.igmp_max_memberships = 20
net.ipv4.tcp_mem = 22350        29801   44700
net.ipv4.tcp_rmem = 4096        87380   6291456
net.ipv4.tcp_wmem = 4096        16384   4194304
net.ipv4.udp_mem = 24150        32202   48300
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
vm.lowmem_reserve_ratio = 256   256     32
vm.overcommit_memory = 0

Sysctl memory parameters are system defaults; I haven't changed them
via sysctl or /proc interfaces.

I tried increasing the udp_mem values to the following:
net.ipv4.udp_mem = 100000       150000  200000

This solved my issue, at least for a while: DNS queries are working
fine now.
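
For scale: udp_mem is expressed in pages, not bytes, and the "mem" field on
the "UDP:" line of /proc/net/sockstat is, as far as I know, the current
usage in the same unit. Below is a minimal sketch that converts the limits
quoted above to MiB and parses the sockstat line. The 4 KiB page size is an
assumption (typical for x86_64), and the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages (typical for x86_64)


def parse_udp_mem(text):
    """Parse the three udp_mem thresholds (min, pressure, max), in pages."""
    low, pressure, high = (int(field) for field in text.split())
    return low, pressure, high


def parse_sockstat_udp_pages(sockstat_text):
    """Return the 'mem' value from the 'UDP:' line of sockstat-style text."""
    for line in sockstat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "UDP:":
            return int(fields[fields.index("mem") + 1])
    raise ValueError("no 'UDP:' line found")


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to MiB."""
    return pages * page_size / 2**20


# The two settings quoted in this message.
defaults = parse_udp_mem("24150        32202   48300")
raised = parse_udp_mem("100000       150000  200000")

for label, limits in (("default", defaults), ("raised", raised)):
    print(label, [round(pages_to_mib(pages), 1) for pages in limits])
```

Under these assumptions the raised maximum corresponds to roughly 781 MiB,
which is a large share of the 1021576 kB MemTotal shown above. Comparing
the sockstat "mem" value against the third udp_mem number while the problem
is present would show whether the accounted UDP memory is really near the
limit.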

But I suspect that there is a memory leak in the kernel UDP stack,
because this issue never happens right after reboot and always shows
up after about a week of network operation. So this increase should
help only for a month or so, if the leak is linear.

If you need some memory debug information, let me know which data
you need and what tools I should use to collect it.

Best regards,
Andrew Savchenko
