[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20121003232548.eb6b6b22.bircoph@gmail.com>
Date: Wed, 3 Oct 2012 23:25:48 +0400
From: Andrew Savchenko <bircoph@...il.com>
To: netdev@...r.kernel.org
Subject: Kernel recieves DNS reply, but doesn't deliver it to a waiting
application
Hello,
I encountered a very weird bug: after a while of uptime kernel stops to deliver
DNS reply to applications. Tcpdump shows that correct reply is recieved, but
strace shows inquiring application never recieves it and ends with timeout,
epoll_wait() always returns 0:
a slice from: $ host kernel.org 8.8.8.8:
sendmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\266\344\1\0\0\1\0\0\0\0\0\0\6k
ernel\3org\0\0\1\0\1", 28}], msg_controllen=0, msg_flags=0}, 0) = 28
epoll_wait(3, {}, 64, 0) = 0
epoll_wait(3, {}, 64, 4999) = 0
Though tcpdump shows a normal reply:
20:28:44.162897 IP 10.7.74.7.43167 > 8.8.8.8.domain: 46820+ A? kernel.org. (28)
20:28:44.221308 IP 8.8.8.8.domain > 10.7.74.7.43167: 46820 1/0/0 A 149.20.4.69
(44)
After this bug has occured, it is no longer possible to perform DNS request on
the crippled system. I tried to stop/restart all network-related daemons, to
recreate network interfaces whenever possible (e.g. pppX devices), but with no
help. I use iptables and ebtables on this host, but reseting them (flushing all
chains, removing user chains, setting all policies to ACCEPT) doesn't help. The
only worknig solution is to reboot the system.
This bug happens rarely and randomly (about once in 7-12 days on 24x7 available
production system), but I had it 5 times already. Due to rare and random nature
of the bug I can't bisect it.
This problem occured after I updated vanilla kernel from 2.6.39.4 to 3.4.6.
Afterward I updated kernel to 3.4.10 in the hope that this will fix the
problem, but with no result. (I updated kernel due to commit
2ce42ec4ef551b08d2e5d26775d838ac640f82ad, which describes somewhat similar
issue, though I don't use I/OAT engine due to lack of hardware support.)
More details, attached trace files and kernel configs are available at bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=48081
In a few days I'll try 3.4.12 (I need to rebuild kernel anyway due to unrelated
issue) and will report if this bug will occur again. But please note it may
take several weeks to check this.
Best regards,
Andrew Savchenko
Content of type "application/pgp-signature" skipped
Powered by blists - more mailing lists