Date: Mon, 22 Oct 2018 07:22:29 +0200
From: Henning Rogge <henning.rogge@...e.fraunhofer.de>
To: <netdev@...r.kernel.org>
CC: Stephen Hemminger <stephen@...workplumber.org>
Subject: Re: [rtnetlink] Potential bug in Linux (rt)netlink code

Does anyone else have an idea how to debug this problem?

Henning Rogge

On 15.10.2018 at 07:25, Henning Rogge wrote:
> On 12.10.2018 at 20:51, Stephen Hemminger wrote:
>> On Fri, 12 Oct 2018 09:30:40 +0200
>> Henning Rogge <henning.rogge@...e.fraunhofer.de> wrote:
>>
>>> Hi,
>>>
>>> I am working on a self-written routing agent
>>> (https://github.com/OLSR/OONF) and am stuck on a problem with netlink
>>> that I cannot explain with a userspace error.
>>>
>>> I am using a netlink socket for setting routes
>>> (RTM_NEWROUTE/RTM_DELROUTE), querying the kernel for the current routes
>>> in the database (via an RTM_GETROUTE dump) and for getting multicast
>>> messages for ongoing routing changes.
>>>
>>> After a few netlink messages I get to the point where the kernel just
>>> does not respond to an RTM_NEWROUTE. No error, no answer (despite the
>>> NLM_F_ACK flag being set)... but sometimes when (during shutdown of the
>>> routing agent) the program sends another route command (most times an
>>> RTM_DELROUTE) I get a single netlink packet with a "successful" response
>>> for both the "missing" RTM_NEWROUTE and one for the new RTM_DELROUTE
>>> sequence number.
>>>
>>> I am testing two routing agents, each of them in a systemd-nspawn based
>>> container connected over a bridge on the host system on a current Debian
>>> Testing (kernel 4.18.0-1-amd64).
>>>
>>> I am directly using the netlink sockets, without any other userspace
>>> library in between.
>>>
>>> I have checked the hexdumps of a couple of netlink messages (including
>>> the ones just before the bug happens) by hand and they seem to be okay.
>>>
>>> When I tried to add a "netlink listener" socket for further debugging (ip
>>> link add nlmon0 type nlmon) the problem vanished until I removed the
>>> listener socket again.
>>>
>>> Any ideas how to debug this problem? Unfortunately I have no short
>>> example program to trigger the bug... I have rarely seen the problem for
>>> years (once every couple of months), but until a few days ago I never
>>> managed to reproduce it.
>>>
>>> Henning Rogge
>>
>> Are you reading the responses to your requests? If you don't read
>> the response, the socket will get flow blocked.
>
> Yes, I do...
>
> All netlink sockets the program uses are constantly watched for traffic
> coming from the kernel (with an epoll()-based event loop, not
> edge-triggered).
>
> I even have a rate limitation towards the kernel, only sending a
> "pagesize" full of netlink data towards the kernel, then waiting for the
> reply before sending more (I had the blocking problem a few years ago
> when experimenting with LOTS of routes).
>
> Henning Rogge

Henning Rogge
-- 
Diplom-Informatiker Henning Rogge, Fraunhofer-Institut für
Kommunikation, Informationsverarbeitung und Ergonomie FKIE
Kommunikationssysteme (KOM)
Zanderstrasse 5, 53177 Bonn, Germany
Phone +49 228 50212-469
mailto:henning.rogge@...e.fraunhofer.de http://www.fkie.fraunhofer.de
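
One possible way to narrow down where an acknowledgement goes missing is to record every request sent with NLM_F_ACK by sequence number and report those whose NLMSG_ERROR reply is overdue. The Python sketch below is not taken from OONF; the names `build_request`, `build_ack`, `parse_acks` and `AckTracker` are hypothetical, and the numeric constants are assumed to match <linux/netlink.h> and <linux/rtnetlink.h>. It only packs and parses netlink message headers and does not open a socket.

```python
import struct

# Constants assumed from <linux/netlink.h> and <linux/rtnetlink.h>.
NLMSG_ERROR = 2
RTM_NEWROUTE = 24
RTM_DELROUTE = 25
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSGHDR = struct.Struct("=IHHII")


def build_request(msg_type, seq, payload=b""):
    """Pack a request header that asks the kernel for an acknowledgement."""
    length = NLMSGHDR.size + len(payload)
    return NLMSGHDR.pack(length, msg_type, NLM_F_REQUEST | NLM_F_ACK, seq, 0) + payload


def build_ack(seq, error=0, orig_type=RTM_NEWROUTE):
    """Pack an NLMSG_ERROR message (error == 0 means success); for testing only."""
    orig = NLMSGHDR.pack(NLMSGHDR.size, orig_type, NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    payload = struct.pack("=i", error) + orig
    return NLMSGHDR.pack(NLMSGHDR.size + len(payload), NLMSG_ERROR, 0, seq, 0) + payload


def parse_acks(buf):
    """Return [(seq, errno)] for every NLMSG_ERROR message in one datagram."""
    acks = []
    offset = 0
    while offset + NLMSGHDR.size <= len(buf):
        length, msg_type, _flags, seq, _pid = NLMSGHDR.unpack_from(buf, offset)
        if length < NLMSGHDR.size or offset + length > len(buf):
            break  # truncated or malformed message; stop parsing
        if msg_type == NLMSG_ERROR:
            (error,) = struct.unpack_from("=i", buf, offset + NLMSGHDR.size)
            acks.append((seq, -error))  # kernel stores a negative errno
        offset += (length + 3) & ~3  # messages are aligned to 4 bytes
    return acks


class AckTracker:
    """Remember when each request was sent and which ones are still unanswered."""

    def __init__(self):
        self.pending = {}  # sequence number -> send timestamp

    def sent(self, seq, now):
        self.pending[seq] = now

    def received(self, buf):
        for seq, _errno in parse_acks(buf):
            self.pending.pop(seq, None)

    def overdue(self, now, timeout):
        return sorted(s for s, t in self.pending.items() if now - t > timeout)
```

Logging the result of `overdue()` from the event loop's timer, together with the sequence numbers found in each received datagram, would show whether a late acknowledgement arrives bundled with the next one, as described in the report above.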