Date:	Tue, 09 Jun 2015 17:49:07 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	David Woodhouse <dwmw2@...radead.org>
Cc:	Johannes Berg <johannes@...solutions.net>,
	David Miller <davem@...emloft.net>,
	torvalds@...ux-foundation.org, marcel@...tmann.org,
	sfeldma@...il.com, netdev@...r.kernel.org, teg@...m.no
Subject: Re: Problem with patch "make nlmsg_end() and genlmsg_end() void"

On Tue, 2015-06-09 at 14:34 +0100, David Woodhouse wrote:
> On Wed, 2015-04-08 at 15:08 +0200, Johannes Berg wrote:
> > On Wed, 2015-04-08 at 13:03 +0100, David Woodhouse wrote:
> > 
> > > I'm not sure if this is entirely fixed. In Fedora 22 (4.0.0-rc5-git4)
> > > I'm occasionally seeing glibc deadlock in __check_pf() on a netlink
> > > recvmsg(), here:
> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/check_pf.c;h=162606d7;hb=glibc-2.21#l166
> > > 
> > > As I understand it, this shouldn't happen. Even if messages are
> > > dropped (which surely shouldn't happen as often as I'm seeing this),
> > > glibc should get ENOBUFS from the recvmsg() call.
> > > 
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1209433
> > > 
> > > I haven't bisected and proved that it *was* this commit which
> > > introduced the problem, as it only happens after a day or two of
> > > running Evolution and I haven't managed to trigger it more reliably.
> > 
> > I don't see the connection to this change.
> > 
> > The issue with my patch was that some code for NLM_F_DUMP would have
> > this pattern:
> > 
> >  int fill_function(...)
> >  {
> >     ...
> >     return nlmsg_end(...);
> >  }
> > 
> >  loop (...) {
> >    if (fill_function() <= 0)
> >      break; /* continue in next dump */
> >  }
> > 
> > and that all had to be converted to be just "< 0" now.
> > 
> > Additionally, the failure mode of this was the process running out of
> > memory due to receiving the same results over and over again - does that
> > happen for you? It seems it was stuck in recvmsg(), but that may just be
> > a side effect of happening to interrupt at that point?
> > 
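
To make the "<= 0" versus "< 0" point above concrete, here is a minimal userspace sketch of a resumable dump loop. It is not kernel code: fill_entry(), dump_pass(), count_passes(), N_ITEMS and BUF_ROOM are hypothetical names, and the only assumption carried over from the thread is that a fill function now returns 0 on success and a negative errno when the buffer is full.

```c
#include <assert.h>
#include <errno.h>

#define N_ITEMS  5   /* hypothetical: objects to dump in total */
#define BUF_ROOM 2   /* hypothetical: entries that fit in one dump buffer */

struct dump_buf {
    int used;        /* entries written into this buffer so far */
};

/* Stand-in for a fill function: 0 on success, -EMSGSIZE when full. */
static int fill_entry(struct dump_buf *buf, int item)
{
    (void)item;      /* a real fill function would serialize the item */
    if (buf->used >= BUF_ROOM)
        return -EMSGSIZE;
    buf->used++;
    return 0;
}

/* One dump pass: resume at *idx and stop once the buffer is full. */
static int dump_pass(int *idx)
{
    struct dump_buf buf = { 0 };

    while (*idx < N_ITEMS) {
        /* Only a negative return means "stop and continue in the next
         * dump". Testing "<= 0" here would break after every successful
         * entry without advancing *idx, so the next pass would emit the
         * same entry again. */
        if (fill_entry(&buf, *idx) < 0)
            break;
        (*idx)++;
    }
    assert(buf.used <= BUF_ROOM);
    return buf.used;
}

/* Count how many passes are needed to dump every item. */
static int count_passes(void)
{
    int idx = 0;
    int passes = 0;

    while (idx < N_ITEMS) {
        dump_pass(&idx);
        passes++;
    }
    return passes;
}
```

With the "< 0" test the resume index always moves forward, so count_passes() terminates. With "<= 0" and a fill function that returns 0 on success, the index would never advance, which would match the "same results over and over again" failure mode described above.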
> 
> I don't think the problem was introduced by your change. At 
> https://github.com/nahi/httpclient/issues/232 it seems to have been
> observed even in November of last year.
> 
> I've added some debugging, and it seems that when it deadlocks, glibc
> doesn't get *any* response to its RTM_GETADDR request. I know we'd get
> ENOBUFS if a *response* was dropped... but what about when the request
> itself is dropped? Does userspace get any hint of that? Is this purely
> a glibc bug, for assuming its request got delivered and unconditionally
> waiting for a response?
> 
> I don't know why it suddenly started happening to me in the 4.0 kernel
> when I'd never seen it before, but it's still happening. I've put a
> poll() in the glibc code (referenced above), and made it fail after a
> 5-second timeout. That will at least prevent me from throwing my
> computer out the window for the time being...
> 

Please check that this patch fixes your issue:

http://patchwork.ozlabs.org/patch/473041/
