Date:	Sun, 1 Jun 2008 08:51:34 +0300 (EEST)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Patrick McManus <mcmanus@...ksong.com>
cc:	Ingo Molnar <mingo@...e.hu>, Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Netdev <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Evgeniy Polyakov <johnpol@....mipt.ru>
Subject: Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Sat, 31 May 2008, Patrick McManus wrote:

> On Sat, 2008-05-31 at 18:35 +0200, Ingo Molnar wrote:
> > * Ilpo Järvinen <ilpo.jarvinen@...sinki.fi> wrote:
> > 
> 
> > > ...setsockopt(listenfd, SOL_TCP, TCP_DEFER_ACCEPT, &val, sizeof(val)) 
> > > seems to be the magic trick that is interesting here.
> > 
> > seems to be used:
> > 
> >  22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
> >  22003 listen(4, 10)                     = 0
> >  22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0
> > 
> > i'll queue up your reverts for testing in -tip.
> 
> 
> So the code you will revert came from my fingers. The circumstances here
> make me nervous; while I'm at a loss to explain what might be going on
> in particular, let me offer an apology in advance should the revert help
> resolve the issue.

Yes, don't worry just yet. It is far from proven that this is the cause 
(or that it makes the problem any easier to reproduce). The patch was just 
for Ingo's testing in his -tip branch. I didn't even bother to cc you yet 
because it's more or less a stab in the dark, but it's definitely still 
worth testing, even though Ingo will probably come back soon and tell us 
that it didn't help at all because it's clearly related :-).

> Here's what makes me nervous:
> 
>  * not a lot of code uses DEFER_ACCEPT.. frankly it was pretty broken
> before 26 - but not broken this way .. the correlation of your bug using
> it is significant. 
>
>  * in 26, a server TCP socket (with DA) goes to ESTABLISHED when the 3rd
> part of the handshake is received (as normal without DA), but the socket
> isn't put on the accept queue until a real data packet arrives. (That's
> the point of DA). In <= 25 this socket would have syn-recv until the
> data packet arrived.
> 
>   - I did run tests where the server died in between the handshake being
> completed and first data packet arriving - the client should see RST and
> the server socket should disappear. But maybe something was missed?
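
For readers who want to poke at the option discussed above, here is a
minimal sketch of a listener that enables TCP_DEFER_ACCEPT in the same order
as the strace output (listen first, then setsockopt). The helper names
make_deferred_listener and get_defer_accept are hypothetical and are not
taken from distcc or the kernel; the loopback address and the kernel-chosen
port are assumptions made only to keep the example self-contained.
IPPROTO_TCP is used as the level, which is the same value strace prints as
SOL_TCP.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP listener on 127.0.0.1 with
 * TCP_DEFER_ACCEPT set to defer_secs. Port 0 lets the kernel pick a free
 * port. Returns the listening fd, or -1 on error. */
int make_deferred_listener(int defer_secs)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;

    /* Same order as in the strace: listen(), then setsockopt(). */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 10) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                   &defer_secs, sizeof(defer_secs)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: read the option back. Returns the value the kernel
 * reports (in seconds), or -1 on error. */
int get_defer_accept(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &val, &len) < 0)
        return -1;
    return val;
}
```

With such a listener, a client that completes the handshake but sends no
data should not be handed to accept() until its first data segment arrives;
which state the server-side socket sits in meanwhile is exactly the 26 vs.
<= 25 difference described above.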

In Ingo's case, too, the RST seems to be missing, i.e., there's unread data 
and both ends remain ESTABLISHED while the receiver is already gone (or 
no longer references the connection correctly).

> Do I understand this correctly, the server process is gone but the
> socket is still in the table? And the client process is still there
> waiting for the server to do something - having sent a bunch of data?

Yes, this seems to be the case; the sender was doing window probes because 
the window had dropped to zero.

Because it's distcc, tracking a particular process is not a simple task. 
Either the process is gone or it no longer references the connection 
correctly.

> Do we know if any data bytes (not handshake bytes) have been consumed by
> the server side? If they were, that would seem to vindicate DA.

We don't know. We cannot currently track the particular process, which 
would definitely be helpful here.
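
One possible way to narrow this down from userspace, sketched here under
assumptions: scan /proc/net/tcp for ESTABLISHED sockets that still have
unread bytes queued and look at the inode column. An inode of 0 means no
file descriptor owns the socket any more; a non-zero inode can be matched
against the socket:[inode] links under /proc/<pid>/fd to find the owning
process. The struct and function names below are hypothetical, and the
"ESTABLISHED with a non-empty rx_queue" test is only a heuristic, not a
definition of the bug.

```c
#include <stdio.h>

/* One parsed row of /proc/net/tcp (IPv4). Field names are our own. */
struct tcp_row {
    unsigned int local_addr, local_port;
    unsigned int rem_addr, rem_port;
    unsigned int state;              /* 0x01 = ESTABLISHED, 0x0A = LISTEN */
    unsigned int tx_queue, rx_queue; /* unacked / unread bytes */
    unsigned long inode;             /* 0 when no fd owns the socket */
};

/* Parse one data line of /proc/net/tcp. Returns 1 on success and 0 for
 * anything that does not look like a data row (e.g. the header line). */
int parse_tcp_row(const char *line, struct tcp_row *r)
{
    unsigned int tr, when, retrnsmt, uid;
    int timeout;
    int n = sscanf(line,
                   " %*d: %X:%X %X:%X %X %X:%X %X:%X %X %u %d %lu",
                   &r->local_addr, &r->local_port,
                   &r->rem_addr, &r->rem_port,
                   &r->state, &r->tx_queue, &r->rx_queue,
                   &tr, &when, &retrnsmt, &uid, &timeout, &r->inode);
    return n == 13;
}

/* Heuristic only: ESTABLISHED with unread data still queued. */
int looks_stuck(const struct tcp_row *r)
{
    return r->state == 0x01 && r->rx_queue > 0;
}

/* Print every candidate row found in the given file (normally
 * "/proc/net/tcp"). Returns the number of candidates, or -1 on error. */
int scan_tcp_table(const char *path)
{
    char line[512];
    struct tcp_row r;
    int hits = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (parse_tcp_row(line, &r) && looks_stuck(&r)) {
            printf("port %u -> port %u: rx_queue=%u inode=%lu%s\n",
                   r.local_port, r.rem_port, r.rx_queue, r.inode,
                   r.inode == 0 ? " (no owner)" : "");
            hits++;
        }
    }
    fclose(f);
    return hits;
}
```

Calling scan_tcp_table("/proc/net/tcp") while a distcc connection is hung
would list the candidates; a busy but healthy receiver can show up as well,
so the output needs to be sampled more than once before drawing conclusions.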

> Also pointing away from DA is that you started seeing this with rc3 -
> that code was included in rc1. Is that a firm observation, or maybe there
> weren't enough datapoints to conclude that rc1 and rc2 were clean?

Yes, the timeline doesn't match too well. I also find it quite unlikely, 
but it's still worth testing because it's hard to know when this began; 
luck might have played some role there because the problem is quite 
elusive in Ingo's case anyway.

Anything you find suspicious between rc1..rc3?

...I suspected my rc3 FRTO fixes first but they have nothing to do with 
window probing and orphan handling.

> The most interesting patch is ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
> if anyone wants to eyeball it.

I personally think it might just as well be some other issue which merely 
became more visible after DA, but let's wait until Ingo has some results, 
which may well show that DA is not what makes it visible in his case. 
...Also, I doubt Arjan's MUA has anything to do with DA.


-- 
 i.
