netdev - Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED]

lists.openwall.net
Open Source and information security mailing list archives

Message-ID: <Pine.LNX.4.64.0811031705390.23792@wrl-59.cs.helsinki.fi>
Date:	Mon, 3 Nov 2008 17:37:09 +0200 (EET)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	"Dâniel Fraga" <fragabr@...il.com>
cc:	Thomas Gleixner <tglx@...utronix.de>,
	David Miller <davem@...emloft.net>,
	Netdev <netdev@...r.kernel.org>
Subject: Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround
 (fwd) [SOLVED]

On Sun, 2 Nov 2008, Dâniel Fraga wrote:

> On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
> "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi> wrote:
> 
> > Perhaps we could try to solve it through stracing syslogd...
> 
> 	Well Ilpo, you're right, what I'm about to write here will make
> me very ashamed, but the truth must be told! The culprit was syslogd!
> Almost unbelievable, but I had been using an old syslogd version for
> about 5 years!
> 
> 	How can I be sure that it's syslogd's fault? Simply, because I
> had a stall today and when I killed syslogd everything was back to
> normal.

Once there's any kind of flow control, anything jamming downstream will 
eventually make upstream stall as well (or appear not to be working as 
expected; sadly, from a correctness point of view it's exactly the 
opposite, as flow control is a feature in TCP, not a bug :-)). Thus I 
occasionally run into these "TCP flow control not working" reports, which 
turn out to be totally unrelated.

This still doesn't explain everything though, afaik... E.g., why did the 
sendto() to a SOCK_DGRAM socket hang?

> 	But no problem. I'll just wait a few more days to test if
> syslogd is solely responsible for this, but I'm 90% sure it is.

And you had the same old syslogd on both hosts?

In any case, deterministically losing every other character sounds like a 
real bug in syslogd, since it doesn't make much sense for that to happen 
in the kernel->syslogd communication (there I'd expect it not to show up 
in such a consistent pattern but to be more random).

> 	I apologize for thinking that it was a kernel fault.

It's not clear what caused this to happen _now_, nor the exact mechanism.

> 	Ps: just out of curiosity, I was using a syslogd binary from Mar
> 3, 2003! Extremely old! This is so old, it was compiled for Linux
> 2.2.5. Or maybe I was too lazy and copied it from another machine...

In theory this shouldn't be too big a problem, but I'm hardly an expert on 
those things, and syslogd is anyway more tightly coupled to the kernel 
than some random app.

> 	Ps3: anyway, it's interesting how a small piece of the system
> (syslogd) can generate those kinds of problems... I mean, a simple
> error on syslogd could lead to a complete stall on connections, just
> because everything is waiting for it to log through /dev/log.

This is more of a philosophical question than anything else... it's 
always a balance between data loss (i.e., possibly losing a log line about 
an important event) and the possibility of a stall. But this shouldn't be 
a concern in the case where sudo used SOCK_DGRAM (as in the strace you 
sent to the sudo people): in general, UDP doesn't guarantee reliability, 
so not delivering wouldn't be a problem, but I don't know whether the 
PF_FILE domain does something different there.

> Of course
> the problem was the binary, but it could have a time out, so even if it
> was in fact a buggy syslogd, it wouldn't cause such a stall on the
> system. I really don't know what changed from 2.6.24 to 2.6.25, but
> maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that...
> you guys know much more than me.

Until we know more than just that killing syslogd helped, it's hard to 
tell what the actual cause is. And I have no clue about the semantics of 
/dev/log anyway.
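The timeout suggested above is a client-side policy choice: wait for the logger only up to a deadline, then drop the line, trading possible log loss for a bounded stall. A minimal Python sketch, assuming Linux and a socket bound in a temporary directory as a hypothetical stand-in for `/dev/log` (`log_with_timeout` and `demo` are hypothetical names, not part of any syslog implementation):

```python
import os
import socket
import tempfile


def log_with_timeout(sock, line, timeout):
    """Hand one log line to a connected AF_UNIX datagram socket, waiting at
    most `timeout` seconds. Returns True if sent, False if the line was
    dropped because the reader did not drain its queue in time."""
    sock.settimeout(timeout)
    try:
        sock.send(line)
        return True
    except socket.timeout:  # alias of TimeoutError since Python 3.10
        return False


def demo(timeout=0.2):
    """Log to a reader that never reads until the first line is dropped.
    Returns (lines_sent, lines_dropped)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake-devlog")  # hypothetical stand-in for /dev/log
        reader = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        writer = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            reader.bind(path)     # a "syslogd" that never calls recv()
            writer.connect(path)  # connect() first, as many syslog clients do
            sent = dropped = 0
            while dropped == 0:
                if log_with_timeout(writer, b"<13>demo: hello", timeout):
                    sent += 1
                else:
                    dropped += 1
            return sent, dropped
        finally:
            writer.close()
            reader.close()


if __name__ == "__main__":
    print("sent, dropped:", demo())
```

Without the timeout, the same `send()` would block for as long as the reader stays stuck, which is the stall described in this thread.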

> 	Ps4: maybe now we can understand why nmap solved the issue...

Not very clear, but at least sudo does some writing there too.

-- 
 i.