Message-ID: <Pine.LNX.4.64.0811031705390.23792@wrl-59.cs.helsinki.fi>
Date: Mon, 3 Nov 2008 17:37:09 +0200 (EET)
From: "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To: "Dâniel Fraga" <fragabr@...il.com>
cc: Thomas Gleixner <tglx@...utronix.de>,
David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>
Subject: Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround
(fwd) [SOLVED]
On Sun, 2 Nov 2008, Dâniel Fraga wrote:
> On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
> "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi> wrote:
>
> > Perhaps we could try to solve it through stracing syslogd...
>
> Well Ilpo, you're right, what I'm about to write here will make
> me very ashamed, but the truth must be told! The culprit was syslogd!
> Almost unbelievable, but I had been using an old syslogd version for
> about 5 years!
>
> How can I be sure that it's syslogd's fault? Simply because I
> had a stall today and when I killed syslogd everything was back to
> normal.
Once there's any kind of flow control, anything jamming downstream will
eventually make the upstream stall as well (or appear not to work as
expected). Sadly, that's exactly backwards from a correctness point of
view, as flow control is a feature in TCP, not a bug :-). Thus I
occasionally run into these "tcp with flow control not working" reports
which turn out to be totally unrelated.
This still doesn't explain everything though, afaik... E.g., why did the
sendto() to the SOCK_DGRAM socket hang?
> But no problem. I'll just wait a few more days to test if
> syslogd is the only one responsible for this, but I'm 90% sure it is.
And you had the same old syslogd on both hosts?
In any case, the deterministic loss of every other character sounds
like a real bug in syslogd, since it doesn't make much sense for that to
happen in kernel->syslogd communication (where I'd expect it not to show
up in such a consistent pattern but to look more random).
> I apologize for thinking that it was a kernel fault.
It's not clear what caused this to happen _now_, nor the exact mechanism.
> Ps: just for curiosity, I was using a syslogd binary from Mar,
> 3, 2003! Extremely old! This is so old, it was compiled for Linux
> 2.2.5. Or maybe I was too lazy and copied it from another machine...
In theory this shouldn't be too big a problem, but I'm hardly an expert on
those things, and syslogd is anyway more tightly coupled to the kernel than
some random app.
> Ps3: anyway, it's interesting how a small piece of the system
> (syslogd) can generate those kinds of problems... I mean, a simple
> error on syslogd could lead to a complete stall on connections, just
> because everything is waiting for it to log through /dev/log.
This is more of a philosophical question than anything else... it's
always a balance between data loss (= possibly losing a log line about an
important event) and the possibility of a stall. But this shouldn't be a
concern in the case where SOCK_DGRAM was used by sudo (like in the
strace you sent to the sudo people). In general UDP doesn't guarantee
reliability, so not delivering wouldn't be a problem, but I don't know if
the PF_FILE domain does something different there.
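For what it's worth, unix(7) on Linux describes UNIX domain datagram
sockets as reliable, so unlike UDP a sender can end up waiting once the
receiver stops reading. A minimal sketch of that behaviour, assuming Linux
and Python 3. The function name, the payload size and the loop cap are made
up for illustration, and this is not a reproduction of what sudo or
syslogd actually did:

```python
import socket


def count_sends_until_full(payload=b"x" * 64, limit=100_000):
    """Send datagrams to a peer that never reads.

    Returns (datagrams_sent, would_block). would_block is True if the
    kernel refused a further datagram because the queue was full.
    """
    # A connected pair of UNIX domain datagram sockets (hypothetical
    # stand-in for a client and a /dev/log style receiver).
    receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    # Non-blocking, so a full queue shows up as an error, not as a hang.
    sender.setblocking(False)
    sent = 0
    try:
        while sent < limit:
            try:
                sender.send(payload)
            except BlockingIOError:
                # A blocking socket would sleep here until the receiver
                # reads something, i.e. the stall discussed above.
                return sent, True
            sent += 1
        return sent, False
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    sent, would_block = count_sends_until_full()
    print("datagrams queued:", sent, "would block:", would_block)
```

How many datagrams fit depends on the kernel's queue and buffer limits, so
the count itself is not meaningful, only the fact that sending eventually
stops.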
> Of course
> the problem was the binary, but it could have a time out, so even if it
> was in fact a buggy syslogd, it won't cause such a stall on the
> system. I really don't know what changed from 2.6.24 to 2.6.25, but
> maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that...
> you guys know much more than me.
Until we know more than that killing syslogd helped, it's hard to
tell what the actual cause is. And I have no clue about the semantics of
/dev/log anyway.
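On the timeout idea: a logging client can bound how long it waits on the
log socket and drop the line instead of stalling, which is the data loss
vs. stall trade-off from earlier. A rough sketch, assuming Linux and
Python 3. send_log_line and the 0.2 s default are hypothetical, and I am
not claiming any syslog client or kernel version behaved this way:

```python
import socket


def send_log_line(sock, line, timeout=0.2):
    """Try to send one log line, giving up after `timeout` seconds.

    Returns True if the datagram was queued, False if it was dropped
    because the receiver's queue stayed full for the whole timeout.
    """
    sock.settimeout(timeout)
    try:
        sock.send(line)
        return True
    except socket.timeout:
        # Receiver is not draining its queue: lose the line, not the caller.
        return False


if __name__ == "__main__":
    # Stand-in for a client and a stuck log daemon that never reads.
    daemon, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    queued = 0
    while queued < 100_000 and send_log_line(client, b"test line", 0.05):
        queued += 1
    print("lines queued before the first drop:", queued)
    client.close()
    daemon.close()
```

The caller keeps running at the cost of a lost log line, so whether this
is acceptable depends on how important the lost event might have been.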
> Ps4: maybe now we can understand why nmap solved the issue...
Not very clear, but at least sudo does some writing there too.
--
i.