Message-ID: <Pine.LNX.4.64.0809112140250.26799@wrl-59.cs.helsinki.fi>
Date: Fri, 12 Sep 2008 13:16:19 +0300 (EEST)
From: "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To: "Dâniel Fraga" <fragabr@...il.com>
cc: David Miller <davem@...emloft.net>, thomas.jarosch@...ra2net.com,
billfink@...dspring.com, Netdev <netdev@...r.kernel.org>,
Patrick McHardy <kaber@...sh.net>,
netfilter-devel@...r.kernel.org, kadlec@...ckhole.kfki.hu
Subject: Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround
On Thu, 11 Sep 2008, Dâniel Fraga wrote:
> On Thu, 11 Sep 2008 16:44:20 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi> wrote:
>
> > ...I guess it would be possible to remove SCHED_FEAT_HRTICK from
> > /proc/sys/kernel/sched_features then while keeping the hrtimers
> > otherwise enabled to test this.
> >
> > It's possible that hrtimers just affect how easy it is to trigger,
> > but at least it seems a useful lead until proven otherwise.
>
> You're right Ilpo. After days and days without the problem,
> today it triggered (but I wasn't online at the time, so I couldn't grab
> any data).
Thanks. Once we know what the userspace at the server is doing when this
happens, the problem might become immediately obvious, though I'm a bit
afraid that e.g. strace could interfere with the problem so that it
resolves right away and we're again left with nothing...
> 	So, you're correct. HRtimers just affect how easy it is to
> trigger the issue. In other words: with high resolution timers enabled,
> the problem appears more frequently.
>
> 	At least if we discovered a way to trigger this, we could
> test it more easily. The problem is having to wait a long time for it
> to happen.
>
> Just a curiosity: on your servers,
I don't really have anything I would call a "server" in the sense you
mean. I might occasionally set one up for a test for a very limited
period, but normally it's just ssh and a few other services I use so
rarely that I'd hardly notice, and that's it. I was planning, however,
to set up a distcc stress test some day using all my spare cpu cycles
(I'd like to put it under kvm, but that got stalled due to a timing
issue in the guest that makes it go into an infinite loop); once I get
that working I could probably put other test-only stuff into that
framework as well.
But then, there are other people around the world besides us :-), and
afaict this is the only (outstanding) report which relates to accept()
ceasing to return connections, so I doubt it's something that occurs
very regularly, or we would have heard of it.
> do you use x86_64?
At least on some machines, but as you have discovered it seems to be
service dependent (some processes never got stuck); I might only run
such services, so who knows...
> It seems
> this problem is very specific to x86_64 or appear more often on x86_64
> than x86_32. It never happens on my x86_32 bit servers.
Ok.
--
i.