Message-ID: <Pine.LNX.4.64.0809112140250.26799@wrl-59.cs.helsinki.fi>
Date: Fri, 12 Sep 2008 13:16:19 +0300 (EEST)
From: "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To: "Dâniel Fraga" <fragabr@...il.com>
cc: David Miller <davem@...emloft.net>, thomas.jarosch@...ra2net.com,
billfink@...dspring.com, Netdev <netdev@...r.kernel.org>,
Patrick McHardy <kaber@...sh.net>,
netfilter-devel@...r.kernel.org, kadlec@...ckhole.kfki.hu
Subject: Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround
On Thu, 11 Sep 2008, Dâniel Fraga wrote:
> On Thu, 11 Sep 2008 16:44:20 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi> wrote:
>
> > ...I guess it would be possible to remove SCHED_FEAT_HRTICK from
> > /proc/sys/kernel/sched_features then while keeping the hrtimers
> > otherwise enabled to test this.
> >
> > It's possible that hrtimers just affect how easy it is to trigger,
> > but at least it seems a useful lead until proven otherwise.
>
> You're right Ilpo. After days and days without the problem,
> today it triggered (but I wasn't online at the time, so I couldn't grab
> any data).
Thanks. Once we know what the userspace at the server is doing when this
happens, the problem might become immediately obvious, though I'm a bit
afraid that e.g. strace could interfere with the problem so that it
resolves right away and we're again left with nothing...
> 	So, you're correct. HRtimers just affect how easy it is to
> trigger the issue. In other words: with high resolution timers enabled,
> the problem appears more frequently.
>
> 	At least if we discovered a way to trigger this, we could
> test it more easily. The problem is having to wait a long time for it
> to happen.
>
> Just a curiosity: on your servers,
I don't really have anything I would call a "server" in the sense you
mean. I might occasionally set one up for a test for a very limited
period, but normally it's just ssh and a few other services I use so
rarely that I'd hardly notice, and that's it. I was planning, however,
to set up a distcc stress test some day using all my spare cpu cycles
(I'd like to put it under kvm, but that got stalled due to a timing
issue in the guest that makes it go into an infinite loop); once I get
that working I could probably put other test-only stuff into that
framework as well.
But then, there are other people around the world besides us :-), and
afaict this is the only (outstanding) report which relates to accept()
ceasing to return connections, so I doubt it's something that occurs
very regularly, or we would have heard of it.
> do you use x86_64?
At least on some machines, but as you have discovered it seems to be
service dependent (some processes never got stuck); I might only run
such services, so who knows...
> It seems
> this problem is very specific to x86_64 or appear more often on x86_64
> than x86_32. It never happens on my x86_32 bit servers.
Ok.
--
i.