netdev - Re: [Bugme-new] [Bug 10767] New: Seg Fault Instead of Swapping


Message-ID: <Pine.LNX.4.64.0805231302410.16829@wrl-59.cs.helsinki.fi>
Date:	Fri, 23 May 2008 13:25:23 +0300 (EEST)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Brian Vowell <brian.vowell@...il.com>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Netdev <netdev@...r.kernel.org>, bugme-daemon@...zilla.kernel.org
Subject: Re: [Bugme-new] [Bug 10767] New: Seg Fault Instead of Swapping

On Wed, 21 May 2008, Brian Vowell wrote:

> On 5/21/08, Ilpo Järvinen <ilpo.jarvinen@...sinki.fi> wrote:
> >
> > On Wed, 21 May 2008, Brian Vowell wrote:
> >
> > ...This TCP warning isn't a sign of a problem that would cause corruption of
> > any kind (and the state inconsistency gets repaired there as well!). We're
> > already debugging this with a couple of other people, so unless you have
> > a very good reproducer case you'll just have to wait a bit.
> 
> I don't mind waiting.  I just thought that since this happened so quickly
> after upgrading to the 2.6.25.4 kernel that it might be something that was
> related to changes from 2.6.25.3.

It's just due to luck (that's what makes debugging it hard). Nothing even 
remotely related changed in 2.6.25.3 -> 2.6.25.4 :-). Of course, if you 
really want to, you can always apply the debug patch that the other people 
tracking the 2539 warning already use:

  http://marc.info/?l=linux-netdev&m=120972551120080&w=2

It will add some expensive verifications that are run multiple times per 
ACK to validate TCP's internal consistency, which is normally just 
"assumed" to be valid for performance reasons. Please note what I said 
about CONFIG_LOG_BUF_SHIFT there so that the previous disaster doesn't 
repeat itself... :-) Then you just need to occasionally check the kernel 
logs to see whether it has triggered.
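
The occasional log check could be scripted along these lines. This is only a minimal sketch: the function name find_tcp_warnings is hypothetical, and it assumes nothing about the kernel's log format beyond the source location (net/ipv4/tcp_input.c:2539) appearing somewhere in the warning line.

```python
import re

# Assumption: the warning line mentions the source location named in this
# thread. The rest of the kernel log line format is deliberately not assumed.
WARNING_SITE = re.compile(r"net/ipv4/tcp_input\.c:2539")


def find_tcp_warnings(log_text):
    """Return (line_number, line) pairs for log lines that mention the warning site."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if WARNING_SITE.search(line)
    ]
```

The function takes the log as a plain string, so it can be fed the captured output of dmesg or the contents of a saved log file. An empty result means the warning has not shown up in that text.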

> The only thing that I can do in regards to assisting is to provide info
> about what triggers the bug and/or provide access to my system to anyone who
> wants to take a closer look.
>
> At the time the bug occurred, the system was running two I/O intensive
> applications.  The first was the "xfs_fsr" tool to defragment the XFS
> filesystem (it was pushing about 25-30MB/sec to the array), and the rtorrent
> BitTorrent client, which was pulling down a Fedora ISO at about 25mbps, and
> writing to a SATA device, separate from the array where the XFS filesystem
> lives.  In the past, I have seen the rtorrent client create oops errors when
> running within a Xen VM.  This problem only occurred when the guest domU VM
> was running XFS for its root filesystem and the host dom0 that provided the
> filestore for the guest was also running XFS.  (Running the guest on ext3
> stops the oops errors).  It may or may not be related to this error that I'm
> reporting, but considering that the rtorrent app is the same app that was
> running at the time of both segfaults and the oops errors, it makes me
> suspicious and I thought that I'd mention it.

All details other than the torrent traffic are most likely irrelevant. The 
added load could have some significance for window behavior, but that's 
not too likely to make a difference. More important would be to know 
which host the torrent client was communicating with at the time of the 
WARNING and, even more importantly, what "special" thing happened between 
that host and your host in the network. Once that is known, this will be 
easy to reproduce, and the fix might then even turn out to be something 
dead obvious.

> Just to be clear, the bug that I'm reporting is running on a clean system,
> with no Xen installed on it.
> 
> Otherwise, it's really like a bug in real life-- a small creature that is
> annoying but doesn't really harm anything.

What still concerns me in this report is the presence of the other 
stacktrace, which did not seem to be TCP related. Could you confirm/list 
all the problems you have been seeing other than the WARNING 
net/ipv4/tcp_input.c:2539 (+ its stacktrace)? ...I'm asking because it 
might be worth pursuing further.

-- 
 i.