linux-kernel - DMAR regression in 2.6.31 leads to ext4 corruption?


Message-ID: <20091008235631.GZ30557@hexapodia.org>
Date:	Thu, 8 Oct 2009 16:56:31 -0700
From:	Andy Isaacson <adi@...apodia.org>
To:	linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org
Cc:	iommu@...ts.linux-foundation.org
Subject: DMAR regression in 2.6.31 leads to ext4 corruption?

I'm testing DMAR support on 2.6.32 on Intel VT-d laptop platforms.  It
was pretty stable circa 2.6.31-rc5 (we have dozens of machines running
2.6.31-rc8), but in the last two weeks I've had a bunch of instability
on Linus' tip kernels that looked potentially like IOMMU badness.

For example,
<20090928191644.GR12922@...apodia.org>
http://lkml.org/lkml/2009/9/28/201

Today, while running 817b33d38, I got the following on a ThinkPad X200
(I had swapped it in for the Dell, in case the Dell was previously good
hardware going bad).

[   29.450550] EXT4-fs error (device sda1): ext4_lookup: deleted inode referenced: 79
[   30.022328] DRHD: handling fault status reg 3
[   30.022328] DMAR:[DMA Write] Request device [00:02.0] fault addr ddae28000 
[   30.022328] DMAR:[fault reason 05] PTE Write access is not set
[   30.146136] DRHD: handling fault status reg 3
[   30.248938] DMAR:[DMA Write] Request device [00:02.0] fault addr ddae28000 
[   30.248939] DMAR:[fault reason 05] PTE Write access is not set
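
For anyone triaging similar logs, here is a minimal sketch of pulling the DMAR fault reports out of dmesg output and counting them per device. It is not part of the original report. The function and variable names are hypothetical, and the regexes assume the exact message layout quoted above; other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Assumed layout, copied from the dmesg excerpt above.
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (?P<code>\d+)\] (?P<text>.*\S)")

SAMPLE_LOG = """\
[   29.450550] EXT4-fs error (device sda1): ext4_lookup: deleted inode referenced: 79
[   30.022328] DRHD: handling fault status reg 3
[   30.022328] DMAR:[DMA Write] Request device [00:02.0] fault addr ddae28000
[   30.022328] DMAR:[fault reason 05] PTE Write access is not set
[   30.146136] DRHD: handling fault status reg 3
[   30.248938] DMAR:[DMA Write] Request device [00:02.0] fault addr ddae28000
[   30.248939] DMAR:[fault reason 05] PTE Write access is not set
"""


def parse_dmar_faults(lines):
    """Pair each 'Request device' line with the 'fault reason' line after it."""
    faults = []
    pending = None  # most recent request line still waiting for its reason
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            pending = {
                "op": m["op"],
                "device": m["dev"],
                "addr": int(m["addr"], 16),
                "reason": None,
            }
            faults.append(pending)
            continue
        m = REASON_RE.search(line)
        if m and pending is not None:
            pending["reason"] = (int(m["code"]), m["text"])
            pending = None
    return faults


def summarize(faults):
    """Count faults by (device, operation, reason code)."""
    return Counter(
        (f["device"], f["op"], f["reason"][0] if f["reason"] else None)
        for f in faults
    )


if __name__ == "__main__":
    for key, count in summarize(parse_dmar_faults(SAMPLE_LOG.splitlines())).items():
        print(key, count)
```

Running the same parser over a full dmesg capture would show whether every fault comes from 00:02.0 or whether other devices (the AHCI controller, say) fault as well.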

I don't know that DMAR is causing my repeated filesystem
corruption, but it does seem like a potential cause (and would explain
why I'm seeing this whereas most people aren't, since few people are
using VT-d *and* i915).

I see that the BROKEN_GFX_WA code has been removed; do we actually
believe that the relevant code is working?  Could it be corrupting my
AHCI DMAs if not?  At the end of the last thread Ted thought that we'd
lost a write of an inode block; this time the symptoms look different,
in that I don't see one inode block representing a significant data
loss (though I'm by no means an expert).

I've attached some useful info, let me know if I missed anything.

I'll try running with BROKEN_GFX_WA turned back on and see if that
improves things at all.

Thanks,
-andy

View attachment "cpuinfo" of type "text/plain" (1514 bytes)

View attachment "dmesg" of type "text/plain" (104403 bytes)

View attachment "git-id" of type "text/plain" (578 bytes)

View attachment "iomem" of type "text/plain" (2381 bytes)

View attachment "ioports" of type "text/plain" (1445 bytes)

View attachment "kconfig" of type "text/plain" (65356 bytes)

View attachment "meminfo" of type "text/plain" (1114 bytes)

View attachment "fsck.out" of type "text/plain" (2327 bytes)