Date:	Thu, 10 Apr 2014 09:14:59 -0600
From:	Bjorn Helgaas <bhelgaas@...gle.com>
To:	"Woodhouse, David" <david.woodhouse@...el.com>
Cc:	"joro@...tes.org" <joro@...tes.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"bhe@...hat.com" <bhe@...hat.com>,
	"jiang.liu@...ux.intel.com" <jiang.liu@...ux.intel.com>,
	"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
	"iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
	"James.Bottomley@...senpartnership.com" 
	<James.Bottomley@...senpartnership.com>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	"scameron@...rdog.cce.hp.com" <scameron@...rdog.cce.hp.com>,
	"davidlohr@...com" <davidlohr@...com>
Subject: Re: hpsa driver bug crack kernel down!

On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
<david.woodhouse@...el.com> wrote:

>> > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
>> > > >> > > > > dmar: DRHD: handling fault status reg 602
>> > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>
> That "Present bit in context entry is clear" fault means that we have
> not set up *any* mappings for this PCI device… on this IOMMU.
>
>> > Yes, specifically (finally done bisecting):
>> >
>> > commit 2e45528930388658603ea24d49cf52867b928d3e
>> > Author: Jiang Liu <jiang.liu@...ux.intel.com>
>> > Date:   Wed Feb 19 14:07:36 2014 +0800
>> >
>> >     iommu/vt-d: Unify the way to process DMAR device scope array
>
> This commit is about how we decide which IOMMU a given PCI device is
> attached to.
>
> Thus, my first guess would be that we are quite happily setting up the
> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> device actually tries to do DMA.
>
> However, I'm not 100% convinced of that. The fault address looks
> suspiciously like a true physical address, not a virtual bus address of
> the type that we'd normally allocate for a dma_map_* operation. Those
> would start at 0xfffff000 and work downwards, typically.

I like the "wrong IOMMU (or no IOMMU at all)" theory.  If we didn't
connect the device with an IOMMU at all, that would explain the device
DMAing directly to a physical address, wouldn't it?
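To make the address argument concrete, here is a minimal sketch in plain userspace C (not kernel code) of the check being made by eye. It assumes, as stated in the quoted text, that dma_map_* allocations typically start at 0xfffff000 and work downwards. The helper name `near_iova_top` and the `window` parameter are hypothetical, and any window size used with it is an arbitrary choice for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest page-aligned 32-bit bus address. Per the thread, dma_map_*
 * allocations typically start here and work downwards. */
#define IOVA_TOP_32BIT 0xfffff000ULL

/*
 * Hypothetical helper: returns true if addr lies within `window` bytes
 * below the top of the 32-bit IOVA space, i.e. it looks like an address
 * handed out by a top-down allocator. `window` is an arbitrary tuning
 * value, not something taken from the IOMMU driver.
 */
bool near_iova_top(uint64_t addr, uint64_t window)
{
    return addr <= IOVA_TOP_32BIT && (IOVA_TOP_32BIT - addr) < window;
}
```

With a 256 MB window (0x10000000), the reported fault address 7f61e000 is not near the top of the range, which fits the suspicion that the device was handed a physical address rather than a bus address from the IOMMU. This is only a heuristic; a heavily used allocator could legitimately hand out lower addresses.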

> Do you have 'iommu=pt' on the kernel command line? Can I see the full
> dmesg as this system boots, and also a copy of the DMAR table?
>
> We should also rate-limit DMA faults, which would avoid the lockup
> failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> that a device is creating an endless stream of DMA faults and isn't
> aborting the transaction?

You mentioned that POWER with EEH does something intelligent in this
case, but I'm not familiar with that code.  We have AER support, which
can result in resetting a device, but I think DMA faults are reported
differently, and I don't think there's any nice existing way for PCI
to deal with them.  Maybe there should be, though.
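As an illustration of the rate-limiting idea only, here is a minimal userspace sketch of an interval/burst limiter. It is not the intel-iommu fault handler and not the kernel's own ratelimit code (the kernel already has `__ratelimit()` and `printk_ratelimited()` in `<linux/ratelimit.h>`). The struct, the function names, and the caller-supplied millisecond clock are all assumptions made so the example stays self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: allow at most `burst` fault reports per
 * `interval_ms` window and count everything that was dropped. */
struct fault_ratelimit {
    uint64_t interval_ms;      /* length of one window */
    unsigned burst;            /* reports allowed per window */
    uint64_t window_start_ms;  /* start of the current window */
    unsigned reported;         /* reports emitted in this window */
    unsigned suppressed;       /* total reports dropped so far */
};

void fault_ratelimit_init(struct fault_ratelimit *rl,
                          uint64_t interval_ms, unsigned burst)
{
    rl->interval_ms = interval_ms;
    rl->burst = burst;
    rl->window_start_ms = 0;
    rl->reported = 0;
    rl->suppressed = 0;
}

/* Returns true if a fault seen at now_ms should be reported.
 * Assumes now_ms comes from a monotonic clock. */
bool fault_ratelimit_allow(struct fault_ratelimit *rl, uint64_t now_ms)
{
    /* Open a fresh window once the current one has expired. */
    if (now_ms - rl->window_start_ms >= rl->interval_ms) {
        rl->window_start_ms = now_ms;
        rl->reported = 0;
    }
    if (rl->reported < rl->burst) {
        rl->reported++;
        return true;
    }
    rl->suppressed++;
    return false;
}
```

A fault handler would call `fault_ratelimit_allow()` before printing each fault and skip the message when it returns false. Note that this only bounds the logging cost; it does nothing to stop the device from generating faults, which is the part that has no obvious existing answer on the PCI side.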

Bjorn