linux-kernel - Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20231213104417.GA31964@wunner.de>
Date:   Wed, 13 Dec 2023 11:44:17 +0100
From:   Lukas Wunner <lukas@...ner.de>
To:     Ethan Zhao <haifeng.zhao@...ux.intel.com>
Cc:     bhelgaas@...gle.com, baolu.lu@...ux.intel.com, dwmw2@...radead.org,
        will@...nel.org, robin.murphy@....com, linux-pci@...r.kernel.org,
        iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
        Haorong Ye <yehaorong@...edance.com>
Subject: Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when
 device is disconnected

On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register,

Well, users could just *unplug* the device, right?  Why is it relevant
that thay could fiddle with registers in config space?

> as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

A completion timeout should be on the order of usecs or msecs, why does it
cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
delay between hot removal and watchdog reaction.

> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in

This doesn't seem to be a proper fix.  It will work most of the time
but not always.  A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards.  In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.

Thanks,

Lukas