Message-ID: <87v8g9qr2z.fsf@nvidia.com>
Date: Wed, 31 May 2023 10:30:48 +1000
From: Alistair Popple <apopple@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Sean Christopherson <seanjc@...gle.com>,
Robin Murphy <robin.murphy@....com>,
Andrew Morton <akpm@...ux-foundation.org>, will@...nel.org,
catalin.marinas@....com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, nicolinc@...dia.com,
linux-arm-kernel@...ts.infradead.org, kvm@...r.kernel.org,
John Hubbard <jhubbard@...dia.com>, zhi.wang.linux@...il.com
Subject: Re: [PATCH 2/2] arm64: Notify on pte permission upgrades
Jason Gunthorpe <jgg@...dia.com> writes:
> On Tue, May 30, 2023 at 02:44:11PM +0100, Robin Murphy wrote:
>> On 30/05/2023 1:52 pm, Jason Gunthorpe wrote:
>> > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote:
>> > > On 2023-05-30 12:54, Jason Gunthorpe wrote:
>> > > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote:
>> > > > >
>> > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates
>> > > > > > > it will continue to return read-only entries to a device even though
>> > > > > > > the CPU page table contains a writable entry. This leads to a
>> > > > > > > continually faulting device and no way of handling the fault.
>> > > > > >
>> > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should
>> > > > > > just have the iommu driver push an iotlb invalidation command before
>> > > > > > it acks it? PRI is already really slow so I'm not sure a pipelined
>> > > > > > invalidation is going to be a problem? Does the SMMU architecture
>> > > > > > permit negative caching which would suggest we need it anyhow?
>> > > > >
>> > > > > Yes, the SMMU architecture (which matches the ARM architecture with
>> > > > > regard to TLB maintenance requirements) permits negative caching of
>> > > > > some mapping attributes, including the read-only attribute. Hence
>> > > > > without the flushing we fault continuously.
>> > > >
>> > > > Sounds like a straight up SMMU bug, invalidate the cache after
>> > > > resolving the PRI event.
>> > >
>> > > No, if the IOPF handler calls back into the mm layer to resolve the fault,
>> > > and the mm layer issues an invalidation in the process of that which isn't
>> > > propagated back to the SMMU (as it would be if BTM were in use), logically
>> > > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra
>> > > mostly-redundant invalidations just because different CPU architectures have
>> > > different idiosyncrasies around caching of permissions.
>> >
>> > The mm has a definition for invalidate_range that does not include all
>> > the invalidation points SMMU needs. This is difficult to sort out
>> > because this is general purpose cross arch stuff.
>> >
>> > You are right that this is worth optimizing, but right now we have a
>> > -rc bug that needs fixing and adding and extra SMMU invalidation is a
>> > straightforward -rc friendly way to address it.
>>
>> Sure; to clarify, I'm not against the overall idea of putting a hack in the
>> SMMU driver with a big comment that it is a hack to work around missing
>> notifications under SVA, but it would not constitute an "SMMU bug" to not do
>> that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some
>> other arm64 hypervisor driver wanted to do something funky with notifiers to
>> shadow stage 1 permissions for some reason, it would presumably be equally
>> affected.
>
> Okay, Alistair can you make this?
Right, I agree this isn't an SMMU bug. I could add the hack to the
SMMU driver, but it doesn't address my issue because we're using SVA
without PRI. So I'd much rather update the MM to keep the SVA IOMMU
TLBs in sync.
So I'd rather keep the invalidate in ptep_set_access_flags(). Would
renaming invalidate_range() to invalidate_arch_secondary_tlb() along
with clearing up the documentation make that more acceptable, at least
in the short term?
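
To make that concrete, here is a rough sketch of the arm64 side of what
I'm proposing (illustrative only, not the exact patch I posted; the
existing cmpxchg loop that merges the flags into the PTE is elided):

int ptep_set_access_flags(struct vm_area_struct *vma,
			  unsigned long address, pte_t *ptep,
			  pte_t entry, int dirty)
{
	pte_t pte = READ_ONCE(*ptep);

	if (pte_same(pte, entry))
		return 0;

	/* only preserve the access flags and write permission */
	pte_val(entry) &= PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;

	/* ... existing cmpxchg loop updating the hardware PTE ... */

	/* Invalidate a stale read-only entry */
	if (dirty) {
		flush_tlb_page(vma, address);
		/*
		 * Sketch: the CPU TLB flush above is not broadcast to
		 * secondary TLBs sharing this mm (e.g. an SMMU doing SVA
		 * without BTM), so notify them explicitly so they drop
		 * the stale read-only entry as well.
		 */
		mmu_notifier_invalidate_range(vma->vm_mm, address,
					      address + PAGE_SIZE);
	}

	return 1;
}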
> On Tue, May 30, 2023 at 02:44:40PM -0700, Sean Christopherson wrote:
>> > KVM already has locking for invalidate_start/end - it has to check
>> > mmu_notifier_retry_cache() with the sequence numbers/etc around when
>> > it does hva_to_pfn()
>> >
>> > The bug is that the kvm_vcpu_reload_apic_access_page() path is
>> > ignoring this locking so it ignores in-progress range
>> > invalidations. It should spin until the invalidation clears like other
>> > places in KVM.
>> >
>> > The comment is kind of misleading because drivers shouldn't be abusing
>> > the iommu centric invalidate_range() thing to fix missing locking in
>> > start/end users. :\
>> >
>> > So if KVM could be fixed up we could make invalidate_range defined to
>> > be an arch specific callback to synchronize the iommu TLB.
>>
>> And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make
>> it super obvious that they are intended for two different purposes? E.g. instead
>> of invalidate_range(), something like invalidate_secondary_tlbs().
>
> Yeah, I think I would call it invalidate_arch_secondary_tlb() and
> document it as being an arch specific set of invalidations that match
> the architected TLB maintenance requirements. And maybe we can check it
> more carefully to make it be called in less places. Like I'm not sure
> it is right to call it from invalidate_range_end under this new
> definition..
I'd be happy to look at that, although it sounds like Sean already is.
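
For the record, the sort of rename I have in mind is along these lines
in include/linux/mmu_notifier.h (a sketch only; the name is taken from
Sean's suggestion and the final one may differ, the signature is the
same as the current .invalidate_range callback):

struct mmu_notifier_ops {
	/* ... existing callbacks unchanged ... */

	/*
	 * Formerly .invalidate_range: invoked wherever the architecture's
	 * TLB maintenance rules require secondary TLBs that share the CPU
	 * page tables (e.g. IOMMU SVA) to drop cached translations,
	 * including on RO->RW permission upgrades. Not intended as a
	 * substitute for invalidate_range_start/end locking in callers.
	 */
	void (*invalidate_secondary_tlbs)(struct mmu_notifier *subscription,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end);
};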
>> FWIW, PPC's OpenCAPI support (drivers/misc/ocxl/link.c) also uses invalidate_range().
>> Though IIUC, the use case is the same as a "traditional" IOMMU, where a device can
>> share the CPU's page tables, so maybe the devices can be considered IOMMUs in practice,
>> if not in name?
>
> OpenCAPI is an IOMMU HW for sure. PPC just doesn't have integration
> with the drivers/iommu infrastructure.
Yep, it sure is. I worked on that a fair bit when it was first being
brought up. It doesn't suffer from this problem because it follows the
PPC MMU architecture, which doesn't require TLB invalidates for RO->RW
upgrades. It's a pity it was never integrated with the rest of the
drivers/iommu infrastructure though.
>> I have patches coded up. Assuming testing goes well, I'll post them regardless
>> of the OCXL side of things. I've disliked KVM's one-off use of invalidate_range()
>> for a long time, this is a good excuse to get rid of it before KVM gains more usage.
Feel free to CC me, I'd be happy to review them and can probably help
with the OCXL side of things.
> Nice!
>
> Thanks,
> Jason