Message-ID: <ZBtoj3deE2Y6k9lq@Asurada-Nvidia>
Date:   Wed, 22 Mar 2023 13:43:59 -0700
From:   Nicolin Chen <nicolinc@...dia.com>
To:     Jason Gunthorpe <jgg@...dia.com>
CC:     "Tian, Kevin" <kevin.tian@...el.com>,
        Robin Murphy <robin.murphy@....com>,
        "will@...nel.org" <will@...nel.org>,
        "eric.auger@...hat.com" <eric.auger@...hat.com>,
        "baolu.lu@...ux.intel.com" <baolu.lu@...ux.intel.com>,
        "joro@...tes.org" <joro@...tes.org>,
        "shameerali.kolothum.thodi@...wei.com" 
        <shameerali.kolothum.thodi@...wei.com>,
        "jean-philippe@...aro.org" <jean-philippe@...aro.org>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add
 arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 04:41:32PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 12:21:27PM -0700, Nicolin Chen wrote:
> 
> > Do you prefer this to happen with this series? 
> 
> No, I just don't want to exclude doing it someday if people are
> interested in optimizing this. As I said in the other thread, I'd rather
> optimize SMMUv3 emulation than try to use virtio-iommu to make it run
> faster.

Got it. I will then just focus on reworking the invalidation
data structure with a list of command queue info.
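
Roughly what I'm picturing is something like this (struct and field
names below are placeholders, not actual uAPI -- just to capture the
"array of commands" idea):

#include <linux/types.h>

/*
 * Placeholder sketch of a per-HWPT invalidation request that carries an
 * array of SMMUv3 CMDQ commands instead of a single command. Not the
 * real structure from this series, only an illustration of the shape.
 */
struct fake_hwpt_invalidate_arm_smmuv3 {
	__u64 cmdq_uptr;	/* user pointer to an array of 16-byte commands */
	__u32 cmdq_entry_size;	/* size of each command, for extensibility */
	__u32 cmdq_entry_num;	/* number of commands in the array */
};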

> > the uAPI would be completely compatible. It seems to me that
> > we would need a different uAPI, so as to set up a queue at an
> > earlier stage, and then to ring a bell when QEMU traps any
> > incoming commands in the emulated VCMDQ.
> 
> Yes, it would need more uAPI. Let's just make sure there is room, and
> maybe think a bit about what it would look like.
> 
> You should also draft through the HW vCMDQ stuff to ensure it fits
> in here nicely.

Yes.
  
> > > > Btw, just to confirm my understanding, a use case having two
> > > > or more iommu_domains means an S2 iommu_domain replacement,
> > > > right? I.e. a running S2 iommu_domain gets replaced on the fly
> > > > by a different S2 iommu_domain holding a different VMID, while
> > > > the IOAS still has the previous mappings? When would that
> > > > actually happen in the real world?
> > > 
> > > It doesn't have to be a replace - what is needed is that every vPCI
> > > device connected to the same SMMU instance be using the same S2 and
> > > thus the same VM_ID.
> > > 
> > > IOW every SID must be linked to the same VM_ID or invalidation commands
> > > will not be properly processed.
> > > 
> > > qemu would have to have multiple SMMU instances according to S2
> > > domains, which is probably true anyhow since we need to know which
> > > physical SMMU instance to deliver the invalidation to.
> > 
> > I am not 100% following this part. So, you mean that we're
> > safe if we only have one SMMU instance, because there'd be
> > only one S2 domain, while multiple S2 domains would happen
> > if we have multiple SMMU instances?
> 
> Yes, that would happen today, especially since each smmu has its own
> vm_id allocator IIRC
>  
> > Can we still use the same S2 domain for multiple instances?
> 
> I think not today.
> 
> At the core, if we share the same S2 domain then it is a problem to
> figure out which SMMU instance to send the invalidation command to. E.g.
> if the userspace invalidates ASID 1 you'd have to replicate the
> invalidation to all SMMU instances, even if ASID 1 is used by only a
> single SID/STE that has a single SMMU instance backing it.

Oh, right. That would potentially be a performance hit from unnecessary
IOTLB misses, because with a single vSMMU instance QEMU has to broadcast
every invalidation to all physical SMMU instances.
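
To spell that out, a single-vSMMU VMM would have to fan every guest
invalidation out in a loop, roughly like this (illustrative only; the
per-instance command array and the ioctl request number are stand-ins,
assuming one S2 HWPT per physical SMMU):

#include <sys/ioctl.h>

/*
 * Illustrative only: with one vSMMU fronting N physical SMMUs, every guest
 * TLBI has to be replicated to each physical instance's S2 HWPT, i.e. one
 * kernel round trip per instance for every single command.
 */
static int vsmmu_broadcast_invalidate(int iommufd, unsigned long invalidate_ioctl,
				      void *per_smmu_cmds[], int nr_phys_smmus)
{
	for (int i = 0; i < nr_phys_smmus; i++) {
		/* per_smmu_cmds[i] targets the S2 HWPT of physical SMMU i */
		int rc = ioctl(iommufd, invalidate_ioctl, per_smmu_cmds[i]);
		if (rc)
			return rc;	/* propagate the first failure */
	}
	return 0;
}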

> So I think for ARM we want to reflect the physical SMMU instances into
> vSMMU instances, and that feels best done by having a unique S2
> iommu_domain for each SMMU instance. Then we know that an invalidation
> for an SMMU instance is delivered to that S2's singular CMDQ, and things
> like vCMDQ become possible.

In that environment, do we still need VMID unification?
 
> > Our approach of setting up a stage-2 mapping in QEMU is to
> > map the entire guest memory. I don't see a point in having
> > a separate S2 domain, even if there are multiple instances?
> 
> And then this is the drawback: we don't really want to have duplicated
> S2 page tables in the system for every stage 2.
> 
> Maybe we have made a mistake by allowing the S2 to be an unmanaged
> domain. Perhaps we should create the S2 out of an unmanaged domain
> like the S1.
> 
> Then the rules could be
>  - Unmanaged domain can be used with every smmu instance, only one
>    copy of the page table. The ASID in the iommu_domain is
>    kernel-global
>  - S2 domain is a child of a shared unmanaged domain. It can be used
>    only with the SMMU it is associated with, and it has a per-SMMU VM ID
>  - S1 domain is a child of an S2 domain; it can be used only with the
>    SMMU its S2 is associated with, just because

So the actual S2 page table has to stay in the unmanaged domain for
IOAS_MAP, while we maintain an s2_cfg data structure in a shadow S2
domain per SMMU instance, each having its own VMID but sharing the same
S2 page table pointer?

Hmm... feels very complicated to me. How does that help?
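
Just to check that I'm picturing the same layering, a very rough sketch
(none of these structs or fields exist; the names are made up):

#include <linux/iommu.h>
#include <linux/types.h>

/* Made-up structs, only to illustrate the proposed parent/child layering */

struct fake_unmanaged_domain {		/* kernel-global, one per IOAS */
	struct iommu_domain domain;
	void *s2_pgtbl;			/* the single copy of the S2 page table */
};

struct fake_s2_domain {			/* one per physical SMMU instance */
	struct iommu_domain domain;
	struct fake_unmanaged_domain *parent;	/* shares parent's page table */
	u16 vmid;			/* per-SMMU VMID */
};

struct fake_s1_domain {			/* guest-owned stage-1 */
	struct iommu_domain domain;
	struct fake_s2_domain *s2;	/* ties it to that one SMMU instance */
};

If that's the shape of it, I can see how an invalidation follows the
S1->S2 link to a single instance's CMDQ, but it still looks like a lot
of plumbing to me.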

> > Btw, from a private discussion with Eric, he expressed the
> > difficulty of adding multiple SMMU instances in QEMU, as it
> > would complicate the device and ACPI components. 
> 
> I'm not surprised by this, but for efficiency we probably have to do
> this. Eric, am I wrong?
> 
> qemu shouldn't have to do it immediately, but the kernel uAPI should
> allow for a VMM that is optimized. We shouldn't exclude this by
> mis-designing the kernel uAPI. qemu can replicate the invalidations
> itself to make an inefficient single vSMMU.
> 
> > For VCMDQ, we do need a multi-instance environment, because there
> > are multiple physical pairs of SMMU+VCMDQ, i.e. multiple VCMDQ MMIO
> > regions being attached/used by different devices. 
> 
> Yes. IMHO vCMDQ is the sane design here - invalidation performance is
> important, and having a kernel-bypass way to do it is ideal. I understand
> AMD has a similar kernel-bypass queue approach for their stuff too. I
> think everyone will eventually need to do this, especially for CC
> applications. Having the hypervisor able to interfere with
> invalidation feels like an attack vector.
> 
> So we should focus on long term designs that allow kernel-bypass to
> work, and I don't see a way to hide multi-instance and still truly
> support vCMDQ?

Well, I agree and hope people across the board decide to move
towards the multi-instance direction.

> > So, I have been exploring a different approach by creating an
> > internal multiplication inside VCMDQ...
> 
> How can that work?
> 
> You'd have to have the guest VM know to replicate to different
> vCMDQs? Which isn't the standard SMMU programming model anymore...

VCMDQ has multiple VINTFs (Virtual Interfaces) that are supposed
to be exposed by the host to multiple VMs.

In a multi-SMMU environment, every single SMMU+VCMDQ instance
would have only one VINTF, containing one or more VCMDQs. In
this case, passthrough devices behind different physical SMMU
instances are straightforwardly attached to different vSMMUs.

However, if we can't have multiple vSMMU instances, the guest
VM (its virtual HW) would enable multiple VINTFs corresponding
to the number of physical SMMU/VCMDQ instances, for devices to
attach to accordingly. That means I need to figure out a way to
pin the devices onto those VINTFs by somehow passing in their
physical SMMU IDs. The latest progress that I've made is a bit
of a hack in the DSDT table: inserting a physical SMMU ID into
every single passthrough device node, though I still need to
confirm the legality of doing that...
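
For the guest side of that hack, I'm imagining something along these
lines (the property name and helper are entirely made up, just to show
the kind of per-device hint I mean):

#include <linux/device.h>
#include <linux/property.h>

/*
 * Hypothetical guest-side lookup: read the physical SMMU ID that the VMM
 * injected into the passthrough device's DSDT node, and use it to pick
 * the matching VINTF. Property name and mapping are illustrative only.
 */
static int guest_pick_vintf(struct device *dev, u32 *vintf_idx)
{
	u32 phys_smmu_id;
	int ret;

	ret = device_property_read_u32(dev, "nvidia,phys-smmu-id",
				       &phys_smmu_id);
	if (ret)
		return ret;

	/* Assume VINTFs are indexed in the same order as physical SMMUs */
	*vintf_idx = phys_smmu_id;
	return 0;
}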

Thanks
Nic
