Message-ID: <MWHPR11MB1886E22D03B14EE0D5725CE08C539@MWHPR11MB1886.namprd11.prod.outlook.com>
Date: Tue, 11 May 2021 09:10:03 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Jean-Philippe Brucker <jean-philippe@...aro.org>,
Li Zefan <lizefan@...wei.com>,
"Jiang, Dave" <dave.jiang@...el.com>,
"Raj, Ashok" <ashok.raj@...el.com>,
Jonathan Corbet <corbet@....net>,
"Jean-Philippe Brucker" <jean-philippe@...aro.com>,
LKML <linux-kernel@...r.kernel.org>,
"iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
"Alex Williamson" <alex.williamson@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Tejun Heo <tj@...nel.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"Wu, Hao" <hao.wu@...el.com>, David Woodhouse <dwmw2@...radead.org>
Subject: RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation
APIs
> From: Jason Gunthorpe
> Sent: Monday, May 10, 2021 8:37 PM
>
[...]
> > gPASID!=hPASID has a problem when assigning a physical device which
> > supports both shared work queue (ENQCMD with PASID in MSR)
> > and dedicated work queue (PASID in device register) to a guest
> > process which is associated to a gPASID. Say the host kernel has setup
> > the hPASID entry with nested translation though /dev/ioasid. For
> > shared work queue the CPU is configured to translate gPASID in MSR
> > into **hPASID** before the payload goes out to the wire. However
> > for dedicated work queue the device MMIO register is directly mapped
> > to and programmed by the guest, thus containing a **gPASID** value
> > implying DMA requests through this interface will hit IOMMU faults
> > due to invalid gPASID entry. Having gPASID==hPASID is a simple
> > workaround here. mdev doesn't have this problem because the
> > PASID register is in emulated control-path thus can be translated
> > to hPASID manually by mdev driver.
>
> This all must be explicit too.
>
> If a PASID is allocated and it is going to be used with ENQCMD then
> everything needs to know it is actually quite different than a PASID
> that was allocated to be used with a normal SRIOV device, for
> instance.
>
> The former case can accept that the guest PASID is virtualized, while
> the latter can not.
>
> This is also why PASID per RID has to be an option. When I assign a
> full SRIOV function to the guest then that entire RID space needs to
> also be assigned to the guest. Upon migration I need to take all the
> physical PASIDs and rebuild them in another hypervisor exactly as is.
>
> If you force all RIDs into a global PASID pool then normal SRIOV
> migration w/PASID becomes impossible. ie ENQCMD breaks everything else
> that should work.
>
> This is why you need to sort all this out and why it feels like some
> of the specs here have been mis-designed.
>
> I'm not sure carving out ranges is really workable for migration.
>
> I think the real answer is to carve out entire RIDs as being in the
> global pool or not. Then the ENQCMD HW can be bundled together and
> everything else can live in the natural PASID per RID world.
>
OK. Here is the revised scheme, with everything made explicit.
There are three scenarios to be considered:
1) SR-IOV (AMD/ARM):
- "PASID per RID" with guest-allocated PASIDs;
- PASID table managed by guest (in GPA space);
- the entire PASID space delegated to guest;
- no need to explicitly register guest-allocated PASIDs to host;
- uAPI for attaching PASID table:

    // set to "PASID per RID"
    ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

    // When Qemu captures a new PASID table through vIOMMU
    pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);

    // Set the PASID table to the RID associated with pasidtbl_ioasid
    ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
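
For concreteness, here is a rough C sketch of how a VMM could drive this
scenario-1 sequence. The ioctl names are only the ones proposed above
(nothing here is merged uAPI), and the error handling plus the flat
argument form are illustrative assumptions; a real interface would
presumably pass argument structs.

/* Illustrative only: scenario-1 flow with the proposed (unmerged) uAPI. */
static int attach_guest_pasid_table(int ioasid_fd, int device_fd,
                                    unsigned long pasid_table_gpa)
{
        int pasidtbl_ioasid;

        /* the whole per-RID PASID space is delegated to the guest */
        if (ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL) < 0)
                return -1;

        /* one ioasid object stands for the guest PASID table */
        pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
        if (pasidtbl_ioasid < 0)
                return -1;

        if (ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid) < 0)
                return -1;

        /* point the RID at the guest PASID table in GPA space */
        return ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid,
                     pasid_table_gpa);
}
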
2) SR-IOV, no ENQCMD (Intel):
- "PASID per RID" with guest-allocated PASIDs;
- PASID table managed by host (in HPA space);
- the entire PASID space delegated to guest too;
- host must be explicitly notified of guest-allocated PASIDs;
- uAPI for binding user-allocated PASIDs:

    // set to "PASID per RID"
    ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

    // When Qemu captures a new PASID allocated through vIOMMU
    pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

    // Tell the kernel to associate pasid with pgtbl_ioasid in its internal
    // structure; &pasid is a pointer due to a requirement in scenario 3
    ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);

    // Set the guest page table for the RID+pasid associated with pgtbl_ioasid
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
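
A minimal sketch of scenario 2 as a helper, under the same caveat that
these are only the proposed ioctls: the gPASID captured from the vIOMMU
is registered as-is (gPASID == hPASID), which is what keeps the
dedicated-WQ case working.

/* Illustrative only: bind a guest page table to a guest-allocated PASID.
 * Assumes IOASID_SET_HWID_MODE(IOASID_HWID_LOCAL) was already issued
 * once on ioasid_fd. */
static int bind_guest_pgtable(int ioasid_fd, int device_fd,
                              unsigned int pasid, unsigned long pgtable_gpa)
{
        int pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

        if (pgtbl_ioasid < 0)
                return -1;
        if (ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid) < 0)
                return -1;

        /* register the guest-allocated PASID; in this mode the kernel
         * uses it directly as the hardware PASID */
        if (ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid) < 0)
                return -1;

        /* install the guest page table for RID+pasid (nested translation) */
        return ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid,
                     pgtable_gpa);
}
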
3) SR-IOV, ENQCMD (Intel):
- "PASID global" with host-allocated PASIDs;
- PASID table managed by host (in HPA space);
- all RIDs bound to this ioasid_fd use the global pool;
- however, exposing global PASIDs to the guest breaks migration;
- hybrid scheme: split the PASID space into a local range and a global range;
- force the guest to use only the local PASID range (through vIOMMU);
- for ENQCMD, configure the CPU to translate local->global;
- for non-ENQCMD, set up both local and global PASID entries;
- uAPI for the range split and CPU PASID mapping:

    // set to "PASID global"
    ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);

    // Split local/global ranges, applying to all RIDs in this fd.
    // Example: local [0, 1024), global [1024, max).
    // The local PASID range is managed by the guest and migrated as VM state;
    // global PASIDs are re-allocated and mapped to local PASIDs after migration.
    ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);

    // When Qemu captures a new local_pasid allocated through vIOMMU
    pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

    // Tell the kernel to associate local_pasid with pgtbl_ioasid in its
    // internal structure. Due to HWID_GLOBAL, the kernel also allocates a
    // global_pasid from the global pool. From now on, every hwid-related
    // operation must be applied to both PASIDs for this page table.
    // global_pasid is returned to userspace in the same field as local_pasid.
    ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &local_pasid);

    // Qemu then registers the local_pasid/global_pasid pair with KVM to set
    // up the CPU PASID translation table
    ioctl(kvm_fd, KVM_SET_PASID_MAPPING, local_pasid, global_pasid);

    // Set the guest page table for the RID+{local_pasid, global_pasid}
    // associated with pgtbl_ioasid
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
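
And a corresponding sketch of the per-page-table part of scenario 3,
again purely illustrative: it assumes the fd was already switched to
IOASID_HWID_GLOBAL with IOASID_HWID_SET_GLOBAL_MIN(1024) as above, and
that IOASID_SET_HWID returns the global PASID in the same field, as
described in the comments.

/* Illustrative only: scenario-3 (ENQCMD) flow, proposed (unmerged) uAPI. */
static int bind_guest_pgtable_enqcmd(int ioasid_fd, int device_fd, int kvm_fd,
                                     unsigned int local_pasid,
                                     unsigned long pgtable_gpa)
{
        unsigned int hwid = local_pasid;
        unsigned int global_pasid;
        int pgtbl_ioasid;

        pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
        if (pgtbl_ioasid < 0)
                return -1;
        if (ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid) < 0)
                return -1;

        /* kernel records local_pasid and allocates a global_pasid,
         * returned in the same field */
        if (ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &hwid) < 0)
                return -1;
        global_pasid = hwid;

        /* CPU PASID translation: MSR gPASID (local) -> hPASID (global) */
        if (ioctl(kvm_fd, KVM_SET_PASID_MAPPING, local_pasid,
                  global_pasid) < 0)
                return -1;

        /* the guest page table applies to both PASID entries for this RID */
        return ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid,
                     pgtable_gpa);
}
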
-----
Notes:
I tried to keep the common commands in a generic format across all scenarios,
while introducing new commands for usage-specific requirements. Everything is
made explicit now.
Userspace has sufficient information to choose its desired scheme based on
the vIOMMU type and platform information (e.g. whether ENQCMD is exposed in
the virtual CPUID, whether assigned devices support DMWr, etc.).
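
For example, whether ENQCMD can be exposed at all can be probed on the
host side. A trivial check: the feature bit is CPUID.(EAX=7,ECX=0):ECX[29];
everything else here, including the function name, is just illustrative.

#include <cpuid.h>
#include <stdbool.h>

/* Illustrative probe: does the host CPU support ENQCMD? */
static bool host_has_enqcmd(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return false;
        return ecx & (1u << 29);        /* CPUID.(7,0):ECX bit 29 = ENQCMD */
}
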
The above example assumes one RID per bound page table, because the vIOMMU
identifies new guest page tables per RID. If other usages require multiple
RIDs per page table, SET_HWID/BIND_PGTABLE could accept an additional
device_handle parameter to specify which RID is targeted by the operation
(one possible shape is sketched below).
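
A purely hypothetical shape for such an extension, just to make the idea
concrete (none of these field names exist anywhere; __u32 is only used
because this would be a UAPI struct):

/* Hypothetical extension: let SET_HWID name the target RID explicitly
 * when several RIDs are attached to the same pgtbl_ioasid. */
struct ioasid_set_hwid_args {
        __u32 ioasid;           /* pgtbl_ioasid */
        __u32 device_handle;    /* which attached RID this hwid applies to */
        __u32 hwid;             /* in: (local) pasid; out: global pasid */
        __u32 flags;
};

// ioctl(ioasid_fd, IOASID_SET_HWID, &args);
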
When considering SIOV/mdev there is no change to the above uAPI sequences.
Scenario 1) is n/a since SIOV requires the PASID table in HPA space, and
nothing changes in 3) regarding the split-range scheme. The only conceptual
change is in 2): although it is still "PASID per RID", the PASIDs must be
managed by the host because the parent driver also allocates PASIDs from the
per-RID space to mark each mdev (RID+PASID). But this difference doesn't
change the uAPI flow - just treat the user-provided PASID as 'virtual',
allocate a 'real' PASID at IOASID_SET_HWID, and then always use the real one
when programming the PASID entry (IOASID_BIND_PGTABLE) or the device PASID
register (translated in the mediation path).
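
To illustrate that last point, the mediation path could look roughly like
this; the names are made up and only the shape matters, i.e. the emulated
register write is where the 'virtual' PASID gets swapped for the 'real' one:

/* Hypothetical mdev parent-driver snippet: emulated PASID register write. */
static void mdev_wq_pasid_write(struct my_mdev_state *mstate, u32 guest_pasid)
{
        u32 real_pasid;

        /* mapping recorded when userspace issued IOASID_SET_HWID */
        real_pasid = mdev_lookup_real_pasid(mstate, guest_pasid);

        /* the physical WQ register only ever sees the real PASID */
        writel(real_pasid, mstate->wq_mmio + WQ_PASID_REG_OFFSET);
}
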
If all of the above works reasonably, we don't even need the special VCMD
interface in VT-d for the guest to allocate PASIDs from the host. Just always
let the guest manage its own PASIDs (restricted to the available local PASID
range), whether it implements a global or a per-RID allocator; the vIOMMU
side simply sticks to per-RID emulation according to the spec.
Thanks
Kevin