Message-ID:
 <PAWPR08MB8909A29147335BD40A829A779FC82@PAWPR08MB8909.eurprd08.prod.outlook.com>
Date: Tue, 4 Mar 2025 07:01:50 +0000
From: Wathsala Wathawana Vithanage <wathsala.vithanage@....com>
To: Alex Williamson <alex.williamson@...hat.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, nd
	<nd@....com>, Jason Gunthorpe <jgg@...pe.ca>, Kevin Tian
	<kevin.tian@...el.com>, Philipp Stanner <pstanner@...hat.com>, Yunxiang Li
	<Yunxiang.Li@....com>, "Dr. David Alan Gilbert" <linux@...blig.org>, Ankit
 Agrawal <ankita@...dia.com>, "open list:VFIO DRIVER" <kvm@...r.kernel.org>,
	Jeremy Linton <Jeremy.Linton@....com>, Honnappa Nagarahalli
	<Honnappa.Nagarahalli@....com>, Dhruv Tripathi <Dhruv.Tripathi@....com>
Subject: RE: [RFC PATCH] vfio/pci: add PCIe TPH to device feature ioctl

> Subject: Re: [RFC PATCH] vfio/pci: add PCIe TPH to device feature ioctl
> 
> On Mon, 3 Mar 2025 04:20:24 +0000
> Wathsala Wathawana Vithanage <wathsala.vithanage@....com> wrote:
> 
> > Hi Alex,
> >
> > > > Linux v6.13 introduced the PCIe TLP Processing Hints (TPH) feature
> > > > for direct cache injection. As described in the relevant patch set
> > > > [1], direct cache injection in supported hardware allows optimal
> > > > platform resource utilization for specific requests on the PCIe
> > > > bus. This feature is currently available only for kernel device drivers.
> > > > However, user space applications, especially those whose
> > > > performance is sensitive to the latency of inbound writes as seen
> > > > by a CPU core, may benefit from using this information (E.g., DPDK
> > > > cache stashing RFC [2] or an HPC application running in a VM).
> > > >
> > > > This patch enables configuring of TPH from the user space via
> > > > VFIO_DEVICE_FEATURE IOCLT. It provides an interface to user space
> > > > drivers and VMMs to enable/disable the TPH feature on PCIe devices
> > > > and set steering tags in MSI-X or steering-tag table entries using
> > > > VFIO_DEVICE_FEATURE_SET flag or read steering tags from the kernel
> > > > using VFIO_DEVICE_FEATURE_GET to operate in device-specific mode.
> > >
> > > Unless I'm missing it, the RFC in [2] doesn't make use of this
> > > proposal.  Is there published code anywhere that does use this interface?
> > >
> > > > [1]
> > > > lore.kernel.org/linux-pci/20241002165954.128085-1-wei.huang2@....c
> > > > om
> > > > [2]
> > > > inbox.dpdk.org/dev/20241021015246.304431-2-wathsala.vithanage@....
> > > > com
> > > >
> >
> > The DPDK RFC in question is on hold for now because there is no way to
> > get steering-tags from the user space.
> > Consequently, I had to stop working on [2] and start working on the
> > kernel to get VFIO ready for it. This was discussed in a DPDK community call
> sometime back.
> > https://mails.dpdk.org/archives/dev/2024-October/305408.html
> 
> I think the userspace and kernel support need to be co-developed, we don't want
> to be adding kernel code and uAPIs that don't have users.
> 


I will update the [2] RFC along with a v2 of this RFC to use the uAPI discussed here.


> > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@....com>
> > > > ---
> > > >  drivers/vfio/pci/vfio_pci_core.c | 163
> +++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h        |  68 +++++++++++++
> > > >  2 files changed, 231 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/pci/vfio_pci_core.c
> > > > b/drivers/vfio/pci/vfio_pci_core.c
> > > > index 586e49efb81b..d6dd0495b08b 100644
> > > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > > @@ -29,6 +29,7 @@
> > > >  #include <linux/nospec.h>
> > > >  #include <linux/sched/mm.h>
> > > >  #include <linux/iommufd.h>
> > > > +#include <linux/pci-tph.h>
> > > >  #if IS_ENABLED(CONFIG_EEH)
> > > >  #include <asm/eeh.h>
> > > >  #endif
> > > > @@ -1510,6 +1511,165 @@ static int
> > > > vfio_pci_core_feature_token(struct
> > > vfio_device *device, u32 flags,
> > > >  	return 0;
> > > >  }
> > > >
> > > > +static ssize_t vfio_pci_tph_uinfo_dup(struct vfio_pci_tph *tph,
> > > > +				      void __user *arg, size_t argsz,
> > > > +				      struct vfio_pci_tph_info **info) {
> > > > +	size_t minsz;
> > > > +
> > > > +	if (tph->count > VFIO_TPH_INFO_MAX)
> > > > +		return -EINVAL;
> > > > +	if (!tph->count)
> > > > +		return 0;
> > > > +
> > > > +	minsz = tph->count * sizeof(struct vfio_pci_tph_info);
> > > > +	if (minsz < argsz)
> > > > +		return -EINVAL;
> > > > +
> > > > +	*info = memdup_user(arg, minsz);
> > > > +	if (IS_ERR(info))
> > > > +		return PTR_ERR(info);
> > > > +
> > > > +	return minsz;
> > > > +}
> > > > +
> > > > +static int vfio_pci_feature_tph_st_op(struct vfio_pci_core_device *vdev,
> > > > +				      struct vfio_pci_tph *tph,
> > > > +				      void __user *arg, size_t argsz) {
> > > > +	int i, mtype, err = 0;
> > > > +	u32 cpu_uid;
> > > > +	struct vfio_pci_tph_info *info = NULL;
> > > > +	ssize_t data_size = vfio_pci_tph_uinfo_dup(tph, arg, argsz,
> > > > +&info);
> > > > +
> > > > +	if (data_size <= 0)
> > > > +		return data_size;
> > > > +
> > > > +	for (i = 0; i < tph->count; i++) {
> > > > +		if (!(info[i].cpu_id < nr_cpu_ids &&
> > > > +cpu_present(info[i].cpu_id))) {
> > >
> > > Not intuitive logic, imo.  This could easily be:
> > >
> > > 		if (info[i].cpu_id >= nr_cpu_ids || !cpu_present(info[i].cpu_id))
> > >
> >
> > Agree, looks more intuitive.
> >
> > > > +			info[i].err = -EINVAL;
> > > > +			continue;
> > > > +		}
> > > > +		cpu_uid = topology_core_id(info[i].cpu_id);
> > > > +		mtype = (info[i].flags & VFIO_TPH_MEM_TYPE_MASK) >>
> > > > +			VFIO_TPH_MEM_TYPE_SHIFT;
> > > > +
> > > > +		/* processing hints are always ignored */
> > > > +		info[i].ph_ignore = 1;
> > > > +
> > > > +		info[i].err = pcie_tph_get_cpu_st(vdev->pdev, mtype, cpu_uid,
> > > > +						  &info[i].st);
> > > > +		if (info[i].err)
> > > > +			continue;
> > > > +
> > > > +		if (tph->flags & VFIO_DEVICE_FEATURE_TPH_SET_ST) {
> > > > +			info[i].err = pcie_tph_set_st_entry(vdev->pdev,
> > > > +							    info[i].index,
> > > > +							    info[i].st);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	if (copy_to_user(arg, info, data_size))
> > > > +		err = -EFAULT;
> > > > +
> > > > +	kfree(info);
> > > > +	return err;
> > > > +}
> > > > +
> > > > +
> > > > +static int vfio_pci_feature_tph_enable(struct vfio_pci_core_device *vdev,
> > > > +				       struct vfio_pci_tph *arg) {
> > > > +	int mode = arg->flags & VFIO_TPH_ST_MODE_MASK;
> > > > +
> > > > +	switch (mode) {
> > > > +	case VFIO_TPH_ST_NS_MODE:
> > > > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_NS_MODE);
> > > > +
> > > > +	case VFIO_TPH_ST_IV_MODE:
> > > > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_IV_MODE);
> > > > +
> > > > +	case VFIO_TPH_ST_DS_MODE:
> > > > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_DS_MODE);
> > > > +
> > > > +	default:
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +}
> > > > +
> > > > +static int vfio_pci_feature_tph_disable(struct
> > > > +vfio_pci_core_device
> > > > +*vdev) {
> > > > +	pcie_disable_tph(vdev->pdev);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int vfio_pci_feature_tph_prepare(struct vfio_pci_tph __user *arg,
> > > > +					size_t argsz, u32 flags,
> > > > +					struct vfio_pci_tph *tph)
> > > > +{
> > > > +	u32 op;
> > > > +	int err = vfio_check_feature(flags, argsz,
> > > > +				 VFIO_DEVICE_FEATURE_SET |
> > > > +				 VFIO_DEVICE_FEATURE_GET,
> > > > +				 sizeof(struct vfio_pci_tph));
> > > > +	if (err != 1)
> > > > +		return err;
> > >
> > > We don't seem to account for a host booted with pci=notph.
> > >
> >
> > pcie_enable_tph() looks for pci=notph and fails if it's set.
> > If pcie_enable_tph() fails, pcie_tph_get_cpu_st() and
> > pcie_tph_set_st_entry() fail too.
> >
> > Is it required to check pci=notph here as well?
> 
> Does it make sense to probe affirmatively for a feature that can't be enabled?  I
> think I previously also overlooked that we never actually check that the device
> supports TPH at probe time.
>

pci_init_capabilities() calls pci_tph_init(), so my understanding is that the device is
already probed and pdev->tph_cap is set by the time vfio is initialized.
pcie_enable_tph() checks both pdev->tph_cap and pci=notph before enabling.
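That said, probing affirmatively is a fair point. For v2 I'm thinking of something
like this at the top of vfio_pci_core_feature_tph() (untested sketch; a pci=notph
check would additionally need a helper exported by the PCI core, which I'm not
assuming exists):

	struct vfio_pci_core_device *vdev =
		container_of(device, struct vfio_pci_core_device, vdev);

	/*
	 * pdev->tph_cap is filled in by pci_tph_init() during enumeration
	 * (CONFIG_PCIE_TPH). Without it, report the feature as absent so
	 * that a VFIO_DEVICE_FEATURE_PROBE fails as well.
	 */
	if (!vdev->pdev->tph_cap)
		return -ENOTTY;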


> > > > +
> > > > +	if (copy_from_user(tph, arg, sizeof(struct vfio_pci_tph)))
> > > > +		return -EFAULT;
> > > > +
> > > > +	op = tph->flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK;
> > > > +
> > > > +	switch (op) {
> > > > +	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
> > > > +	case VFIO_DEVICE_FEATURE_TPH_DISABLE:
> > > > +	case VFIO_DEVICE_FEATURE_TPH_SET_ST:
> > > > +		return (flags & VFIO_DEVICE_FEATURE_SET) ? 0 : -EINVAL;
> > > > +
> > > > +	case VFIO_DEVICE_FEATURE_TPH_GET_ST:
> > > > +		return (flags & VFIO_DEVICE_FEATURE_GET) ? 0 : -EINVAL;
> > >
> > > This is a convoluted mangling of an ioctl into a vfio feature.
> > >
> >
> > Do you prefer all four operations to fold under a single ioctl? Or
> > split them such that enabling and disabling TPH remains a VFIO SET
> > feature while SET_ST and GET_ST moved to a regular ioctl?
> 
> Splitting the functionality between a feature and a new ioctl doesn't make any
> sense.  As I noted, I can imagine this interface being two related features where
> one allows the TPH state to be set enabled or disabled and the other allows
> setting steering tags.
> 

Understood. I will change this in V2.
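Roughly the shape I have in mind for the two features (purely a sketch to check we
are on the same page; names, layout and flag encodings are placeholders, and
vfio_pci_tph_info is the entry structure from this RFC):

/* Sketch only: two related features instead of op flags in vfio_pci_tph. */

/* Feature A: enable/disable TPH and select the ST mode. */
struct vfio_device_feature_pci_tph_state {
	__u32 flags;
#define VFIO_PCI_TPH_ENABLE		(1 << 0)	/* clear = disable TPH */
#define VFIO_PCI_TPH_ST_NS_MODE		(0 << 1)
#define VFIO_PCI_TPH_ST_IV_MODE		(1 << 1)
#define VFIO_PCI_TPH_ST_DS_MODE		(2 << 1)
};

/* Feature B: resolve and/or program steering tags. */
struct vfio_device_feature_pci_tph_st {
	__u32 count;				/* entries in info[] */
	struct vfio_pci_tph_info info[];
};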

> > > > +
> > > > +	default:
> > > > +		return -EINVAL;
> > > > +	}
> > > > +}
> > > > +
> > > > +static int vfio_pci_core_feature_tph(struct vfio_device *device, u32 flags,
> > > > +				     struct vfio_pci_tph __user *arg,
> > > > +				     size_t argsz)
> > > > +{
> > > > +	u32 op;
> > > > +	struct vfio_pci_tph tph;
> > > > +	void __user *uinfo;
> > > > +	size_t infosz;
> > > > +	struct vfio_pci_core_device *vdev =
> > > > +		container_of(device, struct vfio_pci_core_device, vdev);
> > > > +	int err = vfio_pci_feature_tph_prepare(arg, argsz, flags, &tph);
> > > > +
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > > +	op = tph.flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK;
> > > > +
> > > > +	switch (op) {
> > > > +	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
> > > > +		return vfio_pci_feature_tph_enable(vdev, &tph);
> > > > +
> > > > +	case VFIO_DEVICE_FEATURE_TPH_DISABLE:
> > > > +		return vfio_pci_feature_tph_disable(vdev);
> > > > +
> > > > +	case VFIO_DEVICE_FEATURE_TPH_GET_ST:
> > > > +	case VFIO_DEVICE_FEATURE_TPH_SET_ST:
> > > > +		uinfo = (u8 *)(arg) + offsetof(struct vfio_pci_tph, info);
> > > > +		infosz = argsz - sizeof(struct vfio_pci_tph);
> > > > +		return vfio_pci_feature_tph_st_op(vdev, &tph, uinfo, infosz);
> > >
> > > This is effectively encoding a regular ioctl as a feature.  See below.
> > >
> >
> > Addressed this in the above comment.
> >
> > > > +
> > > > +	default:
> > > > +		return -EINVAL;
> > > > +	}
> > > > +}
> > > > +
> > > >  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> > > >  				void __user *arg, size_t argsz)  { @@ -1523,6
> +1683,9 @@ int
> > > > vfio_pci_core_ioctl_feature(struct vfio_device
> > > *device, u32 flags,
> > > >  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
> > > >  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> > > >  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> > > > +	case VFIO_DEVICE_FEATURE_PCI_TPH:
> > > > +		return vfio_pci_core_feature_tph(device, flags,
> > > > +						 arg, argsz);
> > > >  	default:
> > > >  		return -ENOTTY;
> > > >  	}
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index c8dbf8219c4f..608d57dfe279 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -1458,6 +1458,74 @@ struct vfio_device_feature_bus_master {  };
> > > > #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
> > > >
> > > > +/*
> > > > + * Upon VFIO_DEVICE_FEATURE_SET, enable or disable PCIe TPH or
> > > > +set steering tags
> > > > + * on the device. Data provided when setting this feature is a
> > > > +__u32 with the
> > > > + * following flags. VFIO_DEVICE_FEATURE_TPH_ENABLE enables PCIe
> > > > +TPH in
> > > > + * no-steering-tag, interrupt-vector, or device-specific mode
> > > > +when feature flags
> > > > + * VFIO_TPH_ST_NS_MODE, VFIO_TPH_ST_IV_MODE, and
> > > VFIO_TPH_ST_DS_MODE
> > > > +are set
> > > > + * respectively.
> > > > + * VFIO_DEVICE_FEATURE_TPH_DISABLE disables PCIe TPH on the device.
> > > > + * VFIO_DEVICE_FEATURE_TPH_SET_ST set steering tags on a device
> > > > +at an index in
> > > > + * MSI-X or ST-table depending on the VFIO_TPH_ST_x_MODE flag
> > > > +used and device
> > > > + * capabilities. The caller can set multiple steering tags by
> > > > +passing an array
> > > > + * of vfio_pci_tph_info objects containing cpu_id, cache_level,
> > > > +and
> > > > + * MSI-X/ST-table index. The caller can also set the intended
> > > > +memory type and
> > > > + * the processing hint by setting VFIO_TPH_MEM_TYPE_x and
> > > > +VFIO_TPH_HINT_x flags,
> > > > + * respectively. The return value for each vfio_pci_tph_info
> > > > +object is stored in
> > > > + * err, with the steering-tag set on the device and the ph_ignore
> > > > +status bit
> > > > + * resulting from the steering-tag lookup operation. If err < 0,
> > > > +the values
> > > > + * stored in the st and ph_ignore fields should be considered invalid.
> > > > + *
> > >
> > > Sorry, I don't understand ph_ignore as described here.  It's only
> > > ever set to 1.  I guess we assume the incoming state is zero.  But what does it
> mean?
> > >
> >
> > Agree, it's confusing and it has something to do with the current
> > implementation of TPH in the Kernel.
> > PCIe firmware specification states root-port ACPI _DSM as the single
> > source of truth for steering-tags. Its job is to maintain a table of
> > steering-tags indexed by a tuple of <CPU/Container UID, Cache-id and a
> > processing-hint>. Each tuple is mapped to two steering-tags for
> > persistent or volatile memory (either or both depending on the platform).
> > A caller looking to acquire a steering tag for a cache closer to a CPU
> > will pass above tuple in the format defined in the PCIe firmware specification.
> > But PCIe spec also allows implementing TPH without processing-hints,
> > in such cases the _DSM is supposed to set ph_ignore bit in the return structure.
> > The current TPH implementation in the Linux kernel ignores some of these
> > details: it sets the cache-id and the processing-hint in the above tuple
> > to zeros, and it also ignores the ph_ignore bit returned by the _DSM.
> > However, I wrote this RFC to match the PCI firmware specification,
> > therefore it sets ph_ignore bit to 1 to inform the caller that it is
> > ignored so that caller can make an informed decision to stick with the default
> (bidirectional hint) or exit with error.
> > This is why the cache_level is ignored as well, ideally cache levels
> > (0 for L1D, 1 for L2D,
> > Etc.) should be converted to a PPTT cache_id and passed into the _DSM.
> > I'm working on a separate PCI patch to fix above issues.
> >
> > > > + * Upon VFIO_DEVICE_FEATURE_GET,  return steering tags to the caller.
> > > > + * VFIO_DEVICE_FEATURE_TPH_GET_ST returns steering tags to the caller.
> > > > + * The return values per vfio_pci_tph_info object are stored in
> > > > + the st,
> > > > + * ph_ignore, and err fields.
> > >
> > > Why do we need to provide an interface to return the steering tags
> > > set by the user?  Seems like this could be a write-only, SET, interface.
> > >
> >
> > VFIO_DEVICE_FEATURE_TPH_GET_ST calls the _DSM to read the steering-tag
> > for a vector of tuples of a cpu-id, cache-id, processing-hint. It is
> > for devices that operate in device specific mode where they don't set
> > steering-tags in the MSI-X or ST tables but in another construct like a queue
> context accessible from the user/kernel space.
> > For instance, this is what will be used by DPDK poll mode drivers as
> > [2] fleshes out.
> >
> > So, VFIO_DEVICE_FEATURE_TPH_GET_ST doesn't return steering-tags set by
> > the user. It's there to read the platform database of steering-tags
> > which is an ACPI _DSM bound to each PCIe root port.
> >
> > > > + */
> > > > +struct vfio_pci_tph_info {
> > >
> > > This seems more of an _entry than an _info.  Note that vfio has
> > > various INFO ioctls which make this nomenclature confusing.
> > >
> >
> > It is, it's more of a struct that describes the request.
> > Perhaps vfio_pci_tph_request_descriptor, but that's too long.
> > Suggestions are welcome.
> >
> >
> > > > +	/* in */
> > > > +	__u32 cpu_id;
> > >
> > > The logical ID?
> > >
> > Yes, this is logical ID.
> >
> > > > +	__u32 cache_level;
> > >
> > > Zero based?  1-based?  How many cache levels are there?  What's valid here?
> > >
> > It's Zero based. This is currently ignored due to limitations in the kernel TPH.
> > One way to validate would be iterating through the topology to find
> > it, decrementing a copy_of_cache_level as we move further away from
> > the cpu_id until the value reaches zero. If loop terminates before
> copy_of_cache_level reaching zero, -EINVAL.
> > Is that a good approach?
> 
> Sounds like a try and hope for the best interface.  How would userspace
> realistically infer the cache levels?  I don't think we want vfio describing that.
> 

The user sees each cache level relative to a logical CPU.
For instance, cpu_id = 0 and cache_level = 0 means the L1D of logical CPU 0, while
cpu_id = 0 and cache_level = 1 means its L2D. The further a cache is from the CPU,
the higher its cache_level.
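To make that concrete, this is roughly how a caller would ask for the L2D tag of
logical CPU 0 with the uAPI as posted (illustrative fragment only; error handling
omitted, device_fd is an open VFIO device fd, and the padding questions raised
above are ignored):

	char buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_pci_tph) +
		 sizeof(struct vfio_pci_tph_info)] = { 0 };
	struct vfio_device_feature *feat = (void *)buf;
	struct vfio_pci_tph *tph = (void *)feat->data;
	struct vfio_pci_tph_info *info = (void *)tph->info;

	feat->argsz = sizeof(buf);
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_PCI_TPH;
	tph->argsz  = sizeof(*tph) + sizeof(*info);
	tph->flags  = VFIO_DEVICE_FEATURE_TPH_GET_ST | VFIO_TPH_ST_DS_MODE;
	tph->count  = 1;
	info->cpu_id      = 0;		/* logical CPU 0 */
	info->cache_level = 1;		/* 0 = L1D, 1 = L2D, ... */
	info->flags       = VFIO_TPH_MEM_TYPE_VMEM | VFIO_TPH_HINT_BIDIR;

	ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
	/* info->st holds the tag when info->err == 0. */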

> > > > +	__u8  flags;
> > > > +#define VFIO_TPH_MEM_TYPE_MASK		0x1
> > > > +#define VFIO_TPH_MEM_TYPE_SHIFT		0
> > > > +#define VFIO_TPH_MEM_TYPE_VMEM		0	/* Request
> > > volatile memory ST */
> > > > +#define VFIO_TPH_MEM_TYPE_PMEM		1	/* Request
> > > persistent memory ST */
> > >
> > > Is there a relation to the cache_level here?  Spec references here
> > > and below, assuming those are relevant to these values.
> > >
> > There is no relation to cache level. Return from the _DSM has PMEM and
> > VMEM sections. If either of those are not supported a valid bit is set to 0 in the
> return.
> >
> > PCIe spec has four processing-hints, VFIO_TPH_HINT_* are there to set
> > them in the flags field. Hints are forced to VFIO_TPH_HINT_BIDIR by
> > current TPH implementation as I described above.
> >
> > > > +
> > > > +#define VFIO_TPH_HINT_MASK		0x3
> > > > +#define VFIO_TPH_HINT_SHIFT		1
> > > > +#define VFIO_TPH_HINT_BIDIR		0
> > > > +#define VFIO_TPH_HINT_REQSTR		(1 << VFIO_TPH_HINT_SHIFT)
> > > > +#define VFIO_TPH_HINT_TARGET		(2 << VFIO_TPH_HINT_SHIFT)
> > > > +#define VFIO_TPH_HINT_TARGET_PRIO	(3 << VFIO_TPH_HINT_SHIFT)
> > >
> > > There needs to be a __u8 padding in here somewhere or flags extended
> > > to __u16.
> > >
> > > > +	__u16 index;			/* MSI-X/ST-table index to set ST */
> > > > +	/* out */
> > > > +	__u16 st;			/* Steering-Tag */
> > >
> > > Sorry if I'm just not familiar with TPH, but why do we need to return the ST?
> > > Doesn't hardware make use of the ST index and the physical value
> > > gets applied automatically in HW?
> > >
> >
> > Device specific mode requires this. For instance, intel E810 NIC can
> > set Steering-tags in queue context which is in user-space when used with DPDK.
> > For such use cases we must return steering-tags read from the _DSM.
> > Going back to DPDK RFC mentioned here, if TPH gets enabled in VFIO,
> > the E810 poll mode driver in the DPDK will ask the kernel a
> > steering-tag for a combination of a cpu-id, a cache-level and a
> > processing-hint. In response the kernel will invoke the ACPI _DSM for
> > the root port of the device via pcie_tph_get_cpu_st() and return the
> > steering-tag back to the user. E810 driver will set the returned tag in
> appropriate queue context.
> >
> > For cases where steering-tag is not required in user-space (VMMs),
> > caller asks the kernel to set steering-tag that corresponds the
> > cpu-id, cache-level, and PH combination at a given MSI-X vector entry
> > or ST table in config space. For such cases too, the kernel will read
> > the steering-tag from the _DSM and set the tag in the corresponding MSI-X or
> ST table entry.
> 
> We need to consider that there are vfio-pci variant drivers that support migration
> and make use of the vfio-pci-core feature interface.
> TPH programming appears to be very non-migration friendly, the attributes are
> very specific to the current host system.  Should migration and TPH be mutually
> exclusive?  This may be a factor in the decision to use a feature or ioctl.
> 

I'm not very familiar with migration, so I don't have a good answer here.
By virtualizing STs, VMMs can provide a consistent ST assignment for a vCPU and a
cache level relative to that vCPU, but there is no guarantee that the same processing
hint is available on the next platform. One solution would be to allow only the
bidirectional hint in VMs (i.e., cache injection with the ST alone, which works on
everything that supports TPH).

For virtualization in general, I think platform STs should not be exposed to VMs,
nor should VMs be allowed to set arbitrary STs on devices, to prevent them from
inferring the platform topology and injecting into caches of physical CPUs they are
not supposed to run on. So I believe VMMs need some additional features to support
TPH securely anyway.
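For illustration, the kind of VMM-side filtering I have in mind (not part of this
RFC; vcpu_to_host_cpu() is a made-up stand-in for whatever pinning scheme the VMM
uses):

/* Reject anything but the bidirectional hint from the guest and translate the
 * guest's vCPU number to the host logical CPU it is pinned to, so the guest
 * never sees or chooses physical topology directly.
 */
static int vmm_sanitize_tph_entry(struct vfio_pci_tph_info *e, unsigned int vcpu)
{
	unsigned int hint = e->flags & (VFIO_TPH_HINT_MASK << VFIO_TPH_HINT_SHIFT);

	if (hint != VFIO_TPH_HINT_BIDIR)
		return -EPERM;			/* ST-only injection for guests */

	e->cpu_id = vcpu_to_host_cpu(vcpu);	/* hypothetical pinning lookup */
	return 0;
}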

> > > > +	__u8  ph_ignore;		/* Processing hint was ignored by */
> > >
> > > Padding/alignment, same as above.  "ignored by..."  By what?  Is that an
> error?
> > >
> >
> > An error. "Processing hint was ignored by the platform"
> >
> > > > +	__s32 err;			/* Error on getting/setting Steering-
> > > Tag*/
> > > > +};
> > >
> > > Generally we'd expect a system call either works or fails.  Why do
> > > we need per entry error report?  Can't we validate and prepare the
> > > entire operation before setting any of it into the device?
> > > Ultimately we're just writing hints to config space or MSI-X table
> > > space, so the write operation itself is not likely to be the point of failure.
> > >
> >
> > Write to MSI-X won't fail but reading the steering-tag from the _DSM
> > for an invalid <cpu_id, cach_id, ph_hint> combo can fail in both
> > GET_ST and SET_ST cases.
> > We can change this to an all or nothing interface, such that success
> > if all the entries are valid and otherwise if at least one is invalid.
> > But in that case the caller may not know which entries were invalid.
> > If you think that's ok, I can change it.
> 
> It's not exactly uncommon for an ioctl to fail for a single bad arg among many.  I
> think it's harder for userspace to deal with recovering from a partially
> implemented call.
> 

I will change this in v2.
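Roughly, the all-or-nothing version could look like this (untested sketch: resolve
every tag first, then program the table, so a bad entry fails the whole call before
any device state changes; only pcie_tph_set_st_entry() itself can still fail part
way through):

static int vfio_pci_tph_set_st_all(struct vfio_pci_core_device *vdev,
				   struct vfio_pci_tph_info *info, u32 count)
{
	u32 i;
	int mtype, err;

	/* Pass 1: validate and resolve all steering tags, touch nothing. */
	for (i = 0; i < count; i++) {
		if (info[i].cpu_id >= nr_cpu_ids ||
		    !cpu_present(info[i].cpu_id))
			return -EINVAL;

		mtype = (info[i].flags & VFIO_TPH_MEM_TYPE_MASK) >>
			VFIO_TPH_MEM_TYPE_SHIFT;

		err = pcie_tph_get_cpu_st(vdev->pdev, mtype,
					  topology_core_id(info[i].cpu_id),
					  &info[i].st);
		if (err)
			return err;
	}

	/* Pass 2: program the resolved tags into the MSI-X/ST table. */
	for (i = 0; i < count; i++) {
		err = pcie_tph_set_st_entry(vdev->pdev, info[i].index,
					    info[i].st);
		if (err)
			return err;
	}

	return 0;
}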

> > > > +
> > > > +struct vfio_pci_tph {
> > > > +	__u32 argsz;			/* Size of vfio_pci_tph and info[] */
> > > > +	__u32 flags;
> > > > +#define VFIO_DEVICE_FEATURE_TPH_OP_MASK		0x7
> > > > +#define VFIO_DEVICE_FEATURE_TPH_OP_SHIFT	3
> > > > +#define VFIO_DEVICE_FEATURE_TPH_ENABLE		0	/*
> > > Enable TPH on device */
> > > > +#define VFIO_DEVICE_FEATURE_TPH_DISABLE	1	/* Disable TPH
> > > on device */
> > > > +#define VFIO_DEVICE_FEATURE_TPH_GET_ST		2	/* Get
> > > steering-tags */
> > > > +#define VFIO_DEVICE_FEATURE_TPH_SET_ST		4	/* Set
> > > steering-rags */
> > >
> > > s/rags/tags/
> > >
> > > vfio device features already have GET and SET as part of their base
> > > structure, why are they duplicated here?  It really seems like there
> > > are two separate features here, one that allows enabling with a
> > > given mode or disable, and another that allows writing specific
> > > steering tags.  Both seem like they could be SET-only features.
> > > It's also not clear to me that there's a significant advantage to
> > > providing an array of steering tags.  Isn't updating STs an infrequent operation
> and likely bound to at most 2K tags in the case of MSI-X?
> > >
> >
> > VFIO_DEVICE_FEATURE_TPH_ENABLE and
> VFIO_DEVICE_FEATURE_TPH_DISABLE are
> > SET features.
> > Since, VFIO_DEVICE_FEATURE_TPH_SET_ST requires writing to ST table, so
> > that too falls under SET features.
> > Therefore, these flags are only valid when VFIO_DEVICE_FEATURE_SET flag is
> set.
> > The only GET features use case here is VFIO_DEVICE_FEATURE_TPH_GET_ST
> > that reads the steering-tags from the _DSM and returns it back to
> > caller. Only valid with VFIO_DEVICE_FEATURE_GET flag.
> > These are checked in vfio_pci_feature_tph_prepare().
> >
> > As I mentioned earlier, does it make sense to leave enable/disable
> > under VFIO Feature and move GET_ST and SET_ST to a separate ioctl?
> 
> Splitting TPH between a feature and new ioctls doesn't seem like a logical
> interface.
> 
> > There isn't much rational to providing an array of tuples other than
> > following the format VFIO_DEVICE_SET_IRQS that sets eventfds.
> >
> > > > +
> > > > +#define	VFIO_TPH_ST_MODE_MASK	(0x3 <<
> > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > > > +#define	VFIO_TPH_ST_NS_MODE	(0 <<
> > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > > > +#define	VFIO_TPH_ST_IV_MODE	(1 <<
> > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > > > +#define	VFIO_TPH_ST_DS_MODE	(2 <<
> > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > > > +	__u32 count;				/* Number of entries in info[]
> > > */
> > > > +	struct vfio_pci_tph_info info[];
> > > > +#define VFIO_TPH_INFO_MAX	64		/* Max entries allowed
> > > in info[] */
> > >
> > > This seems to match the limit if the table is located in extended
> > > config space, but it's not particularly relevant if the table is
> > > located in MSI-X space.  Why is this a good limit?
> >
> > Yes, for MSI-X the caller may have to invoke the ioctl multiple times
> > if more than 64 entries need to be updated.
> >
> > >
> > > Also, try to keep these all in 80 column.  Thanks,
> > >
> >
> > I will, try to keep these under 80 columns.
> >
> > Thanks Alex, for your input. Let me know what you think about moving
> > SET_ST and GET_ST to an ioctl and keep ENABLE/DISABLE TPH as a vfio
> feature.
> 
> It seems like kind of a hodgepodge to split the interface like that, imo.  Sorry for a
> less that complete reply, I'm on PTO this week.
> Thanks,

Understood. Now, the question is whether this should be an ioctl or two features, 
and it all depends on how TPH works with migration.
Thanks and have a great week.
