linux-kernel - Re: [RFC PATCH] vfio/pci: add PCIe TPH to device feature ioctl

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250303201246.5122beb5.alex.williamson@redhat.com>
Date: Mon, 3 Mar 2025 20:12:46 -0700
From: Alex Williamson <alex.williamson@...hat.com>
To: Wathsala Wathawana Vithanage <wathsala.vithanage@....com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, nd
 <nd@....com>, Jason Gunthorpe <jgg@...pe.ca>, Kevin Tian
 <kevin.tian@...el.com>, Philipp Stanner <pstanner@...hat.com>, Yunxiang Li
 <Yunxiang.Li@....com>, "Dr. David Alan Gilbert" <linux@...blig.org>, Ankit
 Agrawal <ankita@...dia.com>, "open list:VFIO DRIVER" <kvm@...r.kernel.org>,
 Jeremy Linton <Jeremy.Linton@....com>, Honnappa Nagarahalli
 <Honnappa.Nagarahalli@....com>, Dhruv Tripathi <Dhruv.Tripathi@....com>
Subject: Re: [RFC PATCH] vfio/pci: add PCIe TPH to device feature ioctl

On Mon, 3 Mar 2025 04:20:24 +0000
Wathsala Wathawana Vithanage <wathsala.vithanage@....com> wrote:

> Hi Alex,
> 
> > > Linux v6.13 introduced the PCIe TLP Processing Hints (TPH) feature for
> > > direct cache injection. As described in the relevant patch set [1],
> > > direct cache injection in supported hardware allows optimal platform
> > > resource utilization for specific requests on the PCIe bus. This
> > > feature is currently available only for kernel device drivers.
> > > However, user space applications, especially those whose performance
> > > is sensitive to the latency of inbound writes as seen by a CPU core,
> > > may benefit from using this information (E.g., DPDK cache stashing RFC
> > > [2] or an HPC application running in a VM).
> > >
> > > This patch enables configuring of TPH from the user space via
> > > VFIO_DEVICE_FEATURE IOCLT. It provides an interface to user space
> > > drivers and VMMs to enable/disable the TPH feature on PCIe devices and
> > > set steering tags in MSI-X or steering-tag table entries using
> > > VFIO_DEVICE_FEATURE_SET flag or read steering tags from the kernel
> > > using VFIO_DEVICE_FEATURE_GET to operate in device-specific mode.  
> > 
> > Unless I'm missing it, the RFC in [2] doesn't make use of this proposal.  Is there
> > published code anywhere that does use this interface?
> >   
> > > [1]
> > > lore.kernel.org/linux-pci/20241002165954.128085-1-wei.huang2@....com
> > > [2]
> > > inbox.dpdk.org/dev/20241021015246.304431-2-wathsala.vithanage@....com
> > >  
> 
> The DPDK RFC in question is on hold for now because there is no way to get steering-tags
> from the user space. 
> Consequently, I had to stop working on [2] and start working on the kernel to get VFIO ready
> for it. This was discussed in a DPDK community call sometime back.
> https://mails.dpdk.org/archives/dev/2024-October/305408.html

I think the userspace and kernel support need to be co-developed, we
don't want to be adding kernel code and uAPIs that don't have users.
 
> > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@....com>
> > > ---
> > >  drivers/vfio/pci/vfio_pci_core.c | 163 +++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h        |  68 +++++++++++++
> > >  2 files changed, 231 insertions(+)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c
> > > b/drivers/vfio/pci/vfio_pci_core.c
> > > index 586e49efb81b..d6dd0495b08b 100644
> > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -29,6 +29,7 @@
> > >  #include <linux/nospec.h>
> > >  #include <linux/sched/mm.h>
> > >  #include <linux/iommufd.h>
> > > +#include <linux/pci-tph.h>
> > >  #if IS_ENABLED(CONFIG_EEH)
> > >  #include <asm/eeh.h>
> > >  #endif
> > > @@ -1510,6 +1511,165 @@ static int vfio_pci_core_feature_token(struct  
> > vfio_device *device, u32 flags,  
> > >  	return 0;
> > >  }
> > >
> > > +static ssize_t vfio_pci_tph_uinfo_dup(struct vfio_pci_tph *tph,
> > > +				      void __user *arg, size_t argsz,
> > > +				      struct vfio_pci_tph_info **info) {
> > > +	size_t minsz;
> > > +
> > > +	if (tph->count > VFIO_TPH_INFO_MAX)
> > > +		return -EINVAL;
> > > +	if (!tph->count)
> > > +		return 0;
> > > +
> > > +	minsz = tph->count * sizeof(struct vfio_pci_tph_info);
> > > +	if (minsz < argsz)
> > > +		return -EINVAL;
> > > +
> > > +	*info = memdup_user(arg, minsz);
> > > +	if (IS_ERR(info))
> > > +		return PTR_ERR(info);
> > > +
> > > +	return minsz;
> > > +}
> > > +
> > > +static int vfio_pci_feature_tph_st_op(struct vfio_pci_core_device *vdev,
> > > +				      struct vfio_pci_tph *tph,
> > > +				      void __user *arg, size_t argsz) {
> > > +	int i, mtype, err = 0;
> > > +	u32 cpu_uid;
> > > +	struct vfio_pci_tph_info *info = NULL;
> > > +	ssize_t data_size = vfio_pci_tph_uinfo_dup(tph, arg, argsz, &info);
> > > +
> > > +	if (data_size <= 0)
> > > +		return data_size;
> > > +
> > > +	for (i = 0; i < tph->count; i++) {
> > > +		if (!(info[i].cpu_id < nr_cpu_ids && cpu_present(info[i].cpu_id)))
> > > +{  
> > 
> > Not intuitive logic, imo.  This could easily be:
> > 
> > 		if (info[i].cpu_id >= nr_cpu_ids || !cpu_present(info[i].cpu_id))
> >   
> 
> Agree, looks more intuitive.
> 
> > > +			info[i].err = -EINVAL;
> > > +			continue;
> > > +		}
> > > +		cpu_uid = topology_core_id(info[i].cpu_id);
> > > +		mtype = (info[i].flags & VFIO_TPH_MEM_TYPE_MASK) >>
> > > +			VFIO_TPH_MEM_TYPE_SHIFT;
> > > +
> > > +		/* processing hints are always ignored */
> > > +		info[i].ph_ignore = 1;
> > > +
> > > +		info[i].err = pcie_tph_get_cpu_st(vdev->pdev, mtype, cpu_uid,
> > > +						  &info[i].st);
> > > +		if (info[i].err)
> > > +			continue;
> > > +
> > > +		if (tph->flags & VFIO_DEVICE_FEATURE_TPH_SET_ST) {
> > > +			info[i].err = pcie_tph_set_st_entry(vdev->pdev,
> > > +							    info[i].index,
> > > +							    info[i].st);
> > > +		}
> > > +	}
> > > +
> > > +	if (copy_to_user(arg, info, data_size))
> > > +		err = -EFAULT;
> > > +
> > > +	kfree(info);
> > > +	return err;
> > > +}
> > > +
> > > +
> > > +static int vfio_pci_feature_tph_enable(struct vfio_pci_core_device *vdev,
> > > +				       struct vfio_pci_tph *arg)
> > > +{
> > > +	int mode = arg->flags & VFIO_TPH_ST_MODE_MASK;
> > > +
> > > +	switch (mode) {
> > > +	case VFIO_TPH_ST_NS_MODE:
> > > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_NS_MODE);
> > > +
> > > +	case VFIO_TPH_ST_IV_MODE:
> > > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_IV_MODE);
> > > +
> > > +	case VFIO_TPH_ST_DS_MODE:
> > > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_DS_MODE);
> > > +
> > > +	default:
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +}
> > > +
> > > +static int vfio_pci_feature_tph_disable(struct vfio_pci_core_device
> > > +*vdev) {
> > > +	pcie_disable_tph(vdev->pdev);
> > > +	return 0;
> > > +}
> > > +
> > > +static int vfio_pci_feature_tph_prepare(struct vfio_pci_tph __user *arg,
> > > +					size_t argsz, u32 flags,
> > > +					struct vfio_pci_tph *tph)
> > > +{
> > > +	u32 op;
> > > +	int err = vfio_check_feature(flags, argsz,
> > > +				 VFIO_DEVICE_FEATURE_SET |
> > > +				 VFIO_DEVICE_FEATURE_GET,
> > > +				 sizeof(struct vfio_pci_tph));
> > > +	if (err != 1)
> > > +		return err;  
> > 
> > We don't seem to account for a host booted with pci=notph.
> >   
> 
> pcie_enable_tph() looks for pci=notph and fails if it's set.
> If pcie_enable_tph() fails, pcie_tph_get_cpu_st() and 
> pcie_tph_set_st_entry() fail too.
> 
> Is it required to check pci=notph here as well?

Does it make sense to probe affirmatively for a feature that can't be
enabled?  I think I previously also overlooked that we never actually
check that the device supports TPH at probe time.
 
> > > +
> > > +	if (copy_from_user(tph, arg, sizeof(struct vfio_pci_tph)))
> > > +		return -EFAULT;
> > > +
> > > +	op = tph->flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK;
> > > +
> > > +	switch (op) {
> > > +	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
> > > +	case VFIO_DEVICE_FEATURE_TPH_DISABLE:
> > > +	case VFIO_DEVICE_FEATURE_TPH_SET_ST:
> > > +		return (flags & VFIO_DEVICE_FEATURE_SET) ? 0 : -EINVAL;
> > > +
> > > +	case VFIO_DEVICE_FEATURE_TPH_GET_ST:
> > > +		return (flags & VFIO_DEVICE_FEATURE_GET) ? 0 : -EINVAL;  
> > 
> > This is a convoluted mangling of an ioctl into a vfio feature.
> >   
> 
> Do you prefer all four operations to fold under a single ioctl? Or split them
> such that enabling and disabling TPH remains a VFIO SET feature while
> SET_ST and GET_ST moved to a regular ioctl?

Splitting the functionality between a feature and a new ioctl doesn't
make any sense.  As I noted, I can imagine this interface being two
related features where one allows the TPH state to be set enabled or
disabled and the other allows setting steering tags.

> > > +
> > > +	default:
> > > +		return -EINVAL;
> > > +	}
> > > +}
> > > +
> > > +static int vfio_pci_core_feature_tph(struct vfio_device *device, u32 flags,
> > > +				     struct vfio_pci_tph __user *arg,
> > > +				     size_t argsz)
> > > +{
> > > +	u32 op;
> > > +	struct vfio_pci_tph tph;
> > > +	void __user *uinfo;
> > > +	size_t infosz;
> > > +	struct vfio_pci_core_device *vdev =
> > > +		container_of(device, struct vfio_pci_core_device, vdev);
> > > +	int err = vfio_pci_feature_tph_prepare(arg, argsz, flags, &tph);
> > > +
> > > +	if (err)
> > > +		return err;
> > > +
> > > +	op = tph.flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK;
> > > +
> > > +	switch (op) {
> > > +	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
> > > +		return vfio_pci_feature_tph_enable(vdev, &tph);
> > > +
> > > +	case VFIO_DEVICE_FEATURE_TPH_DISABLE:
> > > +		return vfio_pci_feature_tph_disable(vdev);
> > > +
> > > +	case VFIO_DEVICE_FEATURE_TPH_GET_ST:
> > > +	case VFIO_DEVICE_FEATURE_TPH_SET_ST:
> > > +		uinfo = (u8 *)(arg) + offsetof(struct vfio_pci_tph, info);
> > > +		infosz = argsz - sizeof(struct vfio_pci_tph);
> > > +		return vfio_pci_feature_tph_st_op(vdev, &tph, uinfo, infosz);  
> > 
> > This is effectively encoding a regular ioctl as a feature.  See below.
> >   
> 
> Addressed this in the above comment.
> 
> > > +
> > > +	default:
> > > +		return -EINVAL;
> > > +	}
> > > +}
> > > +
> > >  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> > >  				void __user *arg, size_t argsz)
> > >  {
> > > @@ -1523,6 +1683,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device  
> > *device, u32 flags,  
> > >  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
> > >  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> > >  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> > > +	case VFIO_DEVICE_FEATURE_PCI_TPH:
> > > +		return vfio_pci_core_feature_tph(device, flags,
> > > +						 arg, argsz);
> > >  	default:
> > >  		return -ENOTTY;
> > >  	}
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index c8dbf8219c4f..608d57dfe279 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -1458,6 +1458,74 @@ struct vfio_device_feature_bus_master {  };
> > > #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
> > >
> > > +/*
> > > + * Upon VFIO_DEVICE_FEATURE_SET, enable or disable PCIe TPH or set
> > > +steering tags
> > > + * on the device. Data provided when setting this feature is a __u32
> > > +with the
> > > + * following flags. VFIO_DEVICE_FEATURE_TPH_ENABLE enables PCIe TPH
> > > +in
> > > + * no-steering-tag, interrupt-vector, or device-specific mode when
> > > +feature flags
> > > + * VFIO_TPH_ST_NS_MODE, VFIO_TPH_ST_IV_MODE, and  
> > VFIO_TPH_ST_DS_MODE  
> > > +are set
> > > + * respectively.
> > > + * VFIO_DEVICE_FEATURE_TPH_DISABLE disables PCIe TPH on the device.
> > > + * VFIO_DEVICE_FEATURE_TPH_SET_ST set steering tags on a device at an
> > > +index in
> > > + * MSI-X or ST-table depending on the VFIO_TPH_ST_x_MODE flag used
> > > +and device
> > > + * capabilities. The caller can set multiple steering tags by passing
> > > +an array
> > > + * of vfio_pci_tph_info objects containing cpu_id, cache_level, and
> > > + * MSI-X/ST-table index. The caller can also set the intended memory
> > > +type and
> > > + * the processing hint by setting VFIO_TPH_MEM_TYPE_x and
> > > +VFIO_TPH_HINT_x flags,
> > > + * respectively. The return value for each vfio_pci_tph_info object
> > > +is stored in
> > > + * err, with the steering-tag set on the device and the ph_ignore
> > > +status bit
> > > + * resulting from the steering-tag lookup operation. If err < 0, the
> > > +values
> > > + * stored in the st and ph_ignore fields should be considered invalid.
> > > + *  
> > 
> > Sorry, I don't understand ph_ignore as described here.  It's only ever set to 1.  I
> > guess we assume the incoming state is zero.  But what does it mean?
> >   
> 
> Agree, it's confusing and it has something to do with the current implementation
> of TPH in the Kernel. 
> PCIe firmware specification states root-port ACPI _DSM as the single source of
> truth for steering-tags. Its job is to maintain a table of steering-tags indexed by
> a tuple of CPU/Container UID, Cache-id and a processing-hint>. Each tuple is mapped
> to two steering-tags for persistent or volatile memory (either or both depending on
> the platform).
> A caller looking to acquire a steering tag for a cache closer to a CPU will pass above
> tuple in the format defined in the PCIe firmware specification. 
> But PCIe spec also allows implementing TPH without processing-hints, in such cases
> the _DSM is supposed to set ph_ignore bit in the return structure.
> Current TPH implementation in the Linux kernel ignores some of these details,
> it sets cache-id and the processing-hint in the above tuple to zeros. It also ignores the
> ph_ignore bit return by the _DSM as well. 
> However, I wrote this RFC to match the PCI firmware specification, therefore it sets
> ph_ignore bit to 1 to inform the caller that it is ignored so that caller can make an
> informed decision to stick with the default (bidirectional hint) or exit with error.
> This is why the cache_level is ignored as well, ideally cache levels (0 for L1D, 1 for L2D,
> Etc.) should be converted to a PPTT cache_id and passed into the _DSM.
> I'm working on a separate PCI patch to fix above issues.
> 
> > > + * Upon VFIO_DEVICE_FEATURE_GET,  return steering tags to the caller.
> > > + * VFIO_DEVICE_FEATURE_TPH_GET_ST returns steering tags to the caller.
> > > + * The return values per vfio_pci_tph_info object are stored in the
> > > + st,
> > > + * ph_ignore, and err fields.  
> > 
> > Why do we need to provide an interface to return the steering tags set by the
> > user?  Seems like this could be a write-only, SET, interface.
> >   
> 
> VFIO_DEVICE_FEATURE_TPH_GET_ST calls the _DSM to read the steering-tag
> for a vector of tuples of a cpu-id, cache-id, processing-hint. It is for devices that operate
> in device specific mode where they don't set steering-tags in the MSI-X or ST tables but
> in another construct like a queue context accessible from the user/kernel space. 
> For instance, this is what will be used by DPDK poll mode drivers as [2] fleshes
> out. 
> 
> So, VFIO_DEVICE_FEATURE_TPH_GET_ST doesn't return steering-tags set by the
> user. It's there to read the platform database of steering-tags which is an ACPI _DSM
> bound each PCIe root port.
> 
> > > + */
> > > +struct vfio_pci_tph_info {  
> > 
> > This seems more of an _entry than an _info.  Note that vfio has various INFO
> > ioctls which make this nomenclature confusing.
> >   
> 
> It is, it's more of a struct that describes the request.
> Perhaps vfio_pci_tph_request_descriptor, but that's too long. 
> Suggestions are welcome.
> 
> 
> > > +	/* in */
> > > +	__u32 cpu_id;  
> > 
> > The logical ID?
> >   
> Yes, this is logical ID.
> 
> > > +	__u32 cache_level;  
> > 
> > Zero based?  1-based?  How many cache levels are there?  What's valid here?
> >   
> It's Zero based. This is currently ignored due to limitations in the kernel TPH. 
> One way to validate would be iterating through the topology to find it, decrementing
> a copy_of_cache_level as we move further away from the cpu_id until the value
> reaches zero. If loop terminates before copy_of_cache_level reaching zero, -EINVAL.
> Is that a good approach?

Sounds like a try and hope for the best interface.  How would userspace
realistically infer the cache levels?  I don't think we want vfio
describing that.

> > > +	__u8  flags;
> > > +#define VFIO_TPH_MEM_TYPE_MASK		0x1
> > > +#define VFIO_TPH_MEM_TYPE_SHIFT		0
> > > +#define VFIO_TPH_MEM_TYPE_VMEM		0	/* Request  
> > volatile memory ST */  
> > > +#define VFIO_TPH_MEM_TYPE_PMEM		1	/* Request  
> > persistent memory ST */
> > 
> > Is there a relation to the cache_level here?  Spec references here and below,
> > assuming those are relevant to these values.
> >   
> There is no relation to cache level. Return from the _DSM has PMEM and VMEM
> sections. If either of those are not supported a valid bit is set to 0 in the return.
> 
> PCIe spec has four processing-hints, VFIO_TPH_HINT_* are there to set them
> in the flags field. Hints are forced to VFIO_TPH_HINT_BIDIR by current TPH
> implementation as I described above. 
> 
> > > +
> > > +#define VFIO_TPH_HINT_MASK		0x3
> > > +#define VFIO_TPH_HINT_SHIFT		1
> > > +#define VFIO_TPH_HINT_BIDIR		0
> > > +#define VFIO_TPH_HINT_REQSTR		(1 << VFIO_TPH_HINT_SHIFT)
> > > +#define VFIO_TPH_HINT_TARGET		(2 << VFIO_TPH_HINT_SHIFT)
> > > +#define VFIO_TPH_HINT_TARGET_PRIO	(3 << VFIO_TPH_HINT_SHIFT)  
> > 
> > There needs to be a __u8 padding in here somewhere or flags extended to
> > __u16.
> >   
> > > +	__u16 index;			/* MSI-X/ST-table index to set ST */
> > > +	/* out */
> > > +	__u16 st;			/* Steering-Tag */  
> > 
> > Sorry if I'm just not familiar with TPH, but why do we need to return the ST?
> > Doesn't hardware make use of the ST index and the physical value gets applied
> > automatically in HW?
> >   
> 
> Device specific mode requires this. For instance, intel E810 NIC can set
> Steering-tags in queue context which is in user-space when used with DPDK.
> For such use cases we must return steering-tags read from the _DSM. 
> Going back to DPDK RFC mentioned here, if TPH gets enabled in VFIO, the E810 
> poll mode driver in the DPDK will ask the kernel a steering-tag for a combination
> of a cpu-id, a cache-level and a processing-hint. In response the kernel will
> invoke the ACPI _DSM for the root port of the device via pcie_tph_get_cpu_st()
> and return the steering-tag back to the user. E810 driver will set the returned tag
> in appropriate queue context.
> 
> For cases where steering-tag is not required in user-space (VMMs), caller asks the
> kernel to set steering-tag that corresponds the cpu-id, cache-level, and PH
> combination at a given MSI-X vector entry or ST table in config space. For such
> cases too, the kernel will read the steering-tag from the _DSM and set the tag
> in the corresponding MSI-X or ST table entry.

We need to consider that there are vfio-pci variant drivers that
support migration and make use of the vfio-pci-core feature interface.
TPH programming appears to be very non-migration friendly, the
attributes are very specific to the current host system.  Should
migration and TPH be mutually exclusive?  This may be a factor in the
decision to use a feature or ioctl.

> > > +	__u8  ph_ignore;		/* Processing hint was ignored by */  
> > 
> > Padding/alignment, same as above.  "ignored by..."  By what?  Is that an error?
> >   
> 
> An error. "Processing hint was ignored by the platform"
> 
> > > +	__s32 err;			/* Error on getting/setting Steering-  
> > Tag*/  
> > > +};  
> > 
> > Generally we'd expect a system call either works or fails.  Why do we need per
> > entry error report?  Can't we validate and prepare the entire operation before
> > setting any of it into the device?  Ultimately we're just writing hints to config
> > space or MSI-X table space, so the write operation itself is not likely to be the
> > point of failure.
> >   
> 
> Write to MSI-X won't fail but reading the steering-tag from the _DSM for an
> invalid <cpu_id, cach_id, ph_hint> combo can fail in both GET_ST and SET_ST
> cases.
> We can change this to an all or nothing interface, such that success if all the
> entries are valid and otherwise if at least one is invalid. But in that case the
> caller may not know which entries were invalid. If you think that's ok, I can
> change it.

It's not exactly uncommon for an ioctl to fail for a single bad arg
among many.  I think it's harder for userspace to deal with recovering
from a partially implemented call.

> > > +
> > > +struct vfio_pci_tph {
> > > +	__u32 argsz;			/* Size of vfio_pci_tph and info[] */
> > > +	__u32 flags;
> > > +#define VFIO_DEVICE_FEATURE_TPH_OP_MASK		0x7
> > > +#define VFIO_DEVICE_FEATURE_TPH_OP_SHIFT	3
> > > +#define VFIO_DEVICE_FEATURE_TPH_ENABLE		0	/*  
> > Enable TPH on device */  
> > > +#define VFIO_DEVICE_FEATURE_TPH_DISABLE	1	/* Disable TPH  
> > on device */  
> > > +#define VFIO_DEVICE_FEATURE_TPH_GET_ST		2	/* Get  
> > steering-tags */  
> > > +#define VFIO_DEVICE_FEATURE_TPH_SET_ST		4	/* Set  
> > steering-rags */
> > 
> > s/rags/tags/
> > 
> > vfio device features already have GET and SET as part of their base structure, why
> > are they duplicated here?  It really seems like there are two separate features
> > here, one that allows enabling with a given mode or disable, and another that
> > allows writing specific steering tags.  Both seem like they could be SET-only
> > features.  It's also not clear to me that there's a significant advantage to providing
> > an array of steering tags.  Isn't updating STs an infrequent operation and likely
> > bound to at most 2K tags in the case of MSI-X?
> >   
> 
> VFIO_DEVICE_FEATURE_TPH_ENABLE and VFIO_DEVICE_FEATURE_TPH_DISABLE
> are SET features.
> Since, VFIO_DEVICE_FEATURE_TPH_SET_ST requires writing to ST table, so that too
> falls under SET features.
> Therefore, these flags are only valid when VFIO_DEVICE_FEATURE_SET flag is set.
> The only GET features use case here is VFIO_DEVICE_FEATURE_TPH_GET_ST that
> reads the steering-tags from the _DSM and returns it back to caller. Only valid
> with VFIO_DEVICE_FEATURE_GET flag. 
> These are checked in vfio_pci_feature_tph_prepare().
> 
> As I mentioned earlier, does it make sense to leave enable/disable under VFIO
> Feature and move GET_ST and SET_ST to a separate ioctl?

Splitting TPH between a feature and new ioctls doesn't seem like a
logical interface.

> There isn't much rational to providing an array of tuples other than following
> the format VFIO_DEVICE_SET_IRQS that sets eventfds.
> 
> > > +
> > > +#define	VFIO_TPH_ST_MODE_MASK	(0x3 <<  
> > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)  
> > > +#define	VFIO_TPH_ST_NS_MODE	(0 <<  
> > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)  
> > > +#define	VFIO_TPH_ST_IV_MODE	(1 <<  
> > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)  
> > > +#define	VFIO_TPH_ST_DS_MODE	(2 <<  
> > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)  
> > > +	__u32 count;				/* Number of entries in info[]  
> > */  
> > > +	struct vfio_pci_tph_info info[];
> > > +#define VFIO_TPH_INFO_MAX	64		/* Max entries allowed  
> > in info[] */
> > 
> > This seems to match the limit if the table is located in extended config space, but
> > it's not particularly relevant if the table is located in MSI-X space.  Why is this a
> > good limit?  
> 
> Yes, for MSI-X the caller may have to invoke the ioctl multiple times if more
> than 64 entries need to be updated. 
> 
> > 
> > Also, try to keep these all in 80 column.  Thanks,
> >   
> 
> I will, try to keep these under 80 columns.
> 
> Thanks Alex, for your input. Let me know what you think about moving SET_ST
> and GET_ST to an ioctl and keep ENABLE/DISABLE TPH as a vfio feature.

It seems like kind of a hodgepodge to split the interface like that,
imo.  Sorry for a less that complete reply, I'm on PTO this week.
Thanks,

Alex