lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 3 Oct 2010 15:02:10 +0200
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Tom Lyon <pugs@...co.com>
Cc:	linux-pci@...r.kernel.org, jbarnes@...tuousgeek.org,
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	randy.dunlap@...cle.com, arnd@...db.de, joro@...tes.org,
	hjk@...utronix.de, avi@...hat.com, gregkh@...e.de,
	chrisw@...s-sol.org, alex.williamson@...hat.com
Subject: Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI
 drivers

On Thu, Sep 30, 2010 at 04:09:39PM -0700, Tom Lyon wrote:
> On Monday, September 27, 2010 04:54:21 am Michael S. Tsirkin wrote:
> > On Wed, Sep 22, 2010 at 02:18:24PM -0700, Tom Lyon wrote:
> > > Signed-off-by: Tom Lyon <pugs@...co.com>
> > 
> > Some comments on the pci bits:
> > 
> > After going over them for the Nth time - something needs to be done
> > with the rvirt/write tables. I doubt anyone besides me and you
> > has gone over them: one has to pull
> > up the spec just to understand which bits are set here. Why can't we
> > have a module init routine and use the macros in pci_regs.h for this
> > purpose, as all other drivers do? We will also get to use the nice
> > endian-ness macros linux has for portability ...
> 
> I tried a couple of approaches before settling on this one. Yes, its dense, 
> perhaps its ugly, but a table approach beats open coding of rules.

Sure. But fill the table in an init routine in code.


> And I absolutely don't expect it to verifiable to someone without a PCI spec;
> I had to have the PCI spec to write it! This is an example of where 
> correctness and readability pull to different places. And the defines in 
> pci_regs are incomplete and not real self-consistent anyways.
> 
> Perhaps its time for someone to code up a different approach? Code talks!

I'll try to find the time to show the basic idea if I fail to
convey the meaning by mail.

> > 
> > > +static struct perm_bits pci_cap_basic_perm[] = {
> > 
> > What endian-ness is this? Native?
> > You pass this to user as is but pci config is little endian...
> > Also -are all these readonly? So const? read mostly?
> Yes, I need some consts.
> Endianness - I haven't really thought much and I don't have  any big-endian 
> machines to test with.  However, for now everything is byte-at-a-time which 
> makes things easier.

The tables are built up of 32 bit integers though.

> I'll take a pass at cleaning it up.
> > 
> > > +	{ 0xFFFFFFFF,	0, },		/* 0x00 vendor & device id - RO */
> > > +	{ 0x00000003,	0xFFFFFFFF, },	/* 0x04 cmd - mem & io bits virt */
> > 
> > Interrupt disable bit is under host control.
> Yes?

But you make it writeable from guest. So the value will change
under the feet of host/guest.

> > 
> > > +	{ 0,		0, },		/* 0x08 class code & revision id */
> > > +	{ 0,		0xFF00FFFF, },	/* 0x0c bist, htype, lat, cache */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x10 bar */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x14 bar */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x18 bar */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x1c bar */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x20 bar */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x24 bar */
> > > +	{ 0,		0, },		/* 0x28 cardbus - not yet */
> > > +	{ 0,		0, },		/* 0x2c subsys vendor & dev */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x30 rom bar */
> > > +	{ 0,		0, },		/* 0x34 capability ptr & resv */
> > > +	{ 0,		0, },		/* 0x38 resv */
> > > +	{ 0x000000FF,	0x000000FF, },	/* 0x3c max_lat ... irq */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_pm_perm[] = {
> > > +	{ 0,		0, },		/* 0x00 PM capabilities */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x04 PM control/status */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_vpd_perm[] = {
> > > +	{ 0,		0xFFFF0000, },	/* 0x00 address */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x04 data */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_slotid_perm[] = {
> > > +	{ 0,		0, },		/* 0x00 all read only */
> > > +};
> > > +
> > > +/* 4 different possible layouts of MSI capability */
> > > +static struct perm_bits pci_cap_msi_10_perm[] = {
> > > +	{ 0x00FF0000,	0x00FF0000, },	/* 0x00 MSI message control */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x04 MSI message address */
> > > +	{ 0x0000FFFF,	0x0000FFFF, },	/* 0x08 MSI message data */
> > > +};
> > > +static struct perm_bits pci_cap_msi_14_perm[] = {
> > > +	{ 0x00FF0000,	0x00FF0000, },	/* 0x00 MSI message control */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x04 MSI message address */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x08 MSI message upper addr */
> > > +	{ 0x0000FFFF,	0x0000FFFF, },	/* 0x0c MSI message data */
> > > +};
> > > +static struct perm_bits pci_cap_msi_20_perm[] = {
> > > +	{ 0x00FF0000,	0x00FF0000, },	/* 0x00 MSI message control */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x04 MSI message address */
> > > +	{ 0x0000FFFF,	0x0000FFFF, },	/* 0x08 MSI message data */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x0c MSI mask bits */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x10 MSI pending bits */
> > > +};
> > > +static struct perm_bits pci_cap_msi_24_perm[] = {
> > > +	{ 0x00FF0000,	0x00FF0000, },	/* 0x00 MSI message control */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x04 MSI message address */
> > > +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x08 MSI message upper addr */
> > > +	{ 0x0000FFFF,	0x0000FFFF, },	/* 0x0c MSI message data */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x10 MSI mask bits */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x14 MSI pending bits */
> > > +};
> > 
> > You let guest control MSI mask and do not virtualize,
> > but the host expects to control this register - it is
> > used for masking during interrupt rebalancing.
> > This will break the device, or even the host with
> > 'unsafe interrupts' enabled.
> > 
> > Note that if you virtualize it, you will have to virtualize
> > the pending bits too.
> I can't find any interaction between MSIs and irqbalance. I thought it was all 
> magic in the APIC. Please point me at counter-example.
> Otherwise I don't see a problem with the guest controlling the mask.

See mask_msi_irq/unmask_msi_irq.

> > > +
> > > +static struct perm_bits pci_cap_pcix_perm[] = {
> > > +	{ 0,		0xFFFF0000, },	/* 0x00 PCI_X_CMD */
> > > +	{ 0,		0, },		/* 0x04 PCI_X_STATUS */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x08 ECC ctlr & status */
> > > +	{ 0,		0, },		/* 0x0c ECC first addr */
> > > +	{ 0,		0, },		/* 0x10 ECC second addr */
> > > +	{ 0,		0, },		/* 0x14 ECC attr */
> > > +};
> > > +
> > > +/* pci express capabilities */
> > > +static struct perm_bits pci_cap_exp_perm[] = {
> > > +	{ 0,		0, },		/* 0x00 PCIe capabilities */
> > > +	{ 0,		0, },		/* 0x04 PCIe device capabilities */
> > > +	{ 0,		0xFFFFFFFF, },	/* 0x08 PCIe device control & status */
> > > +	{ 0,		0, },		/* 0x0c PCIe link capabilities */
> > > +	{ 0,		0x000000FF, },	/* 0x10 PCIe link ctl/stat - SAFE? */
> > > +	{ 0,		0, },		/* 0x14 PCIe slot capabilities */
> > > +	{ 0,		0x00FFFFFF, },	/* 0x18 PCIe link ctl/stat - SAFE? */
> > > +	{ 0,		0, },		/* 0x1c PCIe root port stuff */
> > > +	{ 0,		0, },		/* 0x20 PCIe root port stuff */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_msix_perm[] = {
> > > +	{ 0,		0, },		/* 0x00 MSI-X Enable */
> > > +	{ 0,		0, },		/* 0x04 table offset & bir */
> > > +	{ 0,		0, },		/* 0x08 pba offset & bir */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_af_perm[] = {
> > > +	{ 0,		0, },		/* 0x00 af capability */
> > > +	{ 0,		0x0001,	 },	/* 0x04 af flr bit */
> > > +};
> > > +
> > > +static struct perm_bits *pci_cap_perms[] = {
> > > +	[PCI_CAP_ID_BASIC]	= pci_cap_basic_perm,
> > > +	[PCI_CAP_ID_PM]		= pci_cap_pm_perm,
> > > +	[PCI_CAP_ID_VPD]	= pci_cap_vpd_perm,
> > > +	[PCI_CAP_ID_SLOTID]	= pci_cap_slotid_perm,
> > > +	[PCI_CAP_ID_MSI]	= NULL,			/* special */
> > > +	[PCI_CAP_ID_PCIX]	= pci_cap_pcix_perm,
> > > +	[PCI_CAP_ID_EXP]	= pci_cap_exp_perm,
> > > +	[PCI_CAP_ID_MSIX]	= pci_cap_msix_perm,
> > > +	[PCI_CAP_ID_AF]		= pci_cap_af_perm,
> > > +};
> > > 
> > > 
> > > 
> > > 
> > > +int vfio_build_config_map(struct vfio_dev *vdev)
> > > +{
> > > +	struct pci_dev *pdev = vdev->pdev;
> > > +	u8 *map;
> > > +	int i, len;
> > > +	u8 pos, cap, tmp;
> > > +	u16 flags;
> > > +	int ret;
> > > +#ifndef PCI_FIND_CAP_TTL
> > > +#define PCI_FIND_CAP_TTL	48
> > > +#endif
> > 
> > Could you add a comment to explain what is the TTL for?
> > Where do you get the number 48?
> > Why do you have the ifndef here?
> > Why does it start with PCI_?
> The TTL just keeps the code from looping forever if the chain in the pci 
> device is looped.  The name and constant are copied from pci/pci.c

Better to simply use pci_find_cap rather than copy code.

> > > +	int loops = PCI_FIND_CAP_TTL;
> > > +
> > > +	map = kmalloc(pdev->cfg_size, GFP_KERNEL);
> > 
> > So this will allocate a 4K for express, but you
> > never even look at express capabilities, so
> > you will never even touch them?
> Someday this'll probably be extended for PCIe capabilities.
> > 
> > > +	if (map == NULL)
> > > +		return -ENOMEM;
> > > +	for (i = 0; i < pdev->cfg_size; i++)
> > > +		map[i] = 0xFF;
> > > +	vdev->pci_config_map = map;
> > > +
> > > +	/* default config space */
> > > +	for (i = 0; i < pci_capability_length[0]; i++)
> > > +		map[i] = 0;
> > > +
> > > +	/* any capabilities? */
> > > +	ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +	if ((flags & PCI_STATUS_CAP_LIST) == 0)
> > > +		return 0;
> > > +
> > > +	ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +	while (pos && --loops > 0) {
> > > +		ret = pci_read_config_byte(pdev, pos, &cap);
> > > +		if (ret < 0)
> > > +			return ret;
> > > +		if (cap == 0) {
> > > +			printk(KERN_WARNING "%s: cap 0\n", __func__);
> > > +			break;
> > > +		}
> > > +		if (cap > PCI_CAP_ID_MAX) {
> > > +			printk(KERN_WARNING "%s: unknown pci capability id %x\n",
> > > +					__func__, cap);
> > > +			len = 0;
> > > +		} else
> > > +			len = pci_capability_length[cap];
> > > +		if (len == 0) {
> > > +			printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
> > > +					__func__, cap);
> > > +			len = 4;
> > > +		}
> > > +		if (len == 0xFF) {
> > > +			switch (cap) {
> > > +			case PCI_CAP_ID_MSI:
> > > +				len = vfio_msi_cap_len(vdev, pos);
> > > +				if (len < 0)
> > > +					return len;
> > > +				break;
> > > +			case PCI_CAP_ID_PCIX:
> > > +				ret = pci_read_config_word(pdev, pos + 2,
> > > +					&flags);
> > > +				if (ret < 0)
> > > +					return ret;
> > > +				if (flags & 0x3000)
> > > +					len = 24;
> > > +				else
> > > +					len = 8;
> > > +				break;
> > > +			case PCI_CAP_ID_VNDR:
> > > +				/* length follows next field */
> > > +				ret = pci_read_config_byte(pdev, pos + 2, &tmp);
> > > +				if (ret < 0)
> > > +					return ret;
> > > +				len = tmp;
> > > +				break;
> > > +			default:
> > > +				len = 0;
> > > +				break;
> > > +			}
> > > +		}
> > > +
> > > +		for (i = 0; i < len; i++) {
> > > +			if (map[pos+i] != 0xFF)
> > > +				printk(KERN_WARNING
> > > +					"%s: pci config conflict at %x, "
> > > +					"caps %x %x\n",
> > > +					__func__, i, map[pos+i], cap);
> > > +			map[pos+i] = cap;
> > > +		}
> > > +		ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> > > +		if (ret < 0)
> > > +			return ret;
> > > +	}
> > > +	if (loops <= 0)
> > > +		printk(KERN_ERR "%s: config space loop!\n", __func__);
> > 
> > If you just want to find all predefined capabilities,
> > you should porobably just use pci_find_cap.
> But this is a lot quicker.

Who cares? Come on, this is a ton of duplicated code.
Using standard APIs you will get e.g. express support
almost for free.

> > 
> > > +	return 0;
> > > +}
> > > +
> > > +static int vfio_virt_init(struct vfio_dev *vdev)
> > > +{
> > > +	struct pci_dev *pdev = vdev->pdev;
> > > +	u32 *lp;
> > > +	int i;
> > > +
> > > +	vdev->vconfig = kmalloc(256, GFP_KERNEL);
> > > +	if (vdev->vconfig == NULL)
> > > +		return -ENOMEM;
> > 
> > This one is only 256. It looks like it's sometimes
> > indexed by position which is up to cfg_size,
> > so it can crash for pci express.
> No, because the capability maps don't do pcie yet, but I'll clean this up.
> > 
> > > +
> > > +	lp = (u32 *)vdev->vconfig;
> > > +	for (i = 0; i < 256/sizeof(u32); i++, lp++)
> > > +		pci_read_config_dword(pdev, i * sizeof(u32), lp);
> > > +	vdev->bardirty = 1;
> > > +
> > > +	vdev->rbar[0] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
> > > +	vdev->rbar[1] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_1];
> > > +	vdev->rbar[2] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_2];
> > > +	vdev->rbar[3] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_3];
> > > +	vdev->rbar[4] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_4];
> > > +	vdev->rbar[5] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_5];
> > > +	vdev->rbar[6] = *(u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];
> > > +
> > > +	/* for sr-iov devices */
> > > +	vdev->vconfig[PCI_VENDOR_ID] = pdev->vendor & 0xFF;
> > > +	vdev->vconfig[PCI_VENDOR_ID+1] = pdev->vendor >> 8;
> > > +	vdev->vconfig[PCI_DEVICE_ID] = pdev->device & 0xFF;
> > > +	vdev->vconfig[PCI_DEVICE_ID+1] = pdev->device >> 8;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Restore the *real* BARs after we detect a backdoor reset.
> > > + * (backdoor = some device specific technique that we didn't catch)
> > > + */
> > > +static void vfio_bar_restore(struct vfio_dev *vdev)
> > > +{
> > > +	printk(KERN_WARNING "%s: restoring real bars\n", __func__);
> > > +
> > > +#define do_bar(off, which) \
> > > +	pci_user_write_config_dword(vdev->pdev, off, vdev->rbar[which])
> > > +
> > > +	do_bar(PCI_BASE_ADDRESS_0, 0);
> > > +	do_bar(PCI_BASE_ADDRESS_1, 1);
> > > +	do_bar(PCI_BASE_ADDRESS_2, 2);
> > > +	do_bar(PCI_BASE_ADDRESS_3, 3);
> > > +	do_bar(PCI_BASE_ADDRESS_4, 4);
> > > +	do_bar(PCI_BASE_ADDRESS_5, 5);
> > > +	do_bar(PCI_ROM_ADDRESS, 6);
> > > +#undef do_bar
> > > +}
> > > +
> > > +/*
> > > + * Pretend we're hardware and tweak the values
> > > + * of the *virtual* pci BARs to reflect the hardware
> > > + * capabilities
> > > + */
> > > +static void vfio_bar_fixup(struct vfio_dev *vdev)
> > > +{
> > 
> > So you do this on each read?
> > Why don't you mask the appropriate bits on write?
> > This is what real hardware does, after all.
> > Then you won't need the bardirty field.
> > 
> > > +	struct pci_dev *pdev = vdev->pdev;
> > > +	int bar;
> > > +	u32 *lp;
> > > +	u64 mask;
> > > +
> > > +	for (bar = 0; bar <= 5; bar++) {
> > > +		if (pci_resource_start(pdev, bar))
> > > +			mask = ~(pci_resource_len(pdev, bar) - 1);
> > > +		else
> > > +			mask = 0;
> > > +		lp = (u32 *)vdev->vconfig + PCI_BASE_ADDRESS_0 + 4*bar;
> > > +		*lp &= (u32)mask;
> > > +
> > > +		if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> > > +			*lp |= PCI_BASE_ADDRESS_SPACE_IO;
> > > +		else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> > > +			*lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> > > +			if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> > > +				*lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> > > +			if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> > > +				*lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> > > +				lp++;
> > > +				*lp &= (u32)(mask >> 32);
> > > +				bar++;
> > > +			}
> > > +		}
> > > +	}
> > > +
> > > +	if (pci_resource_start(pdev, PCI_ROM_RESOURCE))
> > 
> > Not if (pci_resource_len)?
> > 
> > > +		mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
> > > +	else
> > > +		mask = 0;
> > > +	lp = (u32 *)vdev->vconfig + PCI_ROM_ADDRESS;
> > > +	*lp &= (u32)mask;
> > > +
> > > +	vdev->bardirty = 0;
> > 
> > Aren't the pci values in little endian format?
> > If so doing operations on them in native endianness is wrong.
> > sparse generally is good at catching these, but you will
> > have to avoid so many type casts and annotate endian-ness
> > for it to be of any use.
> Will clean.
> > 
> > > +ssize_t vfio_config_readwrite(int write,
> > > +		struct vfio_dev *vdev,
> > > +		char __user *buf,
> > > +		size_t count,
> > > +		loff_t *ppos)
> > > +{
> > > +	struct pci_dev *pdev = vdev->pdev;
> > > +	int done = 0;
> > > +	int ret;
> > > +	u16 pos;
> > > +
> > > +
> > > +	if (vdev->pci_config_map == NULL) {
> > > +		ret = vfio_build_config_map(vdev);
> > > +		if (ret)
> > > +			goto out;
> > > +	}
> > > +	if (vdev->vconfig == NULL) {
> > > +		ret = vfio_virt_init(vdev);
> > > +		if (ret)
> > > +			goto out;
> > > +	}
> > 
> > Why don't you init all this on open?
> Keeps all the hooks in the same code file.

Make it a function and call from open then.
Less state -> less bugs.

> > 
> > > +
> > > +	while (count > 0) {
> > > +		pos = *ppos;
> > > +		if (pos == pdev->cfg_size)
> > > +			break;
> > > +		if (pos > pdev->cfg_size) {
> > > +			ret = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +
> > > +		ret = vfio_config_rwbyte(write, vdev, pos, buf);
> > 
> > So you split each access to 1 byte accesses. Hmm.
> > This will break atomicity guarantees - maybe it does not matter,
> > but this is why the _userspace accesses are there after all.
> > 
> > Wouldn't it be easier to only support 1, 2 and 4 byte accesses?
> > This is what all drivers do after all...
> Future optimization.  PCI is actually pretty good about not needing
> atomics, so 8 and 16  bit processors are happy.

Not atomic at the processor level, atomic at the PCI level.

> I wanted to settle on the table format and contents before doing the
> 2 and 4 byte accesses.
> > 
> > > +
> > > +		if (ret < 0)
> > > +			goto out;
> > > +		buf++;
> > > +		done++;
> > > +		count--;
> > > +		(*ppos)++;
> > > +	}
> > > +	ret = done;
> > > +out:
> > > +	return ret;
> > > +}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists