Date: Thu, 7 Dec 2023 08:22:53 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: "Cao, Yahui" <yahui.cao@...el.com>, "intel-wired-lan@...ts.osuosl.org"
	<intel-wired-lan@...ts.osuosl.org>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, "Liu, Lingyu" <lingyu.liu@...el.com>, "Chittim,
 Madhu" <madhu.chittim@...el.com>, "Samudrala, Sridhar"
	<sridhar.samudrala@...el.com>, "alex.williamson@...hat.com"
	<alex.williamson@...hat.com>, "jgg@...dia.com" <jgg@...dia.com>,
	"yishaih@...dia.com" <yishaih@...dia.com>,
	"shameerali.kolothum.thodi@...wei.com"
	<shameerali.kolothum.thodi@...wei.com>, "brett.creeley@....com"
	<brett.creeley@....com>, "davem@...emloft.net" <davem@...emloft.net>,
	"edumazet@...gle.com" <edumazet@...gle.com>, "kuba@...nel.org"
	<kuba@...nel.org>, "pabeni@...hat.com" <pabeni@...hat.com>
Subject: RE: [PATCH iwl-next v4 09/12] ice: Save and load TX Queue head

> From: Cao, Yahui <yahui.cao@...el.com>
> Sent: Tuesday, November 21, 2023 10:51 AM
> 
> To advance the TX queue head, HW needs to touch memory by DMA. But
> directly touching the VM's memory to advance the TX queue head does not
> follow the vfio migration protocol design, because vIOMMU state is not
> defined by the protocol. This may even introduce functional and security
> issues under hostile guest circumstances.

This limitation is not restricted to the vIOMMU case. Even when a vIOMMU
is absent there is still no guarantee that the GPA address space has been
re-attached to this device.

> 
> In order not to touch any VF memory or IO page table, TX queue head
> loading uses PF-managed memory and the PF isolation domain. This will

The PF doesn't manage memory. It's probably clearer to say that the TX
queue is temporarily moved to the PF while the head is being restored.

> also introduce another dependency: while switching the TX queue between
> PF space and VF space, the TX queue head value must not change. HW provides
> indirect context access so that the head value can be kept while
> switching contexts.
> 
> In the virtual channel model, the VF driver only sends the TX queue ring
> base and length info to the PF, while the rest of the TX queue context is
> managed by the PF. The TX queue length must be verified by the PF during
> virtual channel message processing. When the PF uses dummy descriptors to
> advance the TX head, it will configure the TX ring base as a new address
> managed by the PF itself. As a result, all of the TX queue context is under
> the PF's control and this method won't open any attack vulnerability.

So basically the key points are (a small model follows the list):

1) the TX queue head cannot be directly updated via the VF MMIO interface;
2) using dummy descriptors to update the TX queue head is possible, but it
    must be done in the PF's context;
3) FW provides a way to keep the TX queue head intact when moving
    ownership of the TX queue between VF and PF;
4) the TX queue context affected by the ownership change is largely
    initialized by the PF driver already, except for the ring base/size coming
    from virtual channel messages. This implies that a malicious guest VF
    driver cannot attack this small window, even though the TX head restore
    is done after all the VF state is restored;
5) a point missing above is that the temporary ownership change doesn't
    expose the TX queue to the software stack on top of the PF driver;
    otherwise that would be a severe issue.

> +static int
> +ice_migration_save_tx_head(struct ice_vf *vf,
> +			   struct ice_migration_dev_state *devstate)
> +{
> +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
> +	struct ice_pf *pf = vf->pf;
> +	struct device *dev;
> +	int i = 0;
> +
> +	dev = ice_pf_to_dev(pf);
> +
> +	if (!vsi) {
> +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
> +		return -EINVAL;
> +	}
> +
> +	ice_for_each_txq(vsi, i) {
> +		u16 tx_head;
> +		u32 reg;
> +
> +		devstate->tx_head[i] = 0;
> +		if (!test_bit(i, vf->txq_ena))
> +			continue;
> +
> +		reg = rd32(&pf->hw, QTX_COMM_HEAD(vsi->txq_map[i]));
> +		tx_head = (reg & QTX_COMM_HEAD_HEAD_M)
> +					>> QTX_COMM_HEAD_HEAD_S;
> +
> +		/* 1. If TX head is the QTX_COMM_HEAD_HEAD_M marker, which
> +		 *    means it is the value written by software and no
> +		 *    descriptor write-back has happened, then no packets
> +		 *    have been sent since the queue was enabled.

It's unclear why it's not zero when no packets have been sent.

> +static int
> +ice_migration_inject_dummy_desc(struct ice_vf *vf, struct ice_tx_ring *tx_ring,
> +				u16 head, dma_addr_t tx_desc_dma)

Based on its intent, this would read more clearly as:

	ice_migration_restore_tx_head()


> +
> +	/* 1.3 Disable TX queue interrupt */
> +	wr32(hw, QINT_TQCTL(tx_ring->reg_idx), QINT_TQCTL_ITR_INDX_M);
> +
> +	/* To disable tx queue interrupt during run time, software should
> +	 * write mmio to trigger a MSIX interrupt.
> +	 */
> +	if (tx_ring->q_vector)
> +		wr32(hw, GLINT_DYN_CTL(tx_ring->q_vector->reg_idx),
> +		     (ICE_ITR_NONE << GLINT_DYN_CTL_ITR_INDX_S) |
> +		     GLINT_DYN_CTL_SWINT_TRIG_M |
> +		     GLINT_DYN_CTL_INTENA_M);

This needs more explanation, as it's not intuitive to disable an interrupt
by triggering another interrupt.

> +
> +	ice_for_each_txq(vsi, i) {
> +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
> +		u16 *tx_heads = devstate->tx_head;
> +
> +		/* 1. Skip if TX Queue is not enabled */
> +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
> +			continue;
> +
> +		if (tx_heads[i] >= tx_ring->count) {
> +			dev_err(dev, "VF %d: invalid tx ring length to load\n",
> +				vf->vf_id);
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +
> +		/* Dummy descriptors must be re-initialized after use, since
> +		 * they may be written back by HW
> +		 */
> +		ice_migration_init_dummy_desc(tx_desc, ring_len, tx_pkt_dma);
> +		ret = ice_migration_inject_dummy_desc(vf, tx_ring, tx_heads[i],
> +						      tx_desc_dma);
> +		if (ret)
> +			goto err;
> +	}
> +
> +err:
> +	dma_free_coherent(dev, ring_len * sizeof(struct ice_tx_desc),
> +			  tx_desc, tx_desc_dma);
> +	dma_free_coherent(dev, SZ_4K, tx_pkt, tx_pkt_dma);
> +
> +	return ret;

There is no error unwinding for the TX ring context itself.
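
Just to illustrate the pattern I mean with a standalone toy (plain C; the
helpers are made up and unrelated to the real ice functions): if the loop
bails out half-way, the queues that were already switched over need to be
handed back before returning, typically by unwinding backwards from the
failing index.

#include <stdbool.h>
#include <stdio.h>

#define NUM_TXQ 4

/* Stand-ins for "move queue i into the PF context" and "hand it back";
 * purely illustrative.
 */
static bool claim_queue(int i)   { return i != 2; }	/* pretend queue 2 fails */
static void release_queue(int i) { printf("unwound queue %d\n", i); }

int main(void)
{
	int i;

	for (i = 0; i < NUM_TXQ; i++) {
		if (!claim_queue(i))
			goto err;
	}
	return 0;

err:
	/* unwind only the queues that were successfully claimed */
	while (--i >= 0)
		release_queue(i);
	return 1;
}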

> +
> +	/* Only load the TX Queue head after rest of device state is loaded
> +	 * successfully.
> +	 */

"otherwise it might be changed by virtual channel messages e.g. reset"

> @@ -1351,6 +1351,24 @@ static int ice_vc_ena_qs_msg(struct ice_vf *vf, u8 *msg)
>  			continue;
> 
>  		ice_vf_ena_txq_interrupt(vsi, vf_q_id);
> +
> +		/* TX head register is a shadow copy of on-die TX head which
> +		 * maintains the accurate location. And TX head register is
> +		 * updated only after a packet is sent. If nothing is sent
> +		 * after the queue is enabled, then the value is the one
> +		 * updated last time and out-of-date.

when is "last time"? Is it even not updated upon reset?

or does it talk about a disable-enable sequence in which the real TX head
is left with a stale value from last enable?

> +		 *
> +		 * QTX_COMM_HEAD.HEAD values ranging from 0x1fe0 to 0x1fff are
> +		 * reserved and will never be used by HW. Manually write a
> +		 * reserved value into TX head and use this as a marker for
> +		 * the case that no packets have been sent.

Why use a reserved value instead of setting it to 0?

> +		 *
> +		 * This marker is only used in live migration use case.
> +		 */
> +		if (vf->migration_enabled)
> +			wr32(&vsi->back->hw,
> +			     QTX_COMM_HEAD(vsi->txq_map[vf_q_id]),
> +			     QTX_COMM_HEAD_HEAD_M);
