linux-kernel - Re: [PATCH iwl-next v2 5/9] igc: Add support for frame preemption verification

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250205171234.cuscjpzdyc34ofbn@skbuf>
Date: Wed, 5 Feb 2025 19:12:34 +0200
From: Vladimir Oltean <vladimir.oltean@....com>
To: Faizal Rahim <faizal.abdul.rahim@...ux.intel.com>
Cc: Tony Nguyen <anthony.l.nguyen@...el.com>,
	Przemek Kitszel <przemyslaw.kitszel@...el.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S . Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Maxime Coquelin <mcoquelin.stm32@...il.com>,
	Alexandre Torgue <alexandre.torgue@...s.st.com>,
	Simon Horman <horms@...nel.org>,
	Russell King <linux@...linux.org.uk>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	John Fastabend <john.fastabend@...il.com>,
	Russell King <rmk+kernel@...linux.org.uk>,
	Furong Xu <0x1207@...il.com>, Serge Semin <fancer.lancer@...il.com>,
	Xiaolei Wang <xiaolei.wang@...driver.com>,
	Suraj Jaiswal <quic_jsuraj@...cinc.com>,
	Kory Maincent <kory.maincent@...tlin.com>,
	Petr Tesarik <petr@...arici.cz>,
	Choong Yong Liang <yong.liang.choong@...ux.intel.com>,
	Kunihiko Hayashi <hayashi.kunihiko@...ionext.com>,
	Vinicius Costa Gomes <vinicius.gomes@...el.com>,
	intel-wired-lan@...ts.osuosl.org, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	linux-stm32@...md-mailman.stormreply.com,
	linux-arm-kernel@...ts.infradead.org, bpf@...r.kernel.org
Subject: Re: [PATCH iwl-next v2 5/9] igc: Add support for frame preemption
 verification

On Wed, Feb 05, 2025 at 05:05:20AM -0500, Faizal Rahim wrote:
> This patch implements the "ethtool --set-mm" callback to trigger the
> frame preemption verification handshake.
> 
> Uses the MAC Merge Software Verification (mmsv) mechanism in ethtool
> to perform the verification handshake for igc.
> The structure fpe.mmsv is set by mmsv in ethtool and should remain
> read-only for the driver.
> 
> igc does not use two mmsv callbacks:
> a) configure_tx()
>    - igc lacks registers to configure FPE in the transmit direction.

Yes, maybe, but it's still important to handle this. It tells you when
the preemptible traffic classes should be sent as preemptible on the wire
(i.e. when the verification is either disabled, or it succeeded).

There is a selftest called manual_failed_verification() which supposedly
tests this exact condition: if verification fails, then packets sent to
TC0 are supposed to bump the eMAC's TX counters, even though TC0 is
configured as preemptible. Otherwise stated: even if the tc program says
that a certain traffic class is preemptible, you don't want to actually
send preemptible packets if you haven't verified the link partner can
handle them, since it will likely drop them on RX otherwise.

The problem with that selftest is that it relies on the driver's ability
to report split ethtool counters for eMAC/pMAC, rather than just aggregate.
In lack of that, you need to know through some other mechanism whether
the link partner received those packets as preemptible or express, and
that's kind of hard to add to a selftest.

> b) configure_pmac()
>    - this callback dynamically controls pmac_enabled at runtime. For
>      example, mmsv calls configure_pmac() and disables pmac_enabled when
>      the link partner goes down, even if the user previously enabled it.
>      The intention is to save power but it is not feasible in igc
>      because it causes an endless adapter reset loop:
> 
> 	 1) Board A and Board B complete the verification handshake.
>             Tx mode register for both boards are in TSN mode.
> 	 2) Board B link goes down.
> 
>          On Board A:
> 	 3) mmsv calls configure_pmac() with pmac_enabled = false.
> 	 4) configure_pmac() in igc updates a new field based on
>             pmac_enabled. Driver uses this field in igc_tsn_new_flags()
>             to indicate that the user enabled/disabled FPE.
> 	 5) configure_pmac() in igc calls igc_tsn_offload_apply() to check
>             whether an adapter reset is needed. Calls existing logic in
>             igc_tsn_will_tx_mode_change() and igc_tsn_new_flags().
> 	 6) Since pmac_enabled is now disabled and no other TSN feature
>             is active, igc_tsn_will_tx_mode_change() evaluates to true
>             because Tx mode will switch from TSN to Legacy.
> 	 7) Driver resets the adapter.
> 	 8) Registers are set, and Tx mode switches to Legacy.
> 	 9) When link partner is up, steps 3–8 repeat, but this time
>             with pmac_enabled = true, reactivating TSN.
>             igc_tsn_will_tx_mode_change() evaluates to true again,
>             since Tx mode will switch from Legacy to TSN.
> 	10) Driver resets the adapter.
> 	11) Rest adapter completes, registers are set, and Tx mode
> 	    switches to TSN.
> 
>         On Board B:
> 	12) Adapter reset on Board A at step 10 causes it to detect its
>             link partner as down.

Is this using fiber/DAC, or twisted pair (RJ45 PHY)?

> 	13) Repeats steps 3–8.
> 	14) Once reset adapter on Board A is completed at step 11, it
>             detects its link partner as up.
> 	15) Repeats steps 9–11.
> 
>    - this cycle repeats indefinitely. To avoid this issue, IGC only uses
>      mmsv.pmac_enabled to track whether FPE is enabled or disabled.

Not that I couldn't eventually tolerate this workaround, but I figure
it's worth asking anyway. Isn't there a way to do an adapter reset
without losing link in the PHY (assuming it's the PHY that loses link)?
Is that a consequence of software design that's not careful enough, or
of hardware design? I assume the PHY is a discrete component. Avoiding
the need to retrigger auto-negotiation saves a few seconds of waiting,
so it's worth pursuing if it's possible at all.

> Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> Co-developed-by: Choong Yong Liang <yong.liang.choong@...ux.intel.com>
> Signed-off-by: Choong Yong Liang <yong.liang.choong@...ux.intel.com>
> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@...ux.intel.com>
> ---
> diff --git a/drivers/net/ethernet/intel/igc/igc_ethtool.c b/drivers/net/ethernet/intel/igc/igc_ethtool.c
> index 817838677817..41cb03816f92 100644
> --- a/drivers/net/ethernet/intel/igc/igc_ethtool.c
> +++ b/drivers/net/ethernet/intel/igc/igc_ethtool.c
> @@ -1781,6 +1782,23 @@ static int igc_ethtool_set_eee(struct net_device *netdev,
>  	return 0;
>  }
>  
> +static int igc_ethtool_set_mm(struct net_device *netdev,
> +			      struct ethtool_mm_cfg *cmd,
> +			      struct netlink_ext_ack *extack)
> +{
> +	struct igc_adapter *adapter = netdev_priv(netdev);
> +	struct fpe_t *fpe = &adapter->fpe;
> +
> +	ethtool_mmsv_set_mm(&fpe->mmsv, cmd);
> +
> +	if (fpe->mmsv.pmac_enabled)
> +		static_branch_enable(&igc_fpe_enabled);
> +	else
> +		static_branch_disable(&igc_fpe_enabled);

I don't think the API is correctly used here. If there are 2 igc cards
in the system and one gets an ethtool call with pmac_enabled=false, it
will break the static branch for the other card. You need to use
static_branch_inc() and static_branch_dec(), and react only on changes
(old fpe->mmsv.pmac_enabled ^ cmd->pmac_enabled).

> +
> +	return igc_tsn_offload_apply(adapter);
> +}
> +
>  static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
>  					  struct ethtool_link_ksettings *cmd)
>  {
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index 44e4f925491f..396410522e01 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -2618,6 +2618,15 @@ static int igc_clean_rx_irq(struct igc_q_vector *q_vector, const int budget)
>  			size -= IGC_TS_HDR_LEN;
>  		}
>  
> +		if (static_branch_unlikely(&igc_fpe_enabled) &&
> +		    igc_fpe_is_verify_or_response(rx_desc, size)) {

FWIW, the igc_fpe_enabled static branch is kernel-wide and does not
necessarily mean that FPE is enabled on _this_ NIC (that would still be
adapter->fpe.mmsv.pmac_enabled).

I don't think it's very safe to call ethtool_mmsv events when it doesn't
expect them (pmac_enabled = false), at least from a maintainance perspective
it would be easier not to do that.

I'm just ballparking this, but I think igc_fpe_is_verify_or_response() is
also just a bit more computationally expensive than just checking
adapter->fpe.mmsv.pmac_enabled for the "normal" case where FPE isn't
enabled, so it might make sense to add this as an extra check anyway.

> +			igc_fpe_lp_event_status(rx_desc, &adapter->fpe.mmsv);
> +			/* Advance the ring next-to-clean */
> +			igc_is_non_eop(rx_ring, rx_desc);
> +			cleaned_count++;
> +			continue;
> +		}
> +
>  		if (!skb) {
>  			xdp_init_buff(&ctx.xdp, truesize, &rx_ring->xdp_rxq);
>  			xdp_prepare_buff(&ctx.xdp, pktbuf - igc_rx_offset(rx_ring),
> diff --git a/drivers/net/ethernet/intel/igc/igc_tsn.c b/drivers/net/ethernet/intel/igc/igc_tsn.c
> index f0213cfce07d..52ff6f324d79 100644
> --- a/drivers/net/ethernet/intel/igc/igc_tsn.c
> +++ b/drivers/net/ethernet/intel/igc/igc_tsn.c
> @@ -1,10 +1,153 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright (c)  2019 Intel Corporation */
>  
> +#include <linux/kernel.h>
>  #include "igc.h"
> +#include "igc_base.h"
>  #include "igc_hw.h"
>  #include "igc_tsn.h"
>  
> +DEFINE_STATIC_KEY_FALSE(igc_fpe_enabled);
> +
> +static int igc_fpe_init_smd_frame(struct igc_ring *ring,
> +				  struct igc_tx_buffer *buffer,
> +				  struct sk_buff *skb)
> +{
> +	dma_addr_t dma = dma_map_single(ring->dev, skb->data, skb->len,
> +					DMA_TO_DEVICE);
> +
> +	if (dma_mapping_error(ring->dev, dma)) {
> +		netdev_err_once(ring->netdev, "Failed to map DMA for TX\n");
> +		return -ENOMEM;
> +	}
> +
> +	buffer->skb = skb;
> +	buffer->protocol = 0;
> +	buffer->bytecount = skb->len;
> +	buffer->gso_segs = 1;
> +	buffer->time_stamp = jiffies;
> +	dma_unmap_len_set(buffer, len, skb->len);
> +	dma_unmap_addr_set(buffer, dma, dma);
> +
> +	return 0;
> +}
> +
> +static int igc_fpe_init_tx_descriptor(struct igc_ring *ring,
> +				      struct sk_buff *skb, int type)
> +{
> +	struct igc_tx_buffer *buffer;
> +	union igc_adv_tx_desc *desc;
> +	u32 cmd_type, olinfo_status;
> +	int err;
> +
> +	if (!igc_desc_unused(ring))
> +		return -EBUSY;
> +
> +	if (type != IGC_TXD_POPTS_SMD_V && type != IGC_TXD_POPTS_SMD_R)
> +		return -EINVAL;

It should really be sufficient to guarantee at compile time that the argument
is valid (it never comes from user input, so it's not hard). Error handling
for a condition that can't be triggered shouldn't exist, except for the
very specific case of future proofing loosely coupled software components
(what one would call "API", not the case here).

I guess a concern is what happens when, for any reason, you add a new
type of packet that you can transmit, and igc_fpe_init_tx_descriptor()
isn't updated to handle it yet. If you make "int type" a proper enum
type, the compiler will automatically warn when an enum value isn't
handled in the "switch (type)" block, because there's no "default" case.
I think that's actually better than finding this out at runtime, so I
would recommend using an enum for stronger typing.

Warning: it's still only a hint and thus not perfect, in that you
could pass an out-of-range constant literal like "42" and the compiler
wouldn't warn.

> +
> +	buffer = &ring->tx_buffer_info[ring->next_to_use];
> +	err = igc_fpe_init_smd_frame(ring, buffer, skb);
> +	if (err)
> +		return err;
> +
> +	cmd_type = IGC_ADVTXD_DTYP_DATA | IGC_ADVTXD_DCMD_DEXT |
> +		   IGC_ADVTXD_DCMD_IFCS | IGC_TXD_DCMD |
> +		   buffer->bytecount;
> +	olinfo_status = buffer->bytecount << IGC_ADVTXD_PAYLEN_SHIFT;

Still some FIELD_PREP() conversions to do here?

> +
> +	switch (type) {
> +	case IGC_TXD_POPTS_SMD_V:
> +		olinfo_status |= (IGC_TXD_POPTS_SMD_V << IGC_TXD_POPTS_SMD_SHIFT);
> +		break;
> +	case IGC_TXD_POPTS_SMD_R:
> +		olinfo_status |= (IGC_TXD_POPTS_SMD_R << IGC_TXD_POPTS_SMD_SHIFT);
> +		break;
> +	}

What I just said above is a bit diluted by the fact that this doesn't
have to be a switch/case block. It's just "olinfo_status |= type << IGC_TXD_POPTS_SMD_SHIFT"
(also, parentheses are not necessary).

As a mere suggestion, I would write this as:

	switch (type) { // "type" being an enum
	case IGC_TXD_POPTS_SMD_V:
	case IGC_TXD_POPTS_SMD_R:
		olinfo_status |= FIELD_PREP(IGC_TXD_POPTS_SMD_MASK, type);
		break;
	}

> +
> +	desc = IGC_TX_DESC(ring, ring->next_to_use);
> +	desc->read.cmd_type_len = cpu_to_le32(cmd_type);
> +	desc->read.olinfo_status = cpu_to_le32(olinfo_status);
> +	desc->read.buffer_addr = cpu_to_le64(dma_unmap_addr(buffer, dma));
> +
> +	netdev_tx_sent_queue(txring_txq(ring), skb->len);
> +
> +	buffer->next_to_watch = desc;
> +	ring->next_to_use = (ring->next_to_use + 1) % ring->count;
> +
> +	return 0;
> +}
> +
> +bool igc_fpe_transmitted_smd_v(union igc_adv_tx_desc *tx_desc)
> +{
> +	u8 smd = FIELD_GET(IGC_TXD_POPTS_SMD_MASK, tx_desc->read.olinfo_status);
> +
> +	return smd == IGC_TXD_POPTS_SMD_V;
> +}
> +
> +static int igc_fpe_xmit_smd_frame(struct igc_adapter *adapter, int type)
> +{
> +	int cpu = smp_processor_id();
> +	struct netdev_queue *nq;
> +	struct igc_ring *ring;
> +	struct sk_buff *skb;
> +	void *data;
> +	int err;
> +
> +	if (!netif_running(adapter->netdev))
> +		return -ENOTCONN;

Have you seen a code path that would require protecting the transmission
using this conditional?

I get the feeling that this is defensive programming when we could give
guarantees in other ways. But maybe there are some race conditions I'm
not seeing.

ethtool_mmsv_send_mpacket() is called from ethtool_mmsv_verify_timer()
and from ethtool_mmsv_event_handle(event == ETHTOOL_MMSV_LP_SENT_VERIFY_MPACKET).

When called from ethtool_mmsv_verify_timer(), it means the verify timer
has been armed, which only happens if (netif_running(mmsv->dev)). After
netif_running() becomes false (aka __LINK_STATE_START is cleared, aka in
ndo_stop()), you have a chance to call ethtool_mmsv_stop(), which stops
the verification timer. I see you're not doing that. Is there a reason?

> +
> +	ring = igc_get_tx_ring(adapter, cpu);
> +	nq = txring_txq(ring);
> +
> +	skb = alloc_skb(SMD_FRAME_SIZE, GFP_KERNEL);
> +	if (!skb)
> +		return -ENOMEM;
> +
> +	data = skb_put(skb, SMD_FRAME_SIZE);
> +	memset(data, 0, SMD_FRAME_SIZE);

skb_put_zero()

> +
> +	__netif_tx_lock(nq, cpu);
> +
> +	err = igc_fpe_init_tx_descriptor(ring, skb, type);
> +	igc_flush_tx_descriptors(ring);
> +
> +	__netif_tx_unlock(nq);
> +
> +	return err;
> +}
> +
> +static void igc_fpe_send_mpacket(struct ethtool_mmsv *mmsv,
> +				 enum ethtool_mpacket type)
> +{
> +	struct fpe_t *fpe = container_of(mmsv, struct fpe_t, mmsv);
> +	struct igc_adapter *adapter;
> +	int err;
> +
> +	adapter = container_of(fpe, struct igc_adapter, fpe);
> +
> +	if (type == ETHTOOL_MPACKET_VERIFY) {
> +		err = igc_fpe_xmit_smd_frame(adapter, IGC_TXD_POPTS_SMD_V);
> +		if (err)
> +			netdev_err(adapter->netdev, "Error sending SMD-V\n");
> +	} else if (type == ETHTOOL_MPACKET_RESPONSE) {
> +		err = igc_fpe_xmit_smd_frame(adapter, IGC_TXD_POPTS_SMD_R);
> +		if (err)
> +			netdev_err(adapter->netdev, "Error sending SMD-R frame\n");

Can you please rate limit these prints?

> +	}
> +}