linux-kernel - Re: [net-next v15 06/12] net: mtip: Add net_device

Message-ID: <20250722111639.3a53b450@wsk>
Date: Tue, 22 Jul 2025 11:16:39 +0200
From: Lukasz Majewski <lukma@...x.de>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Andrew Lunn <andrew+netdev@...n.ch>, davem@...emloft.net, Eric Dumazet
 <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Rob Herring
 <robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley
 <conor+dt@...nel.org>, Shawn Guo <shawnguo@...nel.org>, Sascha Hauer
 <s.hauer@...gutronix.de>, Pengutronix Kernel Team <kernel@...gutronix.de>,
 Fabio Estevam <festevam@...il.com>, Richard Cochran
 <richardcochran@...il.com>, netdev@...r.kernel.org,
 devicetree@...r.kernel.org, linux-kernel@...r.kernel.org,
 imx@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org, Stefan Wahren
 <wahrenst@....net>, Simon Horman <horms@...nel.org>
Subject: Re: [net-next v15 06/12] net: mtip: Add net_device_ops functions to
 the L2 switch driver

Hi Jakub,

> On Wed, 16 Jul 2025 23:47:25 +0200 Lukasz Majewski wrote:
> > +static netdev_tx_t mtip_start_xmit_port(struct sk_buff *skb,
> > +					struct net_device *dev,
> > +					int port)
> > +{
> > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > +	struct switch_enet_private *fep = priv->fep;
> > +	unsigned short status;
> > +	struct cbd_t *bdp;
> > +	void *bufaddr;
> > +
> > +	spin_lock(&fep->hw_lock);  
> 
> I see some inconsistencies in how you take this lock.
> Bunch of bare spin_lock() calls from BH context, but there's also
> a _irqsave() call in mtip_adjust_link().

In the legacy NXP (Freescale) code for this IP block (i.e. the MTIP
switch), the recommended way to set it up again when the link or duplex
changes is to reset and reconfigure it.

This requires setting up the interrupts again as well... In that
situation, IMHO, disabling system interrupts is required to avoid
undefined behaviour.

> Please align to the strictest
> context (not sure if the irqsave is actually needed, at a glance, IOW
> whether the lock is taken from an IRQ)

The spin_lock() in the xmit path is similar to what is done in
fec_main.c. As this switch uses a single uDMA for both ports, and there
is no support for (or need of) multiple queues, it can be omitted.

> 
> > +	if (!fep->link[0] && !fep->link[1]) {
> > +		/* Link is down or autonegotiation is in progress. */
> > +		netif_stop_queue(dev);
> > +		spin_unlock(&fep->hw_lock);
> > +		return NETDEV_TX_BUSY;
> > +	}
> > +
> > +	/* Fill in a Tx ring entry */
> > +	bdp = fep->cur_tx;
> > +
> > +	/* Force read memory barier on the current transmit description */
> 
> Barrier are between things. What is this barrier separating, and what
> write barrier does it pair with? As far as I can tell cur_tx is just
> a value in memory, and accesses are under ->hw_lock, so there should
> be no ordering concerns.

The bdp is the uDMA descriptor (memory allocated in the coherent DMA
area). It is used by the uDMA when data is transferred to the MTIP
switch internal buffer.

The bdp->cbd_sc is a half word, which is modified by the uDMA engine to
indicate whether there were errors or the transfer has ended.

The rmb() is meant to improve robustness - it ensures that the status
corresponds to what was set by the uDMA. On the other hand, the DMA
coherent allocation should guarantee this as well.

fec_main.c places rmb() in similar places, so I followed its
approach.
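
To make the pairing question concrete, here is a minimal userspace
sketch (not driver code) of a descriptor handoff using C11 atomics.
The struct model_bd, the BD_READY value and both helpers are
hypothetical names used only for illustration. The release store stands
in for wmb() followed by the cbd_sc write; the acquire load orders the
status read before the later payload reads, which is the ordering a
read barrier placed after the status read would give. Whether
dma_wmb()/dma_rmb() would be the better fit for coherent descriptor
memory in this driver is an assumption I have not verified.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ownership bit, modelled on BD_ENET_TX_READY. */
#define BD_READY 0x8000u

struct model_bd {
	uint32_t bufaddr;	/* plain payload field */
	uint16_t datlen;	/* plain payload field */
	_Atomic uint16_t sc;	/* status word shared with the other side */
};

/* Producer: write the payload first, then publish ownership.
 * The release store keeps the payload writes before the status write.
 */
static void bd_publish(struct model_bd *bd, uint32_t addr, uint16_t len)
{
	bd->bufaddr = addr;
	bd->datlen = len;
	atomic_store_explicit(&bd->sc, BD_READY, memory_order_release);
}

/* Consumer: read the status first, then the payload.
 * Returns 1 and fills the outputs if the descriptor was ready.
 */
static int bd_consume(struct model_bd *bd, uint32_t *addr, uint16_t *len)
{
	uint16_t sc = atomic_load_explicit(&bd->sc, memory_order_acquire);

	if (!(sc & BD_READY))
		return 0;

	*addr = bd->bufaddr;
	*len = bd->datlen;

	/* Hand the descriptor back by clearing the ownership bit. */
	atomic_store_explicit(&bd->sc, (uint16_t)(sc & ~BD_READY),
			      memory_order_release);
	return 1;
}
```

In this model the barrier always sits between two accesses (payload and
status), which is the property the review comment asks about.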

> 
> > +	rmb();
> > +	status = bdp->cbd_sc;
> > +
> > +	if (status & BD_ENET_TX_READY) {
> > +		/* All transmit buffers are full. Bail out.
> > +		 * This should not happen, since dev->tbusy should be set.
> > +		 */
> > +		netif_stop_queue(dev);
> > +		dev_err(&fep->pdev->dev, "%s: tx queue full!.\n", dev->name);  
> 
> This needs to be rate limited, we don't want to flood the logs in case
> there's a bug.

+1
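
For reference, the idea behind rate limiting can be sketched in plain C
as below: allow at most "burst" messages per "interval" and count the
rest as suppressed. This is a userspace model with hypothetical names
(rl_state, rl_allow), not the kernel implementation. In the driver the
change would presumably be to switch the dev_err() call to
dev_err_ratelimited().

```c
#include <stdbool.h>
#include <stdint.h>

struct rl_state {
	uint64_t interval;	/* window length, in the caller's time units */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages let through in this window */
	unsigned int missed;	/* total messages suppressed so far */
};

/* Returns true if a message may be printed at time "now". */
static bool rl_allow(struct rl_state *rs, uint64_t now)
{
	/* Open a new window once the current one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

A caller would wrap its logging in "if (rl_allow(&rs, now))", so a
stuck Tx ring produces a bounded number of log lines per window.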

> 
> Also at a glance it seems like you have one fep for multiple netdevs.

Yes.

> So stopping one netdev's Tx queue when fep fills up will not stop the
> other ports from pushing frames, right?

This is a bit more complicated...

Other solutions - like cpsw_new - are conceptually simple; there are
two DMAs to two separate eth IP blocks.
During startup two separate devices are created. When one wants to
enable the bridge (i.e. start in-hw offloading) - just a single bit is
set and ... that's it.

With vf610 / imx287 and MTIP it is a bit different (imx287 is even
worse, as the second ETH interface has incomplete functionality by
design).

When the switch is not active - you have two uDMA ports to two ENET IP
blocks. Full separation. That is what is done in the fec_main.c driver.

When you enable the MTIP switch - then you have just a single uDMA0
active for "both" ports. In fact you "bridge" two ports into a single
one - that is why the Freescale/NXP driver (for 2.6.y) just had eth0 to
"model" the bridged interfaces. That was "simpler" (PHY management was
done in the driver as well).

Now, in this driver, we do have two network devices, which are "bridged"
(so there is br0). And of course there must be separation between
lan0/1 when this driver is used but the bridge is not (yet) created.
This works :-)

So I do have - 2x netdevs (handled by a single uDMA0) + 2 PHYs + br0 +
NAPI + switchdev (to avoid broadcast frame storms + {R}STP + FDB -
WIP).

Just pure fun :-) to model it all ... and keep all the maintainers
happy :-)
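
The concern about the shared ring can be modelled as follows. This is
only a sketch of one possible approach (stop and wake the queues of all
ports that share the ring), with hypothetical names and sizes; it is
not what the posted driver does.

```c
#include <stdbool.h>

#define MODEL_PORTS 2	/* lan0 and lan1 */
#define MODEL_RING  4	/* hypothetical ring size */

struct model_switch {
	unsigned int used;			/* descriptors in flight */
	unsigned int tx_packets[MODEL_PORTS];	/* per-port counters */
	bool stopped[MODEL_PORTS];		/* per-netdev queue state */
};

/* Stop or wake the queue of every port sharing the ring. */
static void model_set_stopped(struct model_switch *sw, bool stopped)
{
	for (int i = 0; i < MODEL_PORTS; i++)
		sw->stopped[i] = stopped;
}

/* Returns true if the frame was queued, false if the caller must retry. */
static bool model_xmit(struct model_switch *sw, int port)
{
	if (sw->used == MODEL_RING) {
		model_set_stopped(sw, true);
		return false;
	}

	sw->used++;
	sw->tx_packets[port]++;

	/* The ring just became full: no port may push more frames. */
	if (sw->used == MODEL_RING)
		model_set_stopped(sw, true);

	return true;
}

/* Completion path: one descriptor is free again, wake every port. */
static void model_tx_done(struct model_switch *sw)
{
	if (sw->used == 0)
		return;

	sw->used--;
	model_set_stopped(sw, false);
}
```

With a single ring behind both netdevs, stopping only the queue of the
port that hit the full condition would leave the other port free to
keep calling the xmit path.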

> 
> > +		spin_unlock(&fep->hw_lock);
> > +		return NETDEV_TX_BUSY;
> > +	}
> > +
> > +	/* Clear all of the status flags */
> > +	status &= ~BD_ENET_TX_STATS;
> > +
> > +	/* Set buffer length and buffer pointer */
> > +	bufaddr = skb->data;
> > +	bdp->cbd_datlen = skb->len;
> > +
> > +	/* On some FEC implementations data must be aligned on
> > +	 * 4-byte boundaries. Use bounce buffers to copy data
> > +	 * and get it aligned.spin
> > +	 */
> > +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {  
> 
> I think you should add 
> 
> 	if ... ||
>            fep->quirks & FEC_QUIRK_SWAP_FRAME)
> 
> here. You can't modify skb->data without calling skb_cow_data()
> but you already have buffers allocated so can as well use them.

The vf610 doesn't need the frame to be swapped, but it has alignment
requirements as well.

I would keep things as they are now - as they just improve readability.

Please keep in mind that this version only supports imx287, but the
plan is to add vf610 as well (to be more specific - this driver also
works on vf610, but I plan to add those patches after this one is
accepted and pulled).
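
For clarity, the two conditions being discussed can be written as a
small standalone check. This is a userspace sketch: MODEL_ALIGN_MASK
and needs_bounce() are hypothetical names, and the 0x3 mask is an
assumption taken from the "4-byte boundaries" comment in the patch, not
the actual value of MTIP_ALIGNMENT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed mask for 4-byte alignment; the driver's value may differ. */
#define MODEL_ALIGN_MASK 0x3u

/* Returns true when the frame should be copied to a bounce buffer:
 * either the data pointer is misaligned, or the frame will be byte
 * swapped in place (the extra condition suggested in the review).
 */
static bool needs_bounce(const void *buf, bool swap_quirk)
{
	return ((uintptr_t)buf & MODEL_ALIGN_MASK) != 0 || swap_quirk;
}
```

Passing swap_quirk as false gives the check as it is in the posted
patch; passing the quirk state gives the reviewer's variant.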

> 
> > +		unsigned int index;
> > +
> > +		index = bdp - fep->tx_bd_base;
> > +		memcpy(fep->tx_bounce[index],
> > +		       (void *)skb->data, skb->len);  
> 
> this fits on one 80 char line BTW, quite easily:
> 
> 		memcpy(fep->tx_bounce[index], (void *)skb->data, skb->len);
> 
> Also the cast to void * is not necessary in C.

+1

> 
> > +		bufaddr = fep->tx_bounce[index];
> > +	}
> > +
> > +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> > +		swap_buffer(bufaddr, skb->len);
> > +
> > +	/* Save skb pointer. */
> > +	fep->tx_skbuff[fep->skb_cur] = skb;
> > +
> > +	fep->skb_cur = (fep->skb_cur + 1) & TX_RING_MOD_MASK;  
> 
> Not sure if this is buggy, but maybe delay updating things until the
> mapping succeeds? Fewer things to unwind.

Yes, storing the skb as well as updating the ring index can be done
after the DMA mapping code.
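
A rough userspace model of that ordering is below. The names
(model_ring, model_queue) and the ring size are hypothetical, and the
map_ok flag simply stands in for the result of dma_map_single() plus
dma_mapping_error(). The point is only that a failed mapping leaves the
ring state untouched, so there is nothing to unwind.

```c
#include <stdbool.h>

#define MODEL_TX_RING 4u	/* hypothetical size, power of two */

struct model_ring {
	void *skbs[MODEL_TX_RING];	/* saved skb pointers */
	unsigned int skb_cur;		/* next free slot */
	unsigned long tx_dropped;	/* frames dropped on mapping failure */
};

/* Returns true if the frame was committed to the ring. */
static bool model_queue(struct model_ring *r, void *skb, bool map_ok)
{
	/* Mapping failed: count the drop, leave the ring as it was. */
	if (!map_ok) {
		r->tx_dropped++;
		return false;
	}

	/* Only now save the skb and advance the index. */
	r->skbs[r->skb_cur] = skb;
	r->skb_cur = (r->skb_cur + 1) & (MODEL_TX_RING - 1);
	return true;
}
```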

> 
> > +	/* Push the data cache so the CPM does not get stale memory
> > +	 * data.
> > +	 */
> > +	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> > +					  MTIP_SWITCH_TX_FRSIZE,
> > +					  DMA_TO_DEVICE);
> > +	if (unlikely(dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr))) {
> > +		dev_err(&fep->pdev->dev,
> > +			"Failed to map descriptor tx buffer\n");
> > +		dev->stats.tx_errors++;
> > +		dev->stats.tx_dropped++;  
> 
> dropped and errors are two different counters
> I'd stick to dropped

Ok.

> 
> > +		dev_kfree_skb_any(skb);
> > +		goto err;
> > +	}
> > +
> > +	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
> > +	 * it's the last BD of the frame, and to put the CRC on the end.
> > +	 */
> > +
> > +	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
> > +			| BD_ENET_TX_LAST | BD_ENET_TX_TC);  
> 
> The | goes at the end of the previous line, start of new line adjusts 
> to the opening brackets..
> 

I've refactored it.

> > +
> > +	/* Synchronize all descriptor writes */
> > +	wmb();
> > +	bdp->cbd_sc = status;
> > +
> > +	netif_trans_update(dev);  
> 
> Is this call necessary?

I added it when I was forward-porting the old driver. It can be
removed.

> 
> > +	skb_tx_timestamp(skb);
> > +
> > +	/* Trigger transmission start */
> > +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);
> > +
> > +	dev->stats.tx_bytes += skb->len;
> > +	/* If this was the last BD in the ring,
> > +	 * start at the beginning again.
> > +	 */
> > +	if (status & BD_ENET_TX_WRAP)
> > +		bdp = fep->tx_bd_base;
> > +	else
> > +		bdp++;
> > +
> > +	if (bdp == fep->dirty_tx) {
> > +		fep->tx_full = 1;
> > +		netif_stop_queue(dev);
> > +	}
> > +
> > +	fep->cur_tx = bdp;
> > + err:
> > +	spin_unlock(&fep->hw_lock);
> > +
> > +	return NETDEV_TX_OK;
> > +}  


Thanks for the feedback.

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH, Managing Director: Johanna Denk,
Tabea Lutz HRB 165235 Munich, Office: Kirchenstr.5, D-82194
Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@...x.de
