netdev - Re: [net-next v15 06/12] net: mtip: Add net_device

Message-ID: <20250723220517.063c204b@wsk>
Date: Wed, 23 Jul 2025 22:05:17 +0200
From: Lukasz Majewski <lukma@...x.de>
To: Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Cc: Andrew Lunn <andrew+netdev@...n.ch>, davem@...emloft.net, Eric Dumazet
 <edumazet@...gle.com>, Rob Herring <robh@...nel.org>, Krzysztof Kozlowski
 <krzk+dt@...nel.org>, Conor Dooley <conor+dt@...nel.org>, Shawn Guo
 <shawnguo@...nel.org>, Sascha Hauer <s.hauer@...gutronix.de>, Pengutronix
 Kernel Team <kernel@...gutronix.de>, Fabio Estevam <festevam@...il.com>,
 Richard Cochran <richardcochran@...il.com>, netdev@...r.kernel.org,
 devicetree@...r.kernel.org, linux-kernel@...r.kernel.org,
 imx@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org, Stefan Wahren
 <wahrenst@....net>, Simon Horman <horms@...nel.org>
Subject: Re: [net-next v15 06/12] net: mtip: Add net_device_ops functions to
 the L2 switch driver

Hi Jakub, Paolo,

Do you have more comments and questions regarding this driver after my
explanation?

Shall I do something more?

Thanks in advance for your feedback.

> Hi Jakub,
> 
> > On Wed, 16 Jul 2025 23:47:25 +0200 Lukasz Majewski wrote:  
> > > +static netdev_tx_t mtip_start_xmit_port(struct sk_buff *skb,
> > > +					struct net_device *dev,
> > > +					int port)
> > > +{
> > > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > > +	struct switch_enet_private *fep = priv->fep;
> > > +	unsigned short status;
> > > +	struct cbd_t *bdp;
> > > +	void *bufaddr;
> > > +
> > > +	spin_lock(&fep->hw_lock);    
> > 
> > I see some inconsistencies in how you take this lock.
> > Bunch of bare spin_lock() calls from BH context, but there's also
> > a _irqsave() call in mtip_adjust_link().  
> 
> In the legacy NXP (Freescale) code for this IP block (i.e. the MTIP
> switch), the recommended way to handle a link or duplex change is to
> reset and reconfigure the switch.
> 
> That requires setting up the interrupts again as well... In that
> situation, IMHO, disabling interrupts is required to avoid undefined
> behaviour.
> 
> > Please align to the strictest
> > context (not sure if the irqsave is actually needed, at a glance,
> > IOW whether the lock is taken from an IRQ)  
> 
> The spin_lock() in the xmit path is similar to what is done in
> fec_main.c. As this switch uses a single uDMA for both ports, and
> there is no support (or need) for multiple queues, it can be omitted.
> 
> >   
> > > +	if (!fep->link[0] && !fep->link[1]) {
> > > +		/* Link is down or autonegotiation is in progress. */
> > > +		netif_stop_queue(dev);
> > > +		spin_unlock(&fep->hw_lock);
> > > +		return NETDEV_TX_BUSY;
> > > +	}
> > > +
> > > +	/* Fill in a Tx ring entry */
> > > +	bdp = fep->cur_tx;
> > > +
> > > +	/* Force read memory barrier on the current transmit description */
> > 
> > Barriers are between things. What is this barrier separating, and
> > what write barrier does it pair with? As far as I can tell cur_tx
> > is just a value in memory, and accesses are under ->hw_lock, so
> > there should be no ordering concerns.  
> 
> The bdp is the uDMA descriptor (memory allocated in the coherent DMA
> area). It is used by the uDMA when data is transferred to the MTIP
> switch internal buffer.
> 
> The bdp->cbd_sc is a half word, which is modified by the uDMA engine
> to indicate whether there are errors or the transfer has ended.
> 
> The rmb() should improve robustness - it ensures that the status
> corresponds to what was set by the uDMA. On the other hand, the
> coherent DMA allocation should do this as well.
> 
> The fec_main.c driver places rmb() in similar places, so I followed
> its approach.
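
To make the "what pairs with what" question concrete, here is a minimal userspace sketch using C11 atomics. It is not kernel code: the struct layout, the BD_READY value and all three helpers are hypothetical stand-ins. The release store models the "wmb(); bdp->cbd_sc = status;" hand-over, and the acquire load models a read barrier placed between reading the status word and reading the other descriptor fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BD_READY 0x8000u	/* hypothetical "owned by the DMA engine" bit */

/* Hypothetical descriptor layout, loosely modelled on a cbd_t-like BD. */
struct bd {
	uint32_t bufaddr;
	uint16_t datlen;
	_Atomic uint16_t sc;	/* status/control half word */
};

/* CPU side: fill the payload fields first, then hand the BD over.
 * The release store orders the field writes before the status write.
 */
static void bd_publish(struct bd *bd, uint32_t addr, uint16_t len)
{
	assert(len > 0);
	bd->bufaddr = addr;
	bd->datlen = len;
	atomic_store_explicit(&bd->sc, BD_READY, memory_order_release);
}

/* Stand-in for the DMA engine finishing the transfer. */
static void bd_complete(struct bd *bd)
{
	atomic_store_explicit(&bd->sc, 0, memory_order_release);
}

/* CPU side: read the status first and touch the other fields only if
 * the engine has released the BD. The acquire load orders those later
 * reads after the status read.
 */
static bool bd_reclaim(struct bd *bd, uint16_t *len)
{
	uint16_t sc = atomic_load_explicit(&bd->sc, memory_order_acquire);

	if (sc & BD_READY)
		return false;	/* still owned by the engine */
	*len = bd->datlen;
	return true;
}
```

In this model the barrier sits between the status read and the dependent reads. A barrier placed before the first read of the status word has nothing earlier to order against.
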
> 
> >   
> > > +	rmb();
> > > +	status = bdp->cbd_sc;
> > > +
> > > +	if (status & BD_ENET_TX_READY) {
> > > +		/* All transmit buffers are full. Bail out.
> > > +		 * This should not happen, since dev->tbusy should be set.
> > > +		 */
> > > +		netif_stop_queue(dev);
> > > +		dev_err(&fep->pdev->dev, "%s: tx queue full!.\n",
> > > +			dev->name);
> > 
> > This needs to be rate limited, we don't want to flood the logs in
> > case there's a bug.  
> 
> +1
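
In the driver itself this would most likely be a switch to an existing kernel helper such as dev_err_ratelimited(). To illustrate the idea behind it, below is a small userspace sketch of interval/burst rate limiting. The struct, its field names and ratelimit_allow() are hypothetical and only model the behaviour; they are not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of interval/burst rate limiting. */
struct ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages emitted in this window */
	unsigned int missed;	/* messages suppressed in this window */
};

/* Return true if a message may be emitted at time 'now'. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	assert(rl->interval > 0);

	/* Open a new window once the current one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

With a burst of 2 and an interval of 5 ticks, the third message inside one window is suppressed and counted, and logging resumes when the next window opens.
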
> 
> > 
> > Also at a glance it seems like you have one fep for multiple
> > netdevs.  
> 
> Yes.
> 
> > So stopping one netdev's Tx queue when fep fills up will not stop
> > the other ports from pushing frames, right?  
> 
> This is a bit more complicated...
> 
> Other solutions - like cpsw_new - are conceptually simple; there are
> two DMAs to two separate eth IP blocks.
> During startup two separate devices are created. When one wants to
> enable the bridge (i.e. start in-hw offloading), just a single bit is
> set and ... that's it.
> 
> With vf610 / imx287 and MTIP it is a bit different (imx287 is even
> worse, as the second ETH interface has incomplete functionality by
> design).
> 
> When the switch is not active, you have two uDMA ports to two ENET IP
> blocks. Full separation. That is what is done with the fec_main.c
> driver.
> 
> When you enable the MTIP switch, you have just a single uDMA0 active
> for "both" ports. In fact you "bridge" two ports into a single one -
> that is why the Freescale/NXP driver (for 2.6.y) just had eth0 to
> "model" the bridged interfaces. That was "simpler" (PHY management was
> done in the driver as well).
> 
> Now, in this driver, we do have two network devices, which are
> "bridged" (so there is br0). And of course there must be separation
> between lan0/1 when this driver is used but the bridge is not (yet)
> created. This works :-)
> 
> 
> So I do have: 2x netdevs (handled by a single uDMA0) + 2 PHYs + br0 +
> NAPI + switchdev (to avoid broadcast frame storms + {R}STP + FDB -
> WIP).
> 
> 
> Just pure fun :-) to model it all ... and make happy all maintainers
> :-)
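
Coming back to the question about one fep serving several netdevs: one possible way to handle it (an assumption on my side, not what the posted patch does) is to stop and wake the Tx queues of all ports together whenever the shared ring fills up or drains. The userspace sketch below models only that bookkeeping. The types, RING_SIZE and the function names are hypothetical; in the driver the flag updates would correspond to netif_stop_queue() / netif_wake_queue() on each port's net_device.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PORTS 2
#define RING_SIZE 4	/* hypothetical number of Tx descriptors */

struct port {
	bool queue_stopped;
};

/* One descriptor ring shared by all ports, as with a single uDMA. */
struct shared_ring {
	unsigned int used;
	struct port ports[NUM_PORTS];
};

/* Stop or wake every port's queue, since they all feed the same ring. */
static void ring_set_stopped(struct shared_ring *r, bool stopped)
{
	for (int i = 0; i < NUM_PORTS; i++)
		r->ports[i].queue_stopped = stopped;
}

/* Queue one frame from 'port'; returns false if it must be retried. */
static bool ring_xmit(struct shared_ring *r, int port)
{
	assert(port >= 0 && port < NUM_PORTS);

	if (r->ports[port].queue_stopped || r->used == RING_SIZE)
		return false;

	r->used++;
	if (r->used == RING_SIZE)
		ring_set_stopped(r, true);	/* ring full: stop all ports */
	return true;
}

/* Tx completion: free one descriptor and let all ports transmit again. */
static void ring_complete(struct shared_ring *r)
{
	if (r->used == 0)
		return;
	r->used--;
	ring_set_stopped(r, false);
}
```

The point of the model is that a port which did not fill the ring itself is still stopped, so it cannot keep pushing frames into a full ring.
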
> 
> >   
> > > +		spin_unlock(&fep->hw_lock);
> > > +		return NETDEV_TX_BUSY;
> > > +	}
> > > +
> > > +	/* Clear all of the status flags */
> > > +	status &= ~BD_ENET_TX_STATS;
> > > +
> > > +	/* Set buffer length and buffer pointer */
> > > +	bufaddr = skb->data;
> > > +	bdp->cbd_datlen = skb->len;
> > > +
> > > +	/* On some FEC implementations data must be aligned on
> > > +	 * 4-byte boundaries. Use bounce buffers to copy data
> > > +	 * and get it aligned.
> > > +	 */
> > > +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {    
> > 
> > I think you should add 
> > 
> > 	if ... ||
> >            fep->quirks & FEC_QUIRK_SWAP_FRAME)
> > 
> > here. You can't modify skb->data without calling skb_cow_data()
> > but you already have buffers allocated so can as well use them.  
> 
> The vf610 doesn't need the frame to be swapped, but has requirements
> for alignment as well.
> 
> I would keep things as they are now - as they just improve
> readability.
> 
> Please keep in mind that this version only supports imx287, but the
> plan is to add vf610 as well (to be more specific - this driver also
> works on vf610, but I plan to add those patches after this one is
> accepted and pulled). 
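
For reference, the variant suggested in the review (use the bounce buffer when the data is misaligned or when the frame has to be swapped) can be sketched in userspace as below. ALIGN_MASK, swap_words() and prepare_tx() are hypothetical names, and the swap here only handles whole 32-bit words. The property it demonstrates is that the caller's buffer is never modified in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALIGN_MASK 0x3u	/* assumed: data must be 4-byte aligned */

/* Reverse the byte order of each whole 32-bit word in 'buf'. */
static void swap_words(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i + 4 <= len; i += 4) {
		uint8_t b0 = buf[i], b1 = buf[i + 1];

		buf[i] = buf[i + 3];
		buf[i + 1] = buf[i + 2];
		buf[i + 2] = b1;
		buf[i + 3] = b0;
	}
}

/* Return the buffer to hand to the DMA engine: the original data if it
 * is aligned and needs no swap, otherwise a (possibly swapped) copy in
 * the bounce buffer.
 */
static const uint8_t *prepare_tx(const uint8_t *data, size_t len,
				 uint8_t *bounce, size_t bounce_len,
				 bool swap)
{
	bool misaligned = ((uintptr_t)data & ALIGN_MASK) != 0;

	if (!misaligned && !swap)
		return data;

	assert(len <= bounce_len);
	memcpy(bounce, data, len);
	if (swap)
		swap_words(bounce, len);
	return bounce;
}
```

Folding the swap case into the bounce-buffer condition means the swap always operates on a private copy, which is the concern raised about modifying skb->data directly.
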
> 
> >   
> > > +		unsigned int index;
> > > +
> > > +		index = bdp - fep->tx_bd_base;
> > > +		memcpy(fep->tx_bounce[index],
> > > +		       (void *)skb->data, skb->len);    
> > 
> > this fits on one 80 char line BTW, quite easily:
> > 
> > 		memcpy(fep->tx_bounce[index], (void *)skb->data, skb->len);
> > 
> > Also the cast to void * is not necessary in C.  
> 
> +1
> 
> >   
> > > +		bufaddr = fep->tx_bounce[index];
> > > +	}
> > > +
> > > +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> > > +		swap_buffer(bufaddr, skb->len);
> > > +
> > > +	/* Save skb pointer. */
> > > +	fep->tx_skbuff[fep->skb_cur] = skb;
> > > +
> > > +	fep->skb_cur = (fep->skb_cur + 1) & TX_RING_MOD_MASK;    
> > 
> > Not sure if this is buggy, but maybe delay updating things until the
> > mapping succeeds? Fewer things to unwind.  
> 
> Yes, the skb storage as well as ring buffer modification can be done
> after dma mapping code.
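
A rough userspace sketch of that reordering is below: map first, and only store the skb pointer and advance the ring index once the mapping has succeeded, so the failure path has nothing to unwind and only bumps the dropped counter. The map_fn callback is a hypothetical stand-in for dma_map_single() plus dma_mapping_error(), and the struct is a cut-down model rather than the driver's private data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_RING_SIZE 8u			/* assumed power of two */
#define TX_RING_MOD_MASK (TX_RING_SIZE - 1u)

/* Cut-down, hypothetical model of the Tx bookkeeping state. */
struct tx_state {
	const void *tx_skbuff[TX_RING_SIZE];
	unsigned int skb_cur;
	unsigned long tx_dropped;
};

/* Stand-in for the DMA mapping step; returns false on mapping failure. */
typedef bool (*map_fn)(const void *buf, uint32_t *dma);

static bool map_ok(const void *buf, uint32_t *dma)
{
	*dma = (uint32_t)(uintptr_t)buf;	/* fake bus address */
	return true;
}

static bool map_fail(const void *buf, uint32_t *dma)
{
	(void)buf;
	(void)dma;
	return false;
}

/* Map first; touch the ring state only after the mapping succeeded. */
static bool tx_commit(struct tx_state *st, const void *skb, map_fn map,
		      uint32_t *dma)
{
	assert(st && map && dma);

	if (!map(skb, dma)) {
		st->tx_dropped++;	/* count as dropped only */
		return false;
	}

	st->tx_skbuff[st->skb_cur] = skb;
	st->skb_cur = (st->skb_cur + 1u) & TX_RING_MOD_MASK;
	return true;
}
```

After a failed mapping the slot and the index are unchanged, so the next frame reuses the same position.
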
> 
> >   
> > > +	/* Push the data cache so the CPM does not get stale memory
> > > +	 * data.
> > > +	 */
> > > +	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> > > +					  MTIP_SWITCH_TX_FRSIZE,
> > > +					  DMA_TO_DEVICE);
> > > +	if (unlikely(dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr))) {
> > > +		dev_err(&fep->pdev->dev,
> > > +			"Failed to map descriptor tx buffer\n");
> > > +		dev->stats.tx_errors++;
> > > +		dev->stats.tx_dropped++;    
> > 
> > dropped and errors are two different counters
> > I'd stick to dropped  
> 
> Ok.
> 
> >   
> > > +		dev_kfree_skb_any(skb);
> > > +		goto err;
> > > +	}
> > > +
> > > +	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
> > > +	 * it's the last BD of the frame, and to put the CRC on the end.
> > > +	 */
> > > +
> > > +	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
> > > +			| BD_ENET_TX_LAST | BD_ENET_TX_TC);    
> > 
> > The | goes at the end of the previous line, start of new line
> > adjusts to the opening brackets..
> >   
> 
> I've refactored it.
> 
> > > +
> > > +	/* Synchronize all descriptor writes */
> > > +	wmb();
> > > +	bdp->cbd_sc = status;
> > > +
> > > +	netif_trans_update(dev);    
> > 
> > Is this call necessary?  
> 
> I've added it when I was forward porting the old driver. It can be
> removed.
> 
> >   
> > > +	skb_tx_timestamp(skb);
> > > +
> > > +	/* Trigger transmission start */
> > > +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);
> > > +
> > > +	dev->stats.tx_bytes += skb->len;
> > > +	/* If this was the last BD in the ring,
> > > +	 * start at the beginning again.
> > > +	 */
> > > +	if (status & BD_ENET_TX_WRAP)
> > > +		bdp = fep->tx_bd_base;
> > > +	else
> > > +		bdp++;
> > > +
> > > +	if (bdp == fep->dirty_tx) {
> > > +		fep->tx_full = 1;
> > > +		netif_stop_queue(dev);
> > > +	}
> > > +
> > > +	fep->cur_tx = bdp;
> > > + err:
> > > +	spin_unlock(&fep->hw_lock);
> > > +
> > > +	return NETDEV_TX_OK;
> > > +}    
> 
> 
> Thanks for the feedback.
> 
> Best regards,
> 
> Lukasz Majewski
> 
> --
> 
> DENX Software Engineering GmbH, Managing Director: Johanna Denk,
> Tabea Lutz HRB 165235 Munich, Office: Kirchenstr.5, D-82194
> Groebenzell, Germany
> Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email:
> lukma@...x.de




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH, Managing Director: Johanna Denk,
Tabea Lutz HRB 165235 Munich, Office: Kirchenstr.5, D-82194
Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@...x.de
