linux-kernel - Re: [net-next v19 4/7] net: mtip: Add net_device

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250910231552.13a5d963@wsk>
Date: Wed, 10 Sep 2025 23:15:52 +0200
From: Łukasz Majewski <lukasz.majewski@...lbox.org>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Andrew Lunn <andrew+netdev@...n.ch>, davem@...emloft.net, Eric Dumazet
 <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Rob Herring
 <robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley
 <conor+dt@...nel.org>, Shawn Guo <shawnguo@...nel.org>, Sascha Hauer
 <s.hauer@...gutronix.de>, Pengutronix Kernel Team <kernel@...gutronix.de>,
 Fabio Estevam <festevam@...il.com>, Richard Cochran
 <richardcochran@...il.com>, netdev@...r.kernel.org,
 devicetree@...r.kernel.org, linux-kernel@...r.kernel.org,
 imx@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org, Stefan Wahren
 <wahrenst@....net>, Simon Horman <horms@...nel.org>
Subject: Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to
 the L2 switch driver

Hi Jakub,

> On Sun, 7 Sep 2025 18:38:54 +0200 Łukasz Majewski wrote:
> > > On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:    
> > > > +	/* Set buffer length and buffer pointer */
> > > > +	bufaddr = skb->data;      
> > > 
> > > You can't write (swap) skb->data if the skb is a clone..    
> > 
> > I do use skb = buld_skb() which, "builds" the SKB around the memory
> > page (from pool).
> > 
> > Then, I "pass" this data (and swap it) to upper layer of the network
> > stack.
> > 
> > The same approach is used in the fec_main.c driver:
> > https://elixir.bootlin.com/linux/v6.17-rc3/source/drivers/net/ethernet/freescale/fec_main.c#L1853
> >  
> 
> I probably cut out too much context. I think I was quoting from Tx,
> indeed on Rx this is not an issue.

Ok. No adjustments needed then. Good :)

> 
> > What I'm trying to do - is to model the HW which I do have...
> > 
> > When switch is enabled I do have ONE uDMA0 which works for both eth
> > ports (lan0 and lan1).
> > 
> > That is why I do have only one NAPI queue.
> >   
> > > What makes my head spin is that you seem to record which
> > > netdev/port was doing Rx _last_ and then pass that netdev to
> > > mtip_switch_tx(). Why?    
> > 
> > You may have port == 1 || port == 2 when you receive packet from
> > ingres ports.
> > You may also have port == 0xFF when you first time encounter the SA
> > on the port and port == 0 when you send/receive data from the "host"
> > interface.
> > 
> > When port 1/2 is "detected" then the net dev for this particular
> > port is used. In other cases the one for NAPI is used (which is one
> > of those two - please see comment above).
> > 
> > This was the approach from original NXP (Freescale) driver. It in
> > some way prevents from "starvation" from net devices when L2 switch
> > is disabled and I need to provide port separation.
> > 
> > (port separation in fact is achieved by programming L2 switch
> > registers and is realized in HW).  
> 
> But what if we have mixed traffic from port 1 and 2?
> Does the current code correctly Rx the packets from port 1 on the
> netdev from port 1 and packets from port 2 on the netdev from port 2?

Yes, it does.

In the mtip_rx_napi() you have call to mtip_switch_rx() which accepts
pointer to port variable.

It sets it according to the information provided by the switch IP block
HW and also adjust the skb's ndev.

I'm just wondering if the snippet from mtip_rx_napi():
-------8<--------
if ((port == 1 || port == 2) && fep->ndev[port - 1]
	mtip_switch_tx(fep->ndev[port - 1]);
else
	mtip_switch_tx(napi->dev);
------->8--------

could be replaced just with mtip_switch_tx(napi->dev);
as TX via napi->dev shall be forward to both ports if required.

I will check if this can be done in such a way.

> 
> > > Isn't the dev that we're completing Tx for is
> > > best read from skb->dev packet by packet?    
> > 
> > It may be worth to try.... I think that the code, which we do have
> > now, tries to reuse some kind of "locality".
> >   
> > > Also this wake up logic
> > > looks like it will wake up _one_ netdev's queue and then set
> > > tx_full = 0, so presumably it will not wake the other port if
> > > both ports queues were stopped. Why keep tx_full state in the
> > > first place? Just check if the queues is stopped..?    
> > 
> > As I said - we do have only ONE queue, which corresponds to uDMA0
> > when the switch is enabled. This single queue is responsible for
> > handling transmission for both ports (this is how the HW is
> > designed).  
> 
> Right but kernel has two SW queues.

You mean a separate SW queues for each devices? This is not supported
in the MTIP L2 switch driver. Maybe such high level SW queues
management is available in the upper layers?

> Which can be independently
> stopped.

It supports separate RX and TX HW queues (i.e. ring buffers for
descriptors) for the uDMA0 when switch is enabled.

When you want to send data (no matter from which lan[01] device) the
same mtip_start_xmit() is called, the HW TX descriptor is setup and is
passed via uDMA0 to L2 switch IP block.

For next TX transmission (even from different port) we assign another
descriptor from the ring buffer.

> So my concerns is that for example port 1 is very busy so
> the queue is full of packets for port 1, port 1's netdev's queue gets
> stopped. Then port 2 tries to Tx, queue is shared, and is full, so
> netdev 2's SW queue is also stopped. Then we complete the packets,
> because packets were for port 1 we wake the queue for port 1. But
> port 2 also got stopped, even tho it never put a packet on the ring..
> 

As fair as I can tell - both ports call mtip_start_xmit(), their data
is serialized to the TX queue (via descriptors).

Queued descriptors are sent always at some point (or overridden if
critical error encountered).

> Long story short I think you need to always stop and start queues from
> both netdevs.. There's just 2 so not too bad of a hack.

Maybe there are some peculiarities in for example bridge code (or upper
network stack layers in general), but I think, that I don't need any
extra "queue" management for TX code of MTIP L2 switch.

-- 
Best regards,

Łukasz Majewski