[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250610010216.7331ebb5@wsk>
Date: Tue, 10 Jun 2025 01:02:16 +0200
From: Lukasz Majewski <lukma@...x.de>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Andrew Lunn <andrew+netdev@...n.ch>, davem@...emloft.net, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Rob Herring
<robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley
<conor+dt@...nel.org>, Shawn Guo <shawnguo@...nel.org>, Sascha Hauer
<s.hauer@...gutronix.de>, Pengutronix Kernel Team <kernel@...gutronix.de>,
Fabio Estevam <festevam@...il.com>, Richard Cochran
<richardcochran@...il.com>, netdev@...r.kernel.org,
devicetree@...r.kernel.org, linux-kernel@...r.kernel.org,
imx@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org, Stefan Wahren
<wahrenst@....net>, Simon Horman <horms@...nel.org>, Andrew Lunn
<andrew@...n.ch>
Subject: Re: [net-next v12 4/7] net: mtip: The L2 switch driver for imx287
Hi Paolo,
> On 5/28/25 12:53 PM, Lukasz Majewski wrote:
> >> On 5/22/25 9:54 AM, Lukasz Majewski wrote:
> >>> +/* dynamicms MAC address table learn and migration */
> >>> +static void mtip_aging_timer(struct timer_list *t)
> >>> +{
> >>> + struct switch_enet_private *fep = from_timer(fep, t,
> >>> timer_aging); +
> >>> + fep->curr_time = mtip_timeincrement(fep->curr_time);
> >>> +
> >>> + mod_timer(&fep->timer_aging,
> >>> + jiffies +
> >>> msecs_to_jiffies(LEARNING_AGING_INTERVAL)); +}
> >>
> >> It's unclear to me why you need to maintain a timer just to update
> >> a timestamp?!?
> >>
> >
> > This timestamp is afterwards used in:
> > mtip_atable_dynamicms_learn_migration(), which in turn manages the
> > entries in switch "dynamic" table (it is one of the fields in the
> > record.
> >
> >> (jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
> >> AT_DENTRY_TIMESTAMP_WIDTH) - 1)
> >>
> >
> > If I understood you correctly - I shall remove the timer and then
> > just use the above line (based on jiffies) when
> > mtip_atable_dynamicms_learn_migration() is called (and it requires
> > the timestamp)?
> >
> > Otherwise the mtip_timeincrement() seems like a nice wrapper on
> > incrementing the timestamp.
>
> Scheduling a timer to obtain a value you can have for free is not a
> good resource usage strategy. Note that is a pending question/check
> above: verify that the suggested expression yield the expected value
> in all the possible use-case.
This is a bit more tricky than just getting value from jiffies.
The current code provides a monotonic, starting from 0 time "base" for
learning and managing entries in internal routing tables for MTIP.
To be more specific - the fep->curr_time is a value incremented after
each ~10ms.
Simple masking of jiffies would not provide such features.
However, I've rewritten relevant portions where GENMASK() could be used
to simplify and make the code more readable.
>
> >>> + if (!fep->link[0] && !fep->link[1]) {
> >>> + /* Link is down or autonegotiation is in
> >>> progress. */
> >>> + netif_stop_queue(dev);
> >>> + spin_unlock_irqrestore(&fep->hw_lock, flags);
> >>> + return NETDEV_TX_BUSY;
> >>
> >> Intead you should probably stop the queue when such events happen
> >
> > Please correct me if I'm wrong - the netif_stop_queue(dev); is
> > called before return. Shall something different be also done?
>
> The xmit routine should assume the link is up and the tx ring has
> enough free slot to enqueue a packet.
In the case of MTIP driver, there is a circular buffer of 16 "sets" of
descriptors (allocated as coherent) and corresponding buffer
(dma_map_single at start).
The size of each "buffer" is set to 2048B to accommodate at least single
packet.
> After enqueueing it should
> check for enough space availble for the next xmit and stop the queue,
> likely using the netif_txq_maybe_stop() helper.
The problem with not using the netif_txq_maybe_stop() is that I'm not
using the "txq" (netdev_queue).
With the current code it looks like netif_stop_queue() is the most
suitable one from the network API.
>
> Documentation/networking/driver.rst
>
> >>> + }
> >>> +
> >>> + /* Clear all of the status flags */
> >>> + status &= ~BD_ENET_TX_STATS;
> >>> +
> >>> + /* Set buffer length and buffer pointer */
> >>> + bufaddr = skb->data;
> >>> + bdp->cbd_datlen = skb->len;
> >>> +
> >>> + /* On some FEC implementations data must be aligned on
> >>> + * 4-byte boundaries. Use bounce buffers to copy data
> >>> + * and get it aligned.
> >>> + */
> >>> + if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
> >>> + unsigned int index;
> >>> +
> >>> + index = bdp - fep->tx_bd_base;
> >>> + memcpy(fep->tx_bounce[index],
> >>> + (void *)skb->data, skb->len);
> >>> + bufaddr = fep->tx_bounce[index];
> >>> + }
> >>> +
> >>> + if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> >>> + swap_buffer(bufaddr, skb->len);
> >>
> >> Ouch, the above will kill performances.
> >
> > This unfortunately must be done in such a way (the same approach is
> > present on fec_main.c) as the IP block is implemented in such a way
> > (explicit conversion from big endian to little endian).
> >
> >> Also it looks like it will
> >> access uninitialized memory if skb->len is not 4 bytes aligned.
> >>
> >
> > There is a few lines above a special code to prevent from such a
> > situation ((unsigned long)bufaddr & MTIP_ALIGNMENT).
>
> The problem here is not with memory buffer alignment, but with the
> packet length, that can be not a multiple of 4. In such a case the
> last swap will do an out-of-bound read touching uninitialized data.
On the init function the size of allocation for each buffer is set to
be 2048 bytes, so there is no such a thread.
>
> >>> + bdp->cbd_sc = status;
> >>> +
> >>> + netif_trans_update(dev);
> >>> + skb_tx_timestamp(skb);
> >>> +
> >>> + /* For port separation - force sending via specified port
> >>> */
> >>> + if (!fep->br_offload && port != 0)
> >>> + mtip_forced_forward(fep, port, 1);
> >>> +
> >>> + /* Trigger transmission start */
> >>> + writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);
> >>>
> >>
> >> Possibly you should check skb->xmit_more to avoid ringing the
> >> doorbell when not needed.
> >
> > I couldn't find skb->xmit_more in the current sources. Instead,
> > there is netdev_xmit_more().
>
> Yeah, I referred to the old code, sorry.
>
> > However, the TX code just is supposed to setup one frame
> > transmission and hence there is no risk that we trigger "empty"
> > transmission.
>
> The point is that doorbell ringing is usually very expensive (slow)
> for the H/W. And is not needed when netdev_xmit_more() is true,
> because the another xmit operation will follow. If you care about
> performances you should leverage such info.
I do have an impression, that this is very important for network
devices having many queues with separate priorities.
In my case - I do have a single uDMA0 port with a single RX and TX
circular buffer (16 packets can be "queued").
>
> >
> >>> + /* First, grab all of the stats for the incoming packet.
> >>> + * These get messed up if we get called due to a busy
> >>> condition.
> >>> + */
> >>> + bdp = fep->cur_rx;
> >>> +
> >>> + while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> >>> + if (pkt_received >= budget)
> >>> + break;
> >>> +
> >>> + pkt_received++;
> >>> + /* Since we have allocated space to hold a
> >>> complete frame,
> >>> + * the last indicator should be set.
> >>> + */
> >>> + if ((status & BD_ENET_RX_LAST) == 0)
> >>> + dev_warn_ratelimited(&dev->dev,
> >>> + "SWITCH ENET: rcv is
> >>> not +last\n"); +
> >>> + if (!fep->usage_count)
> >>> + goto rx_processing_done;
> >>> +
> >>> + /* Check for errors. */
> >>> + if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH |
> >>> BD_ENET_RX_NO |
> >>> + BD_ENET_RX_CR | BD_ENET_RX_OV)) {
> >>> + dev->stats.rx_errors++;
> >>> + if (status & (BD_ENET_RX_LG |
> >>> BD_ENET_RX_SH)) {
> >>> + /* Frame too long or too short.
> >>> */
> >>> + dev->stats.rx_length_errors++;
> >>> + }
> >>> + if (status & BD_ENET_RX_NO) /*
> >>> Frame alignment */
> >>> + dev->stats.rx_frame_errors++;
> >>> + if (status & BD_ENET_RX_CR) /* CRC
> >>> Error */
> >>> + dev->stats.rx_crc_errors++;
> >>> + if (status & BD_ENET_RX_OV) /*
> >>> FIFO overrun */
> >>> + dev->stats.rx_fifo_errors++;
> >>> + }
> >>> +
> >>> + /* Report late collisions as a frame error.
> >>> + * On this error, the BD is closed, but we don't
> >>> know what we
> >>> + * have in the buffer. So, just drop this frame
> >>> on the floor.
> >>> + */
> >>> + if (status & BD_ENET_RX_CL) {
> >>> + dev->stats.rx_errors++;
> >>> + dev->stats.rx_frame_errors++;
> >>> + goto rx_processing_done;
> >>> + }
> >>> +
> >>> + /* Process the incoming frame */
> >>> + pkt_len = bdp->cbd_datlen;
> >>> + data = (__u8 *)__va(bdp->cbd_bufaddr);
> >>> +
> >>> + dma_unmap_single(&fep->pdev->dev,
> >>> bdp->cbd_bufaddr,
> >>> + bdp->cbd_datlen,
> >>> DMA_FROM_DEVICE);
> >>
> >> I have read your explaination WRT unmap/map. Actually you don't
> >> need to do any mapping here,
> >
> > There are 16 cbd_t descriptors allocated (as dma_alloc_coherent).
> > Those descriptors contain pointer to data (being read in this
> > case).
>
> I'm referring to the actual packet payload, that is the buffer at
> bdp-cbd_bufaddr with len bdp->cbd_datlen; I'm not discussing the
> descriptors contents.
+1
>
> > Hence the need to perform dma_map_single() for each descriptor,
>
> You are not unmapping the descriptor, you are unmapping the packet
> payload.
+1
>
> >> since you are unconditionally copying the
> >> whole buffer (why???)
> >
> > Only the value of
> > pkt_len = bdp->cbd_datlen; is copied to SKB (after byte
> > swap_buffer()).
>
> The relevant line is:
>
> skb_copy_to_linear_data(skb, data, pkt_len);
>
> AFAICS that copies whole packet contents, which is usually quite
> sub-optimal from performance PoV.
>
fec_main.c just assigns:
data·=·skb->data;
so I would prefer to keep the:
skb_copy_to_linear_data(skb, data, pkt_len);
> >> and re-using it.
> >>
> >> Still you need a dma_sync_single() to ensure the CPUs see the
> >> correct data.
> >
> > The descriptors - i.e. struct cbd_t fields are allocated with
> > dma_alloc_coherent(), so this is OK.
>
> I'm talking about packets contents, not packet descriptors. Please
> re-read the above and have a look at other drivers code.
The usage of dma_sync_single_for_cpu() works without issues in the
mtip_switch_rx().
>
> An additional point that I missed in the previous review is that the
> rx allocation schema is quite uncorrect. At ring initialization time
> you allocate full skbs, while what you need and use is just raw
> buffers for the packet payload. Instead you could/should use the page
> pool:
>
> Documentation/networking/page_pool.rst
>
Yes, for RX packets payload the page of 2048 bytes is allocated. By
using dma page pool - I can state the same maximal size, but the usage
of memory can be much more flexible.
> That will also help doing the right thing WRT DMA handling.
>
The dma_sync_single_for_cpu() shall work correctly with the current
approach as well.
> >> This patch is really too big, I'm pretty sure I missed some
> >> relevant issues. You should split it in multiple ones: i.e.
> >> initialization and h/w access, rx/tx, others ndos.
> >
> > It is quite hard to "scatter" this patch as:
> >
> > 1. I've already split it to several files (which correspond to
> > different "logical" entities - like mtipl2sw_br.c).
> > 2. The mtipl2sw.c file is the smallest part of the "core" of the
> > driver.
> > 3. If I split it, then at some point I would break bisectability for
> > imx28.
>
> Note that each patch don't need to provide complete functionality.
> i.e. patch 1 could implement ndo_open()/close and related helper,
> leaving ndo_start_xmit() and napi_poll empty and avoid allocating the
> rx buffers. patch 2 could implement the rx patch, patch 3 the tx path.
>
Yes, this seems to be a good idea... I will implement such approach.
> The only constraint is that each patch will build successufully, which
> is usually easy to achieve.
+1
>
> A 2K lines patches will probably lead to many more iterations and
> unhappy (or no) reviewers.
The problem is that all the patches "around" this driver (like *yaml,
bindings, defconfig) would get outdated very fast if not pull to
mainline.
In such a way that already done work would need to be redo...
>
> /P
>
Best regards,
Lukasz Majewski
--
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@...x.de
Content of type "application/pgp-signature" skipped
Powered by blists - more mailing lists