linux-kernel - RE: [PATCH 1/1] can: m_can: Control tx flow to avoid message stuck

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <MW4PR11MB5870A2300CA4571C1C66B317F7192@MW4PR11MB5870.namprd11.prod.outlook.com>
Date: Wed, 15 Jan 2025 13:04:43 +0000
From: "Mohan, Subramanian" <subramanian.mohan@...el.com>
To: Markus Schneider-Pargmann <msp@...libre.com>
CC: "rcsekar@...sung.com" <rcsekar@...sung.com>, "davem@...emloft.net"
	<davem@...emloft.net>, "edumazet@...gle.com" <edumazet@...gle.com>,
	"kuba@...nel.org" <kuba@...nel.org>, "pabeni@...hat.com" <pabeni@...hat.com>,
	"balbi@...nel.org" <balbi@...nel.org>, "Tan, Raymond"
	<raymond.tan@...el.com>, "jarkko.nikula@...ux.intel.com"
	<jarkko.nikula@...ux.intel.com>, "linux-can@...r.kernel.org"
	<linux-can@...r.kernel.org>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux@...tq-group.com"
	<linux@...tq-group.com>, "lst@...gutronix.de" <lst@...gutronix.de>, "Hahn,
 Matthias" <matthias.hahn@...el.com>, "Chinnadurai, Srinivasan"
	<srinivasan.chinnadurai@...el.com>, "Mohan, Subramanian"
	<subramanian.mohan@...el.com>
Subject: RE: [PATCH 1/1] can: m_can: Control tx flow to avoid message stuck

Hi @Markus Schneider-Pargmann,

> -----Original Message-----
> From: Markus Schneider-Pargmann <mailto:msp@...libre.com>
> Sent: Thursday, January 9, 2025 9:14 PM
> To: Mohan, Subramanian <mailto:subramanian.mohan@...el.com>
> Cc: mailto:rcsekar@...sung.com; mailto:davem@...emloft.net; mailto:edumazet@...gle.com;
> mailto:kuba@...nel.org; mailto:pabeni@...hat.com; mailto:balbi@...nel.org; Tan, Raymond
> <mailto:raymond.tan@...el.com>; mailto:jarkko.nikula@...ux.intel.com; linux-
> mailto:can@...r.kernel.org; mailto:netdev@...r.kernel.org; mailto:linux-kernel@...r.kernel.org;
> mailto:linux@...tq-group.com; mailto:lst@...gutronix.de; Hahn, Matthias
> <mailto:matthias.hahn@...el.com>; Chinnadurai, Srinivasan
> <mailto:srinivasan.chinnadurai@...el.com>
> Subject: Re: [PATCH 1/1] can: m_can: Control tx flow to avoid message stuck
> 
> Hi,
> 
> On Wed, Jan 08, 2025 at 02:31:12PM +0530,
> mailto:subramanian.mohan@...el.com wrote:
> > From: Subramanian Mohan <mailto:subramanian.mohan@...el.com>
> >
> > The prolonged testing of passing can messages between two Elkhartlake
> > platforms resulted in message stuck i.e Message did not receive at
> > receiver side
> 
> Can you please describe the reason for the stuck messages in your commit
> message? I am reading this but I don't understand why this happens or why
> your proposed solution helps.

Let me describe problem bit more:
We are using 2 different Python Scripts(client and server) on both of the Elkhart lake connected systems. 
The "server" script sends out messages with Arbitration ID's, and then waits for a response. If the Arbitration ID is different than the 
one expected or no message arrives it logs an error.
The "client" script listens for messages, and depending on the Arbitration ID received it sends a message with a specific Arbitration ID back.
We have deployed both the scripts in 2 different systems and triggered the testing
If any message is lost/stuck then the "server" - Script will log an error.
The Message stuck corresponds over here, whenever the server sends out message and waits for reply, we wont me getting the reply message 
On the server side. Even though time slice increase in scripts did not help. On further debugging enabling TX/TEFN impacts the processing load.
To overcome this we disabled the TX/TEFN interrupt once processed and enable it back in the TX start xmit function.

> 
> >
> > Contolling TX i.e TEFN bit helped to resolve the message stuck issue.
> >
> > The current solution is enhanced/optimized from the below patch:
> > https://lore.kernel.org/lkml/20230623051124.64132-1-kumari.pallavi@int
> > el.com/T/
> >
> > Setup used to reproduce the issue:
> >
> > +---------------------+         +----------------------+
> > |Intel ElkhartLake    |         |Intel ElkhartLake     |
> > |       +--------+    |         |       +--------+     |
> > |       |m_can 0 |    |<=======>|       |m_can 0 |     |
> > |       +--------+    |         |       +--------+     |
> > +---------------------+         +----------------------+
> >
> > Steps to be run on the two Elkhartlake HW:
> > 1)Bus-Rate is 1 MBit/s
> > 2)Busload during the test is about 40% 3)we initialize the CAN with
> > following commands 4)ip link set can0 txqueuelen 100/1024/2048 5)ip
> > link set can0 up type can bitrate 1000000
> >
> > Python scripts are used send and receive the can messages between the
> > EHL systems.
> >
> > Signed-off-by: Hahn Matthias <mailto:matthias.hahn@...el.com>
> > Signed-off-by: Subramanian Mohan <mailto:subramanian.mohan@...el.com>
> > ---
> >  drivers/net/can/m_can/m_can.c | 11 +++++++++--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/can/m_can/m_can.c
> > b/drivers/net/can/m_can/m_can.c index 97cd8bbf2e32..0a2c9a622842
> > 100644
> > --- a/drivers/net/can/m_can/m_can.c
> > +++ b/drivers/net/can/m_can/m_can.c
> > @@ -1220,7 +1220,7 @@ static void m_can_coalescing_update(struct
> > m_can_classdev *cdev, u32 ir)  static int
> > m_can_interrupt_handler(struct m_can_classdev *cdev)  {
> >      struct net_device *dev = cdev->net;
> > -   u32 ir = 0, ir_read;
> > +   u32 ir = 0, ir_read, new_interrupts;
> >      int ret;
> >
> >      if (pm_runtime_suspended(cdev->dev)) @@ -1283,6 +1283,9 @@
> static
> > int m_can_interrupt_handler(struct m_can_classdev *cdev)
> >                      ret = m_can_echo_tx_event(dev);
> >                      if (ret != 0)
> >                              return ret;
> > +
> > +                   new_interrupts = cdev->active_interrupts &
> ~(IR_TEFN);
> > +                   m_can_interrupt_enable(cdev, new_interrupts);
> 
> Here is a theoretical situation of two messages being sent. The first is being
> sent and handled in this interrupt handler. Then it would disable the TEFN bit
> right? If the second message wasn't done sending yet, how would it ever call
> the interrupt handler if the interrupt is disabled?

With this patch we are controlling only TEFN/TX interrupt bit, rest of the interrupts remains unaffected. 
Since We are enabling/disabling TEFN bit only, interrupt handler will be called normally with other interrupts.

> 
> Also you are disabling this interrupt here regardless of the type of mcan device
> and also regardless of the coalescing state. In the transmit part you are only
> enabling it for non-peripheral devices. For peripheral mcan devices this would
> also introduce an additional two transfers per transmit.

TEFN bit enabling/disabling applies only to non-peripheral device.
While disabling the TEFN bit in interrupt handler, we will add the check for non-peripheral device before disabling it(V2).
On the coalescing state, The snapshot is already taken while entering the interrupt handler.

> 
> In which situations is this really necessary? Does it help to implement
> coalescing for non-peripheral devices?
This helps in heavy load/traffic conditions 
Not exactly sure on the coalescing part.

Thanks,
Subbu

> 
> Best
> Markus
> 
> >              }
> >      }
> >
> > @@ -1989,6 +1992,7 @@ static netdev_tx_t m_can_start_xmit(struct
> sk_buff *skb,
> >      struct m_can_classdev *cdev = netdev_priv(dev);
> >      unsigned int frame_len;
> >      netdev_tx_t ret;
> > +   u32 new_interrupts;
> >
> >      if (can_dev_dropped_skb(dev, skb))
> >              return NETDEV_TX_OK;
> > @@ -2008,8 +2012,11 @@ static netdev_tx_t m_can_start_xmit(struct
> > sk_buff *skb,
> >
> >      if (cdev->is_peripheral)
> >              ret = m_can_start_peripheral_xmit(cdev, skb);
> > -   else
> > +   else {
> > +           new_interrupts = cdev->active_interrupts | IR_TEFN;
> > +           m_can_interrupt_enable(cdev, new_interrupts);
> >              ret = m_can_tx_handler(cdev, skb);
> > +   }
> >
> >      if (ret != NETDEV_TX_OK)
> >              netdev_completed_queue(dev, 1, frame_len);
> > --
> > 2.35.3
> >