Message-ID: <0596fce8-223b-494e-907e-f13d75f211cd@microchip.com>
Date: Mon, 25 Mar 2024 13:24:13 +0000
From: <Parthiban.Veerasooran@...rochip.com>
To: <benjamin@...ler.one>
CC: <netdev@...r.kernel.org>, <devicetree@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <linux-doc@...r.kernel.org>,
	<Horatiu.Vultur@...rochip.com>, <Woojung.Huh@...rochip.com>,
	<Nicolas.Ferre@...rochip.com>, <UNGLinuxDriver@...rochip.com>,
	<Thorsten.Kummermehr@...rochip.com>, <davem@...emloft.net>,
	<edumazet@...gle.com>, <kuba@...nel.org>, <pabeni@...hat.com>,
	<robh+dt@...nel.org>, <krzysztof.kozlowski+dt@...aro.org>,
	<conor+dt@...nel.org>, <corbet@....net>, <Steen.Hegelund@...rochip.com>,
	<rdunlap@...radead.org>, <horms@...nel.org>, <casper.casan@...il.com>,
	<andrew@...n.ch>
Subject: Re: [PATCH net-next v2 0/9] Add support for OPEN Alliance 10BASE-T1x
 MACPHY Serial Interface

Hi Benjamin Bigler,

Thank you for your testing and feedback. They are really helpful in
bringing the driver into good shape. We really appreciate your efforts on this.

On 24/03/24 5:25 pm, Benjamin Bigler wrote:
> Hi Parthiban
> 
> I hope I send this in the right context as it is not related to just one patch or
> some specific code.
> 
> I conducted UDP load testing using three i.MX8MM boards in conjunction with the
> LAN8651. The setup involved one board functioning as a server, which is just
> echoing back received data, while the remaining two boards acted as clients,
> sending UDP packets of different sizes in various bursts to the server.
> Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might
> have influenced the results.
> 
> During the tests I experienced some issues:
> 
> - The boards just start receiving after first sending something (ping another board).
>    Some measurements showed that the irq stays asserted after init. This makes sense
>    as far as I understand the chapter 7.7 of the specification, the irq is deasserted
>    on reception of the first data header following CSn being asserted. As a workaround
>    I trigger the thread at the end of oa_tc6_init.
It looks like the IRQ is asserted on reset completion and the MAC-PHY
expects a data chunk from the host to deassert it. I test the driver on
an RPI 4 using iperf3 and for some reason never faced this issue. Maybe
some packet is transmitted when the network device is registered, which
delivers a data chunk so that the IRQ is deasserted. Thanks for the
workaround; I think that is the right solution for this issue. Adding
the lines below at the end of oa_tc6_init() will trigger
oa_tc6_spi_thread_handler() to perform an empty data chunk transfer,
which will deassert the IRQ before the actual data transfer starts.

/* oa_tc6_sw_reset_macphy() resets the MAC-PHY and clears the reset
 * complete status. The IRQ is also asserted on reset completion and
 * remains asserted until the MAC-PHY receives a data chunk, so performing
 * an empty data chunk transmission deasserts the IRQ. Refer to sections
 * 7.7 and 9.2.8.8 of the OPEN Alliance specification for more details.
 */
tc6->int_flag = true;
wake_up_interruptible(&tc6->spi_wq);
> 
> - If there is a lot of traffic, the receive buffer overflow error spams the log.
> 
> - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb.
>    Mostly because more data to rx_skb is added than allocated and sometimes because
>    rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging
>    with a logic analyzer showed that the chip does not behave correctly. There are more
>    bytes between start_valid and end_valid than there should be. Also there
>    seem to be 2 end_valid without a start_valid between. What is common is that the incorrect
>    frame starts in a chunk where end_valid and start_valid are set.
>    In my opinion it's a problem in the chip (maybe related to the errata in the next point)
>    but the driver should be resilient and just drop the packet and not cause a kernel panic.
I usually run into this "receive buffer overflow" issue when I run the
RPI 4 with the default CPU governor setting, which is "ondemand". In
this case, even if I set the SPI clock speed to 15 MHz, the RPI 4 core
clock is scaled down when it is idle, which results in only about half
of the configured SPI clock speed, around 5.9 MHz. So systems like the
RPI 4 need performance mode enabled to get the proper SPI clock speed.
Refer to the link below for more details.

https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750

I enable performance mode using the command below.

echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null

So please verify the SPI clock speed using a logic analyzer to make sure
you get the maximum throughput without receive buffer overflow.
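
The arithmetic behind this can be sketched as follows. This is only a
rough upper bound under stated assumptions: a 64-byte chunk payload with
4 bytes of header/footer overhead per chunk, a 10 Mbit/s line rate, and
no allowance for chip-select gaps or host latency. The function names
are hypothetical and not part of the driver.

```c
#include <assert.h>

/* Assumptions for illustration: 64-byte payload per data chunk plus
 * 4 bytes of header/footer overhead, and a 10 Mbit/s line rate.
 */
#define CHUNK_PAYLOAD_BYTES  64u
#define CHUNK_OVERHEAD_BYTES 4u
#define LINE_RATE_KBPS       10000u

/* Upper bound on payload throughput in kbit/s for an SPI clock given in
 * kHz. Chip-select gaps and host scheduling latency are ignored, so the
 * real figure is lower.
 */
unsigned int spi_payload_kbps(unsigned int spi_khz)
{
	assert(spi_khz > 0);
	return (unsigned int)((unsigned long long)spi_khz * CHUNK_PAYLOAD_BYTES /
			      (CHUNK_PAYLOAD_BYTES + CHUNK_OVERHEAD_BYTES));
}

/* Non-zero if the SPI bus can, in the best case, drain a saturated link. */
int spi_keeps_up(unsigned int spi_khz)
{
	return spi_payload_kbps(spi_khz) >= LINE_RATE_KBPS;
}
```

Under these assumptions spi_payload_kbps(15000) is roughly 14.1 Mbit/s,
which leaves headroom over the line rate, while spi_payload_kbps(5900) is
roughly 5.5 Mbit/s, so a saturated link cannot be drained fast enough.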

Of course, I agree that the driver should not crash in case of a receive
buffer overflow. From your investigation, I understand that the buffers
in the MAC-PHY are being overwritten again and again because the host is
too slow to read the data from the MAC-PHY buffers through SPI, which
alters the descriptors. There might be two reasons why we run into this
situation.
1. The host is busy doing something else and delays initiating the SPI
    transfer even though the SPI clock speed is 15 MHz.
2. The SPI clock speed is less than 15 MHz.
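
One way the receive path could stay robust when the chunk footers are
inconsistent is sketched below as a user-space model. It is not the
driver's code: struct rx_chunk, struct rx_state, MAX_FRAME and all
function names are hypothetical, and the footer is assumed to be already
decoded into start/end flags and byte offsets. The idea is simply to
check every append against the buffer size and to drop the frame on an
end without a start, a start without an end, or an overrun.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_PAYLOAD 64u
#define MAX_FRAME     1522u	/* hypothetical receive buffer capacity */

struct rx_chunk {		/* hypothetical decoded footer + payload */
	uint8_t data[CHUNK_PAYLOAD];
	bool start_valid, end_valid;
	uint8_t start_off;	/* first frame byte if start_valid */
	uint8_t end_off;	/* last frame byte if end_valid */
};

struct rx_state {
	uint8_t buf[MAX_FRAME];
	size_t len, last_len;
	bool in_frame;
	unsigned int delivered, dropped;
};

/* Discard the frame in progress and count the error. */
static void rx_abort(struct rx_state *s)
{
	s->in_frame = false;
	s->len = 0;
	s->dropped++;
}

/* Hand a complete frame to the stack (here: just record its length). */
static void rx_deliver(struct rx_state *s)
{
	assert(s->len <= MAX_FRAME);
	s->last_len = s->len;
	s->delivered++;
	s->in_frame = false;
	s->len = 0;
}

/* Append n bytes; drop the frame instead of overrunning the buffer or
 * writing when no frame has been started.
 */
static bool rx_append(struct rx_state *s, const uint8_t *p, size_t n)
{
	if (!s->in_frame || n > MAX_FRAME - s->len) {
		rx_abort(s);
		return false;
	}
	memcpy(s->buf + s->len, p, n);
	s->len += n;
	return true;
}

void rx_process_chunk(struct rx_state *s, const struct rx_chunk *c)
{
	/* True when the chunk carries the tail of an earlier frame. */
	bool end_first = c->end_valid &&
			 (!c->start_valid || c->end_off < c->start_off);

	if ((c->start_valid && c->start_off >= CHUNK_PAYLOAD) ||
	    (c->end_valid && c->end_off >= CHUNK_PAYLOAD)) {
		rx_abort(s);	/* offsets outside the payload */
		return;
	}
	if (!c->start_valid && !c->end_valid) {
		if (s->in_frame)	/* plain continuation chunk */
			rx_append(s, c->data, CHUNK_PAYLOAD);
		return;
	}
	if (end_first && rx_append(s, c->data, c->end_off + 1u))
		rx_deliver(s);
	if (c->start_valid) {
		size_t last = (c->end_valid && !end_first) ?
			      c->end_off : CHUNK_PAYLOAD - 1;

		if (s->in_frame)	/* start without a preceding end */
			rx_abort(s);
		s->in_frame = true;
		if (rx_append(s, c->data + c->start_off,
			      last - c->start_off + 1) &&
		    c->end_valid && !end_first)
			rx_deliver(s);
	}
}
```

With this structure a malformed chunk sequence only increments the
dropped counter; it can never write past buf or touch a frame that was
not started.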

I use the iperf3 setup below for my testing and have never faced the
driver crash, even though I did face the "receive buffer overflow" error
when running the RPI 4 in the default "ondemand" mode.

Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY
  $ iperf3 -s
Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick
  $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0

and vice versa.

I never faced the "receive buffer overflow" error when running the RPI 4
with "performance" mode enabled, even with all the cores stressed using
the command below,

$ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null &

Can you share more details about your test setup and the applications
you use, so that I can try to reproduce the issue in my setup and debug
the driver?
> 
> - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0)
>    and also exst is not active. I found out that there is an errata (DS80001075) point s3
>    that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above.
>    The driver now works since about 2.5 weeks with various load with just one loss of frame
>    error where I had to reboot the system after about 4 days.
It is good to hear that the driver works fine with the above changes. As
mentioned in the errata, this continuous interrupt issue is a known
issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve it
without any workaround. Setting the ZARFE bit in CONFIG0 will solve the
continuous interrupt issue, but I don't know how it also solves the
above "receive buffer overflow" issue. I think it is a good idea to test
once with LAN8651 Rev.B1 without setting the ZARFE bit. It would be
interesting to see the result. I always use LAN8651 Rev.B1 for my
testing.
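
For reference, the ZARFE workaround amounts to a read-modify-write of
CONFIG0. The sketch below is a user-space model only. It assumes CONFIG0
sits at address 0x0004 and ZARFE is bit 12, which should be checked
against the OPEN Alliance register map; reg_read(), reg_write() and
fake_regs are hypothetical stand-ins for the SPI control transactions,
not the driver's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_CONFIG0    0x0004u		/* assumed CONFIG0 address */
#define CONFIG0_ZARFE  (1u << 12)	/* assumed ZARFE bit position */
#define NUM_FAKE_REGS  16u

/* Hypothetical register file standing in for the MAC-PHY. */
static uint32_t fake_regs[NUM_FAKE_REGS];

int reg_read(uint16_t addr, uint32_t *val)
{
	assert(val != NULL);
	if (addr >= NUM_FAKE_REGS)
		return -1;
	*val = fake_regs[addr];
	return 0;
}

int reg_write(uint16_t addr, uint32_t val)
{
	if (addr >= NUM_FAKE_REGS)
		return -1;
	fake_regs[addr] = val;
	return 0;
}

/* Set ZARFE while leaving every other CONFIG0 bit unchanged. */
int config0_set_zarfe(void)
{
	uint32_t val;
	int ret = reg_read(REG_CONFIG0, &val);

	if (ret)
		return ret;
	return reg_write(REG_CONFIG0, val | CONFIG0_ZARFE);
}
```

Using read-modify-write rather than writing a fixed value keeps any
other CONFIG0 settings that were programmed earlier intact.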

I would like to be able to reproduce the "receive buffer overflow" issue
and the resulting kernel crash in my setup with LAN8651 Rev.B1 so that I
can investigate further. As I am not able to reproduce it on my RPI 4, I
need your help with the tests and applications you used in your setup.
> 
> Is there a reason why you removed the netdev watchdog which was active in v2?
When the timeout occurs, no further action is taken except incrementing
tx_errors. Such a timeout is not expected except with USB-to-Ethernet
adapters, which can be removed unexpectedly. But this is an SPI
interface, which will not be removed unexpectedly as it is a platform
device. That's why we removed it.

Best regards,
Parthiban V
> 
> Thanks,
> Benjamin Bigler
> 
