netdev - Questions about porting stmmac to a HI3535 SoC

Message-ID: <00674cf0-92d9-7395-a3ae-c4a63060a106@edman007.com>
Date:   Tue, 19 Feb 2019 20:58:16 -0500
From:   Ed Martin <lists@...an007.com>
To:     netdev@...r.kernel.org
Subject: Questions about porting stmmac to a HI3535 SoC

Hi,

    I hope this is the right place to ask. This is my first time doing
real kernel development for something useful, and this message is long
because I've spent a lot of time on the problem. I am attempting to make
the stmmac driver work on a HiSilicon HI3535 SoC (a SoC targeted at
network video recorder applications, ARM Cortex-A9 based). I found a
kernel on GitHub that boots and whose stmmac driver works just fine, but
it's a 3.4 kernel (link below). I've ported forward what I could, but
that kernel's stmmac driver includes support for TCP offload and thus
contains quite a bit of extra code, so for stmmac I have instead been
adding support for the SoC to the current driver. I did manage to find
the datasheet (in Chinese) for this chip, and nothing sticks out as
different. Using it, I added the clocks and device tree entries, and the
driver mostly loads. The hardware appears to be dwmac1000/dwmac-3.610
(User ID: 0x10, Synopsys ID: 0x36), and according to the other kernel it
also includes a "CreVinn TOE-NK-2G TCP Offload Engine". Most of the
porting so far has been setting up the clocks (which I think/hope I did
right). Also of note, this device has two GMACs on one controller (and
they are not auto-detected correctly).

The kernel that I know works:
https://github.com/uyhoangtran/linux-kernel-3.4-hi3535

Now for my actual problem. I am testing by attempting to netboot with
NFS over TCP. Right now the interface comes up, sends out DHCP requests
and gets configured, and then kind of works. By that I mean it sends out
some packets, but not all of the ones it should be sending actually go
out. It mounts my server, and on my NFS server I see many TCP packets
exchanged with it, and then it abruptly stops while my server keeps
retransmitting, trying to get it back. Eventually I get the following
error:

[  244.050983] ------------[ cut here ]------------
[  244.063088] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x234/0x238
[  244.084632] NETDEV WATCHDOG: eth0 (stmmaceth): transmit queue 0 timed out
[  244.102332] CPU: 0 PID: 0 Comm: swapper Tainted: G        W         4.19.0_hi3535-00055-g6218d4e6de03-dirty #455
[  244.128833] Hardware name: Generic DT based system

<snip the backtrace>

My debugging efforts have shown that adding a pr_warn() anywhere within
stmmac_xmit() mostly solves the problem (it doesn't matter where in
that function; the first line and the last line give the same result). I
thought this indicated some sort of race, but placing memory barriers
all over that function changes nothing. I've also found that the stall
seems to happen when netdev_tx_sent_queue() is called and decides that
the tx queue should be stopped. After that the tx queue is apparently
never restarted, and I don't know why. It also appears that the next
time stmmac_tx_clean() gets called it doesn't find all the bytes that
the previous stmmac_xmit() sent (it is usually one to three packets
short). I am basically out of ideas, other than switching to the latest
5.0 git branch, but I don't see anything there that looks like it would
fix this (I went through every stmmac commit between 4.19 and 5.0 and
saw no major changes). I suppose I'll try it next.
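
To make the stop/restart behaviour described above concrete, here is a
minimal userspace sketch of byte-limit accounting. It is a simplified
model, not the kernel's dynamic queue limit code: the struct, the
function names and the fixed limit are all hypothetical. It only
assumes that the queue is stopped while more bytes are in flight than
the limit allows, and that it is woken from the completion path once
enough bytes have been reported back.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of byte queue limit accounting.
 * NOT the kernel implementation; all names here are hypothetical. */
struct bql_model {
	unsigned int queued;     /* total bytes reported at transmit time */
	unsigned int completed;  /* total bytes reported at TX-clean time */
	unsigned int limit;      /* bytes allowed in flight (fixed here) */
	bool stopped;            /* models the stopped state of the queue */
};

/* Remaining budget; a negative value means "over the limit". */
static long bql_avail(const struct bql_model *q)
{
	return (long)q->limit + (long)q->completed - (long)q->queued;
}

/* Transmit path: account for the bytes, stop the queue if over budget. */
static void bql_sent(struct bql_model *q, unsigned int bytes)
{
	q->queued += bytes;
	if (bql_avail(q) < 0)
		q->stopped = true;
}

/* Completion path: the only place in this model that restarts the queue. */
static void bql_completed(struct bql_model *q, unsigned int bytes)
{
	/* Completing more than is in flight would be an accounting bug. */
	assert(bytes <= q->queued - q->completed);
	q->completed += bytes;
	if (bql_avail(q) >= 0)
		q->stopped = false;
}

/* Send n_sent packets, report n_done of them as completed, and return
 * whether the queue is still stopped afterwards. */
bool stays_stopped(unsigned int limit, unsigned int pkt_len,
		   unsigned int n_sent, unsigned int n_done)
{
	struct bql_model q = { .limit = limit };
	unsigned int i;

	for (i = 0; i < n_sent; i++)
		bql_sent(&q, pkt_len);
	bql_completed(&q, n_done * pkt_len);
	return q.stopped;
}
```

In this model, stays_stopped(1000, 1500, 3, 2) leaves the queue stopped
while stays_stopped(1000, 1500, 3, 3) restarts it. If the same holds for
the real driver, a cleanup pass that comes up a packet short is harmless
only as long as a later completion event reports the remaining bytes; if
no such event ever arrives, nothing restarts the queue and the watchdog
eventually fires.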

So my two leading theories:

#1 Some sort of race with DMA transfers. But DMA memory barriers already
exist before all the important operations, and the driver already works
on other systems, so I assume that part is OK. Also, the old working
driver didn't make major changes with respect to these barriers (and I
tried the changes it did make).

#2 Some sort of issue with how the netdev_* functions work. My
investigation showed the queue is stopped because the BQL queue runs
negative, and there is a CONFIG_BQL option around all that code. But if
that were the cause, I'd expect other drivers to have the same problem,
and I can find nothing on that issue. I can't find where CONFIG_BQL
gets enabled, so I assume it's required.
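
For theory #1, the ordering that such barriers are meant to guarantee
can be sketched in userspace C. This is only an analogy, not the stmmac
code: the descriptor layout, the DESC_OWN flag and the function names
are hypothetical, and C11 fences stand in for the kernel's DMA barriers
(such as dma_wmb()). The idea is that the buffer fields must be visible
before ownership is handed to the device, and that the driver must not
reclaim a descriptor while the device still owns it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_OWN (1u << 31)	/* hypothetical "owned by DMA engine" flag */

/* Hypothetical TX descriptor; not the real dwmac layout. */
struct tx_desc {
	uint32_t buf_addr;
	uint32_t buf_len;
	_Atomic uint32_t status;
};

/* Driver side: fill in the descriptor, then hand it to the device. */
void desc_publish(struct tx_desc *d, uint32_t addr, uint32_t len)
{
	d->buf_addr = addr;
	d->buf_len = len;
	/* The fields above must be visible before the OWN bit is set. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&d->status, DESC_OWN, memory_order_relaxed);
}

/* Stand-in for the hardware finishing the transfer: clear OWN. */
void desc_hw_complete(struct tx_desc *d)
{
	atomic_store_explicit(&d->status, 0, memory_order_release);
}

/* Cleanup side: reclaim only descriptors the device has released.
 * Returns false (and leaves *len_out untouched) if still owned. */
bool desc_reclaim(struct tx_desc *d, uint32_t *len_out)
{
	uint32_t st = atomic_load_explicit(&d->status, memory_order_relaxed);

	if (st & DESC_OWN)
		return false;	/* still in flight; try again later */
	/* Read the rest of the descriptor only after seeing OWN cleared. */
	atomic_thread_fence(memory_order_acquire);
	*len_out = d->buf_len;
	return true;
}
```

A cleanup loop built on desc_reclaim() stops at the first descriptor
that is still owned by the device, so it can legitimately account for
fewer bytes than were queued. Those bytes are then only picked up if the
cleanup runs again after the device releases the descriptor.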

Does anyone have any idea how I can debug this issue? I feel like
there is something obvious I'm missing. I can absolutely share
everything I have if someone wants to look through the changes I made;
I just haven't gotten around to hosting them anywhere yet. Is there
something different about SoCs that I need to do?