netdev - Re: net: dsa: realtek: silent indirect reg read failures on SMP

Message-ID: <878rukib4f.fsf@bang-olufsen.dk>
Date:   Wed, 9 Feb 2022 10:32:17 +0000
From:   Alvin Šipraga <ALSI@...g-olufsen.dk>
To:     Luiz Angelo Daros de Luca <luizluca@...il.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Linus Walleij <linus.walleij@...aro.org>,
        Arınç ÜNAL <arinc.unal@...nc9.com>
Subject: Re: net: dsa: realtek: silent indirect reg read failures on SMP

Hi Luiz,

Thanks very much for the bug report!

Luiz Angelo Daros de Luca <luizluca@...il.com> writes:

> Hello,
>
> Arinç reported an issue with interfaces going down and up at a fixed
> interval (a little less than 40s). It happened with both SMI and MDIO
> interfaces for two different rtl8365mb supported models.

Is this using the Generic PHY driver or the Realtek RTL8365MB-VC PHY
driver? If it's the latter then I would be interested to know whether
the interrupt counters are going up too. Maybe the switch really has
something to say?

I think the Generic PHY driver is based on polling of PHY registers,
which seems more likely to trigger the issue you describe.

In any case, I think I can reproduce your issue. Using an SMI setup with
just one cable attached to port 2, I did the following:

1. Enable ftrace event logging for regmap and mdio
2. Poll BMSR PHY register for my connected port; it should always read
   the same (0x79ed).
3. Wait for step 2 to read out something else.

So I used this command:

    while true; do phytool read swp2/2/0x01; sleep 0.1; done

After a couple of seconds I already read out 0x0000. And after checking
the ftrace log I can see that this is because of some interleaved
register access:

     kworker/3:4-70      [003] .......  1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
         phytool-16816   [002] .......  1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
     kworker/3:4-70      [003] .......  1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
         phytool-16816   [002] .......  1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
     kworker/3:4-70      [003] .......  1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
         phytool-16816   [002] .......  1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
     kworker/3:4-70      [003] .......  1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
         phytool-16816   [002] .......  1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
     kworker/3:4-70      [003] .......  1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
         phytool-16816   [002] .......  1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
     kworker/3:4-70      [003] .......  1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
         phytool-16816   [002] .......  1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
         phytool-16816   [002] .......  1927.142641: mdio_access: SMI-0 read  phy:0x02 reg:0x01 val:0x0000
     kworker/3:4-70      [003] .......  1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
     kworker/3:4-70      [003] .......  1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
     kworker/3:4-70      [003] .......  1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
     kworker/3:4-70      [003] .......  1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
     kworker/3:4-70      [003] .......  1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
     kworker/3:4-70      [003] .......  1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6

The kworker here is polling MIB counters for stats, as evidenced by the
register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG).
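
To make the interleaving in the trace above easier to reason about, here is a
minimal userspace sketch in C. It is a toy model, not the real chip: the
struct and every toy_* name are hypothetical, and the rule that an unrelated
register access wipes the latched result of a pending indirect read is an
assumption made purely for illustration, not a confirmed property of the
hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the switch (hypothetical, for illustration only). */
struct toy_switch {
	uint16_t phy_value;   /* what the PHY register really holds */
	uint16_t read_data;   /* models the indirect access read data register */
	bool     cmd_pending; /* read command issued, result not yet fetched */
};

/* Models the write to the indirect access control register. */
static void toy_issue_read(struct toy_switch *sw)
{
	sw->read_data = sw->phy_value;
	sw->cmd_pending = true;
}

/*
 * Any unrelated register access, e.g. the MIB counter poll.
 * ASSUMPTION: in this model it clobbers the latched result of a pending
 * indirect read. The real failure mechanism may well be different.
 */
static void toy_foreign_access(struct toy_switch *sw)
{
	if (sw->cmd_pending)
		sw->read_data = 0x0000;
}

/* Models the final read of the indirect access read data register. */
static uint16_t toy_fetch(struct toy_switch *sw)
{
	assert(sw->cmd_pending); /* fetching without a command is a caller bug */
	sw->cmd_pending = false;
	return sw->read_data;
}

/* Whole sequence; 'interleave' injects a foreign access in the middle. */
static uint16_t toy_phy_read(struct toy_switch *sw, bool interleave)
{
	toy_issue_read(sw);
	if (interleave)
		toy_foreign_access(sw);
	return toy_fetch(sw);
}
```

With phy_value set to 0x79ed, toy_phy_read(&sw, false) gives back 0x79ed,
while toy_phy_read(&sw, true) gives back 0x0000. Each individual access is
atomic in this model, yet the sequence as a whole is not, and that is the
point of the sketch.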

Sometimes I can even read out completely junk values if I tighten my
phytool loop:

    # while true; do phytool read swp2/2/0x01; done | uniq
    0x79ed
    0000
    0x79ed
    0000
    0x79ed
    0000
    0x79ed
    0x0014 <-- ?!

This is just a preliminary investigation - I only checked your mail
today - but I will follow up on this next week and try to make a
fix. In the meantime could you and Arinç perhaps do some similar
tracing and confirm that in your case it is a similar issue? You needn't
use phytool if the issue reproduces itself every 40 seconds - in that
case just enable ftrace events and inspect the register access sequence
when a PHY read gives an unexpected value.

Let's please restrict the investigation to registers which we expect to
give well-defined values though - unused or otherwise inactive registers
may very well give us junk, but it is harder to argue that this is a
bug. Anyway, this seems pretty easy to reproduce, so I don't think you
need to go down that path :-)

Overall we probably need some stricter locking but I want to spend some
time on this before just sending a big-fat-lock solution to the
list. I didn't try your patch yet due to other items in my inbox today.
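
For reference, the coarse "one big lock" idea could look roughly like the
following userspace sketch in C. It is only an illustration under stated
assumptions: struct locked_switch and the locked_* functions are hypothetical
names, a pthread mutex stands in for a kernel mutex, and the clobbering rule
in the MIB path is an assumed toy model rather than known chip behaviour.
This is not a proposed patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the switch state; not driver code. */
struct locked_switch {
	pthread_mutex_t lock;  /* serialises whole sequences, not single accesses */
	uint16_t phy_value;    /* what the PHY register really holds */
	uint16_t read_data;    /* latched result of the indirect read */
	bool     cmd_pending;  /* an indirect read is in flight */
	uint32_t mib_counter;
};

static void locked_switch_init(struct locked_switch *sw, uint16_t phy_value)
{
	pthread_mutex_init(&sw->lock, NULL);
	sw->phy_value = phy_value;
	sw->read_data = 0;
	sw->cmd_pending = false;
	sw->mib_counter = 0;
}

/* Indirect PHY read: command and fetch happen under a single lock hold. */
static uint16_t locked_phy_read(struct locked_switch *sw)
{
	uint16_t val;

	pthread_mutex_lock(&sw->lock);
	assert(!sw->cmd_pending);      /* no other sequence can be in flight */
	sw->read_data = sw->phy_value; /* issue the read command */
	sw->cmd_pending = true;
	/* ... a real driver would poll a busy bit here ... */
	val = sw->read_data;           /* fetch the result */
	sw->cmd_pending = false;
	pthread_mutex_unlock(&sw->lock);

	return val;
}

/* MIB poll: takes the same lock, so it can never land mid-sequence. */
static uint32_t locked_mib_read(struct locked_switch *sw)
{
	uint32_t val;

	pthread_mutex_lock(&sw->lock);
	if (sw->cmd_pending)           /* ASSUMED clobbering; unreachable here */
		sw->read_data = 0x0000;
	val = ++sw->mib_counter;
	pthread_mutex_unlock(&sw->lock);

	return val;
}
```

The design point is that every path touching the registers takes the same
lock for the full duration of its sequence. A lock around the PHY read alone
would not stop other register users from slipping in between its steps.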

Kind regards,
Alvin

>
> The indirect read procedure is something like this:
>
> rtl8365mb_phy_read
>   rtl8365mb_phy_ocp_read
>     rtl8365mb_phy_poll_busy
>       regmap_read_poll_timeout(RTL8365MB_INDIRECT_ACCESS_STATUS_REG)
>     rtl8365mb_phy_ocp_prepare
>       regmap_update_bits(RTL8365MB_GPHY_OCP_MSB_0_REG)
>       regmap_write(RTL8365MB_INDIRECT_ACCESS_ADDRESS_REG)
>     regmap_write(RTL8365MB_INDIRECT_ACCESS_CTRL_REG)
>     rtl8365mb_phy_poll_busy
>       regmap_read_poll_timeout(RTL8365MB_INDIRECT_ACCESS_STATUS_REG)
>     regmap_read(RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG)
>
> And the write is similar. The regmap read/write does have its own
> mutex. realtek-mdio does use a sequence of reads/writes to mdio for
> each read/write op, but beyond regmap's internal locks, I also lock
> bus->mdio_lock.
>
> In a non-SMP system or with extra cores disabled, everything works as
> expected. However, with an SMP system, there is some kind of
> concurrent access to registers and
> regmap_read(RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG) will eventually
> read 0x0000. It is as if the chip didn't have time to update that
> register, or the value got lost along the way. It only happens when
> the system is idle. If you stress the CPU, the simultaneous register
> access gets unsynchronized or serialized.
>
> I can "simulate" a similar issue in a non-SMP system by dumping regmap
> from non-used registers like 0x20a0 to 0x20ff (0x20a0 is related to
> port 5 state, but I have 0,1,2,3,4 + 6,7). There are other regions
> that cause the same behavior but [0x20a0,0x20ff] is the first one I
> found. It never failed while reading good memory regions. Maybe it is
> just caused by another issue like the chip getting confused while
> reading undefined memory regions.
>
> for a in $(seq 1000); do
>     dd if=/sys/kernel/debug/regmap/mdio-bus:1d/registers bs=11 \
>        skip=$((0x20a0)) count=$((0x1)) 2>/dev/null
> done
>
> (That range might not be precise as I don't know if dd is enough to
> select only a specific range)
>
> While the script is running, I can see both the driver and the debugfs
> dump flickering between good data, garbage and zero. When DSA reads
> 0x0000, it brings the port down and up.
>
> I tried to add a simple mutex around the rtl8365mb_phy_ocp_read call
> but it still failed to prevent the event on an SMP system (it would
> not protect the debugfs dump). Maybe I'm missing something obvious.
>
> I also wonder if all those functions that use a sequence of
> regmap_{read,write} might need a mutex to protect from parallel
> execution. I didn't find a way to create a "regmap transaction" by
> locking it externally as its lock control is private as well as the
> non-locked _regmap_read(). I could disable its lock and do it inside
> the driver, but it would also disable the debugfs dump and add a bunch
> of new code. And I don't even know if it would fix the issue.
> Is there a better way to deal with this issue? Maybe force dsa polling
> to use a single processor?
>
> This is the failed patch.
>
> diff --git a/drivers/net/dsa/realtek/rtl8365mb.c b/drivers/net/dsa/realtek/rtl8365mb.c
> index 1a0562811c1a..5572271a2514 100644
> --- a/drivers/net/dsa/realtek/rtl8365mb.c
> +++ b/drivers/net/dsa/realtek/rtl8365mb.c
> @@ -817,6 +817,7 @@ struct rtl8365mb_port {
>  * @port_mask: mask of all ports
>  * @learn_limit_max: maximum number of L2 addresses the chip can learn
>  * @mib_lock: prevent concurrent reads of MIB counters
> + * @indirect_reg_lock: prevent concurrent access of indirect registers
>  * @ports: per-port data
>  * @jam_table: chip-specific initialization jam table
>  * @jam_size: size of the chip's jam table
> @@ -832,6 +833,7 @@ struct rtl8365mb {
>        u32 port_mask;
>        u32 learn_limit_max;
>        struct mutex mib_lock;
> +       struct mutex indirect_reg_lock;
>        struct rtl8365mb_port ports[RTL8365MB_MAX_NUM_PORTS];
>        const struct rtl8365mb_jam_tbl_entry *jam_table;
>        size_t jam_size;
> @@ -989,6 +991,7 @@ static int rtl8365mb_phy_ocp_write(struct realtek_priv *priv, int phy,
>
> static int rtl8365mb_phy_read(struct realtek_priv *priv, int phy, int regnum)
> {
> +       struct rtl8365mb *mb = priv->chip_data;
>        u32 ocp_addr;
>        u16 val;
>        int ret;
> @@ -1001,16 +1004,24 @@ static int rtl8365mb_phy_read(struct realtek_priv *priv, int phy, int regnum)
>
>        ocp_addr = RTL8365MB_PHY_OCP_ADDR_PHYREG_BASE + regnum * 2;
>
> -       ret = rtl8365mb_phy_ocp_read(priv, phy, ocp_addr, &val);
> +       ret = mutex_lock_interruptible(&mb->indirect_reg_lock);
>        if (ret) {
>                dev_err(priv->dev,
> -                       "failed to read PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       "read PHY%d reg %02x @ %04x, ret %d interrupted\n", phy,
>                        regnum, ocp_addr, ret);
>                return ret;
>        }
>
> -       dev_dbg(priv->dev, "read PHY%d register 0x%02x @ %04x, val <- %04x\n",
> -               phy, regnum, ocp_addr, val);
> +       ret = rtl8365mb_phy_ocp_read(priv, phy, ocp_addr, &val);
> +       if (ret)
> +               dev_err(priv->dev,
> +                       "failed to read PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       regnum, ocp_addr, ret);
> +       else
> +               dev_dbg(priv->dev, "read PHY%d register 0x%02x @ %04x, val <- %04x\n",
> +                       phy, regnum, ocp_addr, val);
> +
> +       mutex_unlock(&mb->indirect_reg_lock);
>
>        return val;
> }
> @@ -1018,6 +1029,7 @@ static int rtl8365mb_phy_read(struct realtek_priv *priv, int phy, int regnum)
> static int rtl8365mb_phy_write(struct realtek_priv *priv, int phy, int regnum,
>                               u16 val)
> {
> +       struct rtl8365mb *mb = priv->chip_data;
>        u32 ocp_addr;
>        int ret;
>
> @@ -1029,18 +1041,26 @@ static int rtl8365mb_phy_write(struct realtek_priv *priv, int phy, int regnum,
>
>        ocp_addr = RTL8365MB_PHY_OCP_ADDR_PHYREG_BASE + regnum * 2;
>
> -       ret = rtl8365mb_phy_ocp_write(priv, phy, ocp_addr, val);
> +       ret = mutex_lock_interruptible(&mb->indirect_reg_lock);
>        if (ret) {
>                dev_err(priv->dev,
> -                       "failed to write PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       "write PHY%d reg %02x @ %04x, ret %d interrupted\n", phy,
>                        regnum, ocp_addr, ret);
>                return ret;
>        }
>
> -       dev_dbg(priv->dev, "write PHY%d register 0x%02x @ %04x, val -> %04x\n",
> -               phy, regnum, ocp_addr, val);
> +       ret = rtl8365mb_phy_ocp_write(priv, phy, ocp_addr, val);
> +       if (ret)
> +               dev_err(priv->dev,
> +                       "failed to write PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       regnum, ocp_addr, ret);
> +       else
> +               dev_dbg(priv->dev, "write PHY%d register 0x%02x @ %04x, val -> %04x\n",
> +                       phy, regnum, ocp_addr, val);
>
> -       return 0;
> +       mutex_unlock(&mb->indirect_reg_lock);
> +
> +       return ret;
> }
>
> static int rtl8365mb_dsa_phy_read(struct dsa_switch *ds, int phy, int regnum)
> @@ -2215,6 +2235,8 @@ static int rtl8365mb_setup(struct dsa_switch *ds)
>
>        mb = priv->chip_data;
>
> +       mutex_init(&mb->indirect_reg_lock);
> +
>        ret = rtl8365mb_reset_chip(priv);
>        if (ret) {
>                dev_err(priv->dev, "failed to reset chip: %d\n", ret);
> --
> 2.34.1
>
>
> ---
>    Luiz