Message-ID: <bef1e78d-4c02-68d5-ca1a-380f47ee2647@gmail.com>
Date: Tue, 12 Feb 2019 21:11:55 +0100
From: Heiner Kallweit <hkallweit1@...il.com>
To: Russell King - ARM Linux admin <linux@...linux.org.uk>
Cc: Andrew Lunn <andrew@...n.ch>,
John David Anglin <dave.anglin@...l.net>,
Vivien Didelot <vivien.didelot@...oirfairelinux.com>,
Florian Fainelli <f.fainelli@...il.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are
handled prior to exit
On 12.02.2019 17:30, Russell King - ARM Linux admin wrote:
> On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
>> On 12.02.2019 04:58, Andrew Lunn wrote:
>>> That change means we don't check the PHY device if it caused an
>>> interrupt when its state is less than UP.
>>>
>>> What i'm seeing is that the PHY is interrupting pretty early on after
>>> a reboot when the previous boot had the interface up.
>>>
>> So this means that when going down for reboot the interrupts are not
>> properly masked / disabled? Because (at least for net-next) we enable
>> interrupts in phy_start() only.
>
> Looking at Linus' tree as opposed to net-next, things do look rather
> broken wrt interrupts:
>
> +-phy_attach_direct
>   `-phydev->state = PHY_READY
> +-phy_prepare_link
> +-phy_start_machine
>   `-phy_trigger_machine()
> `-phy_start_interrupts
>   +-request_threaded_irq()
>   `-phy_enable_interrupts()
>     +-phy_clear_interrupt()
>     `-phy_config_interrupt(, PHY_INTERRUPT_ENABLED)
>
> At this point, the PHY is then able to generate interrupts, which,
> because phy_start() has not been called and phy_interrupt() checks
> that phydev->state >= PHY_UP, get ignored by the interrupt handler
> exactly as Andrew is finding.
>
> So it looks like 5.0-rc is already in need of this being fixed.
>
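To make the failure mode concrete, here is a minimal user-space sketch of
the sequence above. It is only a model under stated assumptions, not phylib
code: struct model_phy, the MODEL_* states and all model_*() helpers are
hypothetical names, and the only property taken from the real code is that
the handler ignores interrupts while the state is below "up".

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the phylib states; only the ordering matters. */
enum model_state { MODEL_DOWN, MODEL_READY, MODEL_HALTED, MODEL_UP };

struct model_phy {
	enum model_state state;
	bool irq_enabled;	/* PHY may assert its interrupt line */
	bool irq_pending;	/* event latched in the PHY, not yet acked */
};

/* Connect path from the call tree: state READY, interrupts switched on. */
static void model_connect(struct model_phy *phy)
{
	phy->state = MODEL_READY;
	phy->irq_pending = false;	/* models phy_clear_interrupt() */
	phy->irq_enabled = true;	/* models enabling PHY interrupts */
}

/* Models phy_start(): only now does the state reach "up". */
static void model_start(struct model_phy *phy)
{
	phy->state = MODEL_UP;
}

/* The PHY latches an event (e.g. link change) if interrupts are enabled. */
static void model_raise_irq(struct model_phy *phy)
{
	if (phy->irq_enabled)
		phy->irq_pending = true;
}

/* Handler with the state check; returns true if the event was handled. */
static bool model_interrupt(struct model_phy *phy)
{
	assert(phy->irq_enabled);

	if (phy->state < MODEL_UP)
		return false;		/* ignored, irq_pending stays set */

	phy->irq_pending = false;	/* acknowledge the event */
	return true;
}
```

In this model an event raised between model_connect() and model_start()
is never acknowledged, which matches the symptom Andrew describes.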
> In looking at this, I came across this chunk of code:
>
> static inline bool __phy_is_started(struct phy_device *phydev)
> {
> 	WARN_ON(!mutex_is_locked(&phydev->lock));
>
> 	return phydev->state >= PHY_UP;
> }
>
> /**
>  * phy_is_started - Convenience function to check whether PHY is started
>  * @phydev: The phy_device struct
>  */
> static inline bool phy_is_started(struct phy_device *phydev)
> {
> 	bool started;
>
> 	mutex_lock(&phydev->lock);
> 	started = __phy_is_started(phydev);
> 	mutex_unlock(&phydev->lock);
>
> 	return started;
> }
>
> which looks to me like over-complication. The mutex locking there is
> completely pointless - what are you trying to achieve with it?
>
Even though this code is new, it follows a long-standing phylib convention
that every access (read or write) to phydev->state is protected by this
lock. I once wondered whether that is actually needed, but so far haven't
spent the effort to challenge it. It seems the time has now come.
> Let's go through this. The above is exactly equivalent to:
>
> bool phy_is_started(struct phy_device *phydev)
> {
> 	int state;
>
> 	mutex_lock(&phydev->lock);
> 	state = phydev->state;
> 	mutex_unlock(&phydev->lock);
>
> 	return state >= PHY_UP;
> }
>
> since when we do the test is irrelevant. Architectures that Linux
> runs on are single-copy atomic, which means that reading phydev->state
> itself is an atomic operation. So, the mutex locking around that
> doesn't add to the atomicity of the entire operation.
>
> Now, whether the entire operation is safe depends on what the caller
> then does with the result of this function. For example, let's take
> this code at the end of phy_state_machine():
>
> 	if (phy_polling_mode(phydev) && phy_is_started(phydev))
> 		phy_queue_state_machine(phydev, PHY_STATE_TIME);
>
>     state = PHY_UP
>     thread 0                    thread 1
>                                 phy_disconnect()
>                                 +-phy_is_started()
>     phy_is_started()            |
>                                 `-phy_stop()
>                                   +-phydev->state = PHY_HALTED
>                                   `-phy_stop_machine()
>                                     `-cancel_delayed_work_sync()
>     phy_queue_state_machine()
>     `-mod_delayed_work()
>
Thanks for describing this scenario, I'll have a closer look at it.
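To check my understanding, the interleaving from the diagram can be replayed
step by step in a small user-space model. This is only a sketch: the
"threads" are replayed sequentially in the order shown, the lock is a plain
flag, and struct model_phy, the MODEL_* states, the model_*() helpers and
replay_race() are hypothetical names that do not exist in phylib.

```c
#include <assert.h>
#include <stdbool.h>

/* Two states are enough for the replay; only the ordering matters. */
enum model_state { MODEL_HALTED, MODEL_UP };

struct model_phy {
	enum model_state state;
	bool locked;		/* stand-in for phydev->lock */
	bool work_queued;	/* stand-in for the delayed work being pending */
};

static void model_lock(struct model_phy *phy)
{
	assert(!phy->locked);
	phy->locked = true;
}

static void model_unlock(struct model_phy *phy)
{
	assert(phy->locked);
	phy->locked = false;
}

/* Same shape as phy_is_started(): lock, read, unlock. */
static bool model_is_started(struct model_phy *phy)
{
	bool started;

	model_lock(phy);
	started = phy->state >= MODEL_UP;
	model_unlock(phy);

	return started;
}

/* Stand-in for phy_stop() followed by phy_stop_machine(). */
static void model_stop(struct model_phy *phy)
{
	model_lock(phy);
	phy->state = MODEL_HALTED;
	model_unlock(phy);
	phy->work_queued = false;	/* models cancel_delayed_work_sync() */
}

/* Replays the diagram; returns true if work is queued on a halted PHY. */
static bool replay_race(void)
{
	struct model_phy phy = { .state = MODEL_UP };
	bool t0_started, t1_started;

	t1_started = model_is_started(&phy);	/* thread 1: phy_disconnect() */
	t0_started = model_is_started(&phy);	/* thread 0: still sees "up" */
	if (t1_started)
		model_stop(&phy);		/* thread 1: halt and cancel */
	if (t0_started)
		phy.work_queued = true;		/* thread 0: stale requeue */

	return phy.state == MODEL_HALTED && phy.work_queued;
}
```

Every individual read in the replay happens under the lock, yet the work
still ends up queued after the stop, because the lock is dropped between
the check and the action that depends on it.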
> At this point, the phydev->state_queue work has been added back onto
> the system workqueue even though phy_stop_machine() has already run
> cancel_delayed_work_sync() on it.
>
> The original code in 4.20 did not have this race condition.
>
> Basically, the lock inside phy_is_started() does nothing useful, and
> I'd say is dangerously misleading.
>
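One possible direction, sketched here in a user-space model and not meant
as a patch: make the check and the requeue a single critical section under
the same lock that the stop path takes before changing the state. All names
below (struct model_phy, MODEL_*, model_*()) are hypothetical, the lock is a
plain flag, and the model assumes the stop path changes the state under the
lock and cancels the work afterwards.

```c
#include <assert.h>
#include <stdbool.h>

enum model_state { MODEL_HALTED, MODEL_UP };

struct model_phy {
	enum model_state state;
	bool locked;		/* stand-in for phydev->lock */
	bool work_queued;	/* stand-in for the delayed work being pending */
};

static void model_lock(struct model_phy *phy)
{
	assert(!phy->locked);
	phy->locked = true;
}

static void model_unlock(struct model_phy *phy)
{
	assert(phy->locked);
	phy->locked = false;
}

/* Check and requeue under one lock hold; returns true if work was queued. */
static bool model_requeue_if_started(struct model_phy *phy)
{
	bool queued = false;

	model_lock(phy);
	if (phy->state >= MODEL_UP) {
		phy->work_queued = true;	/* models mod_delayed_work() */
		queued = true;
	}
	model_unlock(phy);

	return queued;
}

/* Stop path: state change under the lock, then cancel the work. */
static void model_stop(struct model_phy *phy)
{
	model_lock(phy);
	phy->state = MODEL_HALTED;
	model_unlock(phy);
	phy->work_queued = false;	/* models cancel_delayed_work_sync() */
}
```

With this shape only two orders remain. If the requeue runs first, the
later cancel removes the work; if the stop runs first, the requeue sees the
halted state and queues nothing. In both cases no work is left pending once
the stop has completed.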
Heiner