[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c990858e-5473-570a-221e-d2b3eea603a3@axentia.se>
Date: Thu, 13 Jun 2019 07:52:15 +0000
From: Peter Rosin <peda@...ntia.se>
To: Kishon Vijay Abraham I <kishon@...com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Roger Quadros <rogerq@...com>
Subject: Re: [PATCH v4 4/5] phy: ti: Add a new SERDES driver for TI's AM654x
SoC
Hi,
On 2019-06-13 06:57, Kishon Vijay Abraham I wrote:
> Hi Peter,
>
> On 13/06/19 4:20 AM, Peter Rosin wrote:
>> Hi!
>>
>> [I know this has already been merged upstream, but I only just
>> now noticed the code and went to the archives to find the
>> originating mail. I hope I managed to set in-reply-to correctly...]
>>
>> The mux handling is problematic and does not follow the rules.
>> It needs to be fixed, or you may face deadlocks. See below.
>>
>> On 2019-04-05 11:08, Kishon Vijay Abraham I wrote:
>>> Add a new SERDES driver for TI's AM654x SoC which configures
>>> the SERDES only for PCIe. Support fo USB3 will be added later.
>>>
>>> SERDES in am654x has three input clocks (left input, externel reference
>>> clock and right input) and two output clocks (left output and right
>>> output) in addition to a PLL mux clock which the SERDES uses for Clock
>>> Multiplier Unit (CMU refclock).
>>>
>>> The PLL mux clock can select from one of the three input clocks.
>>> The right output can select between left input and external reference
>>> clock while the left output can select between the right input and
>>> external reference clock.
>>>
>>> The driver has support to select PLL mux and left/right output mux as
>>> specified in device tree.
>>>
>>> [rogerq@...com: Fix boot lockup caused by accessing a structure member
>>> (hw->init) allocated in stack of probe() and accessed in get_parent]
>>> [rogerq@...com: Fix "Failed to find the parent" warnings]
>>> Signed-off-by: Roger Quadros <rogerq@...com>
>>> Signed-off-by: Kishon Vijay Abraham I <kishon@...com>
*snip*
>>> +static void serdes_am654_release(struct phy *x)
>>> +{
>>> + struct serdes_am654 *phy = phy_get_drvdata(x);
>>> +
>>> + phy->type = PHY_NONE;
>>> + phy->busy = false;
>>> + mux_control_deselect(phy->control);
>>
>> Here you unconditionally deselect the mux, and that seems
>> dangerous. Are you *sure* that ->release may not be called
>> without a successful xlate call?
>
> Yeah, without a successful xlate(), the consumer will never get a reference to
> the PHY and the ->release() is invoked only from phy_put() which needs a
> reference to the PHY.
Yes, I thought it might be ok, but good that you can confirm it.
>> I'm not 100% sure of that, but I have not looked at the phy
>> code before today, so it may very well be the case that this
>> is safe...
>>
>>> +}
>>> +
>>> +struct phy *serdes_am654_xlate(struct device *dev, struct of_phandle_args
>>> + *args)
>>> +{
>>> + struct serdes_am654 *am654_phy;
>>> + struct phy *phy;
>>> + int ret;
>>> +
>>> + phy = of_phy_simple_xlate(dev, args);
>>> + if (IS_ERR(phy))
>>> + return phy;
>>> +
>>> + am654_phy = phy_get_drvdata(phy);
>>> + if (am654_phy->busy)
>>> + return ERR_PTR(-EBUSY);
>>> +
>>> + ret = mux_control_select(am654_phy->control, args->args[1]);
>>> + if (ret) {
>>> + dev_err(dev, "Failed to select SERDES Lane Function\n");
>>> + return ERR_PTR(ret);
>>> + }
>>
>> *However*
>>
>> Here you select the mux as the last action, good, but, a mux must
>> be handled with that same care as a locking primitive, i.e.
>> successful selects must be perfectly balanced with deselects. I
>> see no guarantee of that here, since there are other failures
>> possible after the xlate call. So, being last in the function
>> does not really help. If I read the code correctly, the
>> phy core may fail if try_module_get fails in phy_get(). If that
>> ever happens, a successful call to mux_control_select is simply
>> forgotten, and the mux will be locked indefinitely.
>
> Good catch. While adding ->release() ops which is only invoked from phy_put,
> perhaps this was missed. Ideally it should be invoked from other places where
> there is a failure after phy_get.
>>
>> am654_phy->busy will also be set indefinitely, so you will get
>> -EBUSY and not a hard deadlock. At least here, but if the now
>> locked mux control happens to also control some other muxes
>> (probably unlikely, but if), then their consumers will potentially
>> deadlock hard. But that's just after a cursory reading, so I may
>> completely miss something...
>
> The ->busy here should prevent two consumers trying to control the same mux.
Aha, you do not seem to be aware that one mux controller can
have multiple independent consumers (a mux is not like a gpio
or a pwm in that aspect). What I'm talking about is a single
mux control that controls several parallel muxes, each with its
own consumer. In the specific case you are targeting, that may
not be possible due to some hardware reason or something, but
looking at just this driver *it* cannot know that the mux will
be available just because it has a local ->busy flag.
For this case, I get the feeling that the mux may be selected for
a very long time, right? It is never about selecting the mux,
doing something for x milli/microseconds or so and then deselecting
the mux. Perhaps the mux will typically sit in the same state for
the entire uptime of the machine?
If you have these very long access patterns, the sharing capability
of the mux controls are pretty much useless, and I have
contemplating a mux mode to support this case. I.e. where you
lock/unlock the mux control once at probe/release (or similar),
and then basically instead of a shared mux get an exclusive mux
where the consumer is responsible for not making parallel accesses.
In other words, just like a gpio or a pwm.
The problem is that I then need a definition of "long" and "short"
accesses, and I split the mux universe in two...
>> Thinking some more, the above mentioned forgotten phy problem
>> in fact looks like a generic leak in phy_get. It will simply
>> leak any and all phys where the owner module is not yet
>> available. Or, am I missing something?
>
> yes you are right. Ideally we should have invoked ->release() after phy_get()
> failures.
>>
>> Perhaps fixing that is all that is needed to make the mux
>> select/deselect calls guaranteed to be balanced?
>
> That's correct. Thanks for reporting this.
No problem!
Cheers,
Peter
Powered by blists - more mailing lists