Message-ID: <1c511eab-502b-8e4a-392e-79c38e08e858@quicinc.com>
Date: Fri, 15 Sep 2023 11:44:02 +0530
From: Sricharan Ramabadhran <quic_srichara@...cinc.com>
To: Chris Lew <quic_clew@...cinc.com>,
Pavan Kondeti <quic_pkondeti@...cinc.com>,
Praveenkumar I <quic_ipkumar@...cinc.com>
CC: <agross@...nel.org>, <andersson@...nel.org>,
<konrad.dybcio@...aro.org>, <linux-arm-msm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <quic_varada@...cinc.com>
Subject: Re: [PATCH] soc: qcom: qmi: Signal the txn completion after releasing
the mutex
On 8/8/2023 1:56 AM, Chris Lew wrote:
>
>
> On 8/2/2023 1:07 AM, Sricharan Ramabadhran wrote:
>>
>>
>> On 8/2/2023 5:11 AM, Chris Lew wrote:
>>>
>>>
>>> On 8/1/2023 4:13 AM, Sricharan Ramabadhran wrote:
>>>> Hi,
>>>>
>>>> On 8/1/2023 6:06 AM, Chris Lew wrote:
>>>>>
>>>>>
>>>>> On 7/31/2023 8:19 AM, Pavan Kondeti wrote:
>>>>>> On Mon, Jul 31, 2023 at 06:37:55PM +0530, Praveenkumar I wrote:
>>>>>>> txn is in #1 stack
>>>>>>>
>>>>>>>      Worker #1                             Worker #2
>>>>>>>      *********                             *********
>>>>>>>
>>>>>>>   qmi_txn_wait(txn)                   qmi_handle_message
>>>>>>>          |                                    |
>>>>>>>          |                                    |
>>>>>>>   wait_for_completion(&txn->completion)     ....
>>>>>>>          |                            mutex_lock(&txn->lock)
>>>>>>>          |                                    |
>>>>>>>   mutex_lock(&txn->lock)                      |
>>>>>>>                                             ....
>>>>>>>                                       complete(&txn->completion)
>>>>>>>          |                            mutex_unlock(&txn->lock)
>>>>>>>          |
>>>>>>>   mutex_unlock(&txn->lock)
>>>>>>>
>>>>>>> In the case above, while #2 is doing the mutex_unlock(&txn->lock),
>>>>>>> in between releasing the lock and doing the lock-related wakeup,
>>>>>>> #2 gets scheduled out. As a result #1 acquires the lock, unlocks
>>>>>>> it, and also frees the txn (where the lock resides).
>>>>>>>
>>>>>>> Now #2 gets scheduled again and tries to do the rest of the
>>>>>>> lock-related wakeup, but the lock itself is invalid because the
>>>>>>> txn itself is gone.
>>>>>>>
>>>>>>> Fix this by doing the mutex_unlock(&txn->lock) first and then the
>>>>>>> complete(&txn->completion) in #2.
>>>>>>>
>>>>>>> Fixes: 3830d0771ef6 ("soc: qcom: Introduce QMI helpers")
>>>>>>> Cc: stable@...r.kernel.org
>>>>>>> Signed-off-by: Sricharan Ramabadhran <quic_srichara@...cinc.com>
>>>>>>> Signed-off-by: Praveenkumar I <quic_ipkumar@...cinc.com>
>>>>>>> ---
>>>>>>> drivers/soc/qcom/qmi_interface.c | 3 ++-
>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/soc/qcom/qmi_interface.c b/drivers/soc/qcom/qmi_interface.c
>>>>>>> index 78d7361fdcf2..92e29db97359 100644
>>>>>>> --- a/drivers/soc/qcom/qmi_interface.c
>>>>>>> +++ b/drivers/soc/qcom/qmi_interface.c
>>>>>>> @@ -505,12 +505,13 @@ static void qmi_handle_message(struct qmi_handle *qmi,
>>>>>>>              pr_err("failed to decode incoming message\n");
>>>>>>>
>>>>>>>          txn->result = ret;
>>>>>>> -        complete(&txn->completion);
>>>>>>>      } else {
>>>>>>>          qmi_invoke_handler(qmi, sq, txn, buf, len);
>>>>>>>      }
>>>>>>>
>>>>>>>      mutex_unlock(&txn->lock);
>>>>>>> +    if (txn->dest && txn->ei)
>>>>>>> +        complete(&txn->completion);
>>>>>>>  } else {
>>>>>>>      /* Create a txn based on the txn_id of the incoming message */
>>>>>>>      memset(&tmp_txn, 0, sizeof(tmp_txn));
>>>>>>
>>>>>> What happens in the remote scenario where the waiter times out at
>>>>>> the very same time you are releasing the mutex, but before
>>>>>> complete() is called? The caller might end up freeing the txn
>>>>>> structure, which results in the same issue you are currently
>>>>>> facing.
>>>>>>
>>>>>> Thanks,
>>>>>> Pavan
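
For reference, the waiter path that Pavan describes looks roughly like
the below -- qmi_txn_wait() paraphrased from memory, so treat it as a
sketch rather than the exact upstream code:

int qmi_txn_wait(struct qmi_txn *txn, unsigned long timeout)
{
        struct qmi_handle *qmi = txn->qmi;
        int ret;

        /* Returns 0 on timeout; txn->lock is not held here */
        ret = wait_for_completion_timeout(&txn->completion, timeout);

        mutex_lock(&qmi->txn_lock);
        mutex_lock(&txn->lock);
        idr_remove(&qmi->txns, txn->id);
        mutex_unlock(&txn->lock);
        mutex_unlock(&qmi->txn_lock);

        if (ret == 0)
                return -ETIMEDOUT;
        else
                return txn->result;
}

On timeout the waiter removes the txn and returns, and the caller
typically frees the txn right away, so a complete(&txn->completion)
issued after qmi_handle_message() has dropped txn->lock can still land
on freed memory.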
>>>>>
>>>>> I think downstream we had various attempts at moving the signal
>>>>> around trying to avoid this, but hit scenarios like the one Pavan
>>>>> described.
>>>>>
>>>>> We eventually settled on removing the txn->lock and treating
>>>>> qmi->txn_lock as a big lock. This remedied the issue where the
>>>>> txn->lock goes out of scope, since qmi->txn_lock is tied to the
>>>>> qmi handle.
>>>>>
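
For the archive, my understanding of that downstream approach, as an
untested sketch of the qmi_handle_message() response path -- assuming
qmi_txn_wait() is likewise converted to remove the txn under
qmi->txn_lock:

        /* Hold the handle-wide lock across the whole txn access */
        mutex_lock(&qmi->txn_lock);
        txn = idr_find(&qmi->txns, hdr->txn_id);
        if (txn && txn->dest && txn->ei) {
                ret = qmi_decode_message(buf, len, txn->ei, txn->dest);
                if (ret < 0)
                        pr_err("failed to decode incoming message\n");

                txn->result = ret;
                /*
                 * qmi->txn_lock lives in the long-lived qmi handle,
                 * not in the txn, so the unlock below cannot touch
                 * freed memory even if the waiter frees the txn as
                 * soon as it wakes: the waiter must first retake
                 * qmi->txn_lock to remove the txn.
                 */
                complete(&txn->completion);
        } else if (txn) {
                qmi_invoke_handler(qmi, sq, txn, buf, len);
        }
        mutex_unlock(&qmi->txn_lock);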
>>>>
>>>> Ok, agree. Using qmi->txn_lock looks like a better approach.
>>>> That said, this race between mutex lock/unlock looks odd though.
>>>> If I remember correctly, we saw the issue only with
>>>> CONFIG_DEBUG_LOCK_ALLOC. Was that the case for you guys as well?
>>>>
>>>> Otherwise, ideally, handling all members of the object inside the
>>>> lock should be the right solution (i.e., moving the
>>>> wait_for_completion() inside the mutex_lock in qmi_txn_wait).
>>>> That should take care of the scenario that Pavan described too.
>>>>
>>>
>>> No, we saw the issue even without CONFIG_DEBUG_LOCK_ALLOC. The call
>>> stacks always ended up showing that the mutex could be acquired
>>> before mutex_unlock() had completely finished.
>>>
>>> It didn't seem wise to poke at the mutex implementation, so we went
>>> with the qmi->txn_lock.
>>
>> Ok, that's strange. That effectively means mutex_lock/unlock are not
>> protecting the critical section? Then qmi->txn_lock would also run
>> into a similar issue? I guess in this case, when the issue hit, the
>> txn (in which the lock resides) was going out of scope while the txn
>> was still in use in the other thread. That effectively looks like a
>> mutex issue. While the downstream change to use qmi->txn_lock would
>> fix the mutex issue, we will have to check whether the txn object
>> itself is protected correctly.
>>
>
> I looked into this a bit more. I think the unlock was going into
> __mutex_unlock_slowpath() because there is a waiter on the txn->lock.
> In the slow path there are two sections of code: one where we release
> the owner and another where we notify the waiters. It doesn't look
> like there is anything to prevent preemption between the two sections.
>
> kernel/locking/mutex.c
>
> static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock,
> unsigned long ip)
> {
> ...
> /*
> * Release the lock before (potentially) taking the spinlock such that
> * other contenders can get on with things ASAP.
> *
> * Except when HANDOFF, in that case we must not clear the owner field,
> * but instead set it to the top waiter.
> */
>
> A mutex is able to guarantee mutual exclusion on the critical sections
> that we enclose in locks. It is not able to guarantee the lifetime of
> an object; that would have to be done through a kref-like mechanism or
> code organization. In this case, relying on qmi->txn_lock would be
> relying on code organization to guarantee the lock won't go out of
> scope/be freed.
>
Sorry for the delayed response, I missed it.
Yup, exactly. That is why I suspected earlier that in your case as well
CONFIG_DEBUG_LOCK_ALLOC might have been enabled; it was the same for
us. Now it is clear for both cases.
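
On the lifetime point, a kref-based scheme would look roughly like the
below. This is illustrative only -- in the current code the txn
normally lives on the caller's stack, so it would mean moving the txn
to a heap allocation:

        struct qmi_txn {
                struct kref ref;
                /* existing members: lock, completion, result, ... */
        };

        static void qmi_txn_release(struct kref *ref)
        {
                kfree(container_of(ref, struct qmi_txn, ref));
        }

        /* handler side: pin the txn across the wakeup */
        kref_get(&txn->ref);
        mutex_unlock(&txn->lock);
        complete(&txn->completion);
        kref_put(&txn->ref, qmi_txn_release); /* may free here, safely */

With that, the waiter would drop its own reference instead of freeing
the txn directly, and the wakeup could never touch freed memory.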
Regards,
Sricharan