Message-ID: <c2f50606-5589-276d-e106-8ee01c37eabe@quicinc.com>
Date: Wed, 2 Aug 2023 13:37:26 +0530
From: Sricharan Ramabadhran <quic_srichara@...cinc.com>
To: Chris Lew <quic_clew@...cinc.com>,
Pavan Kondeti <quic_pkondeti@...cinc.com>,
Praveenkumar I <quic_ipkumar@...cinc.com>
CC: <agross@...nel.org>, <andersson@...nel.org>,
<konrad.dybcio@...aro.org>, <linux-arm-msm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <quic_varada@...cinc.com>
Subject: Re: [PATCH] soc: qcom: qmi: Signal the txn completion after releasing
the mutex
On 8/2/2023 5:11 AM, Chris Lew wrote:
>
>
> On 8/1/2023 4:13 AM, Sricharan Ramabadhran wrote:
>> Hi,
>>
>> On 8/1/2023 6:06 AM, Chris Lew wrote:
>>>
>>>
>>> On 7/31/2023 8:19 AM, Pavan Kondeti wrote:
>>>> On Mon, Jul 31, 2023 at 06:37:55PM +0530, Praveenkumar I wrote:
>>>>> txn is in #1 stack
>>>>>
>>>>> Worker #1                              Worker #2
>>>>> ********                               *********
>>>>>
>>>>> qmi_txn_wait(txn)                      qmi_handle_message
>>>>>    |                                        |
>>>>>    |                                        |
>>>>> wait_for_complete(txn->complete)           ....
>>>>>    |                                   mutex_lock(txn->lock)
>>>>>    |                                        |
>>>>> mutex_lock(txn->lock)                       |
>>>>>    .....                               complete(txn->lock)
>>>>>    |                                   mutex_unlock(txn->lock)
>>>>>    |
>>>>> mutex_unlock(txn->lock)
>>>>>
>>>>> In this case above, while #2 is doing the mutex_unlock(txn->lock),
>>>>> in between releasing lock and doing other lock related wakeup, #2 gets
>>>>> scheduled out. As a result #1, acquires the lock, unlocks, also
>>>>> frees the txn also (where the lock resides)
>>>>>
>>>>> Now #2, gets scheduled again and tries to do the rest of the lock
>>>>> related wakeup, but lock itself is invalid because txn itself is gone.
>>>>>
>>>>> Fixing this, by doing the mutex_unlock(txn->lock) first and then
>>>>> complete(txn->lock) in #2
>>>>>
>>>>> Fixes: 3830d0771ef6 ("soc: qcom: Introduce QMI helpers")
>>>>> Cc: stable@...r.kernel.org
>>>>> Signed-off-by: Sricharan Ramabadhran <quic_srichara@...cinc.com>
>>>>> Signed-off-by: Praveenkumar I <quic_ipkumar@...cinc.com>
>>>>> ---
>>>>> drivers/soc/qcom/qmi_interface.c | 3 ++-
>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/soc/qcom/qmi_interface.c
>>>>> b/drivers/soc/qcom/qmi_interface.c
>>>>> index 78d7361fdcf2..92e29db97359 100644
>>>>> --- a/drivers/soc/qcom/qmi_interface.c
>>>>> +++ b/drivers/soc/qcom/qmi_interface.c
>>>>> @@ -505,12 +505,13 @@ static void qmi_handle_message(struct
>>>>> qmi_handle *qmi,
>>>>> pr_err("failed to decode incoming message\n");
>>>>> txn->result = ret;
>>>>> - complete(&txn->completion);
>>>>> } else {
>>>>> qmi_invoke_handler(qmi, sq, txn, buf, len);
>>>>> }
>>>>> mutex_unlock(&txn->lock);
>>>>> + if (txn->dest && txn->ei)
>>>>> + complete(&txn->completion);
>>>>> } else {
>>>>> /* Create a txn based on the txn_id of the incoming
>>>>> message */
>>>>> memset(&tmp_txn, 0, sizeof(tmp_txn));
>>>>
>>>> What happens in a remote scenario where the waiter gets timed out at
>>>> the very same time you are releasing the mutex but before calling
>>>> complete()? The caller might end up freeing txn structure and it
>>>> results in the same issue you are currently facing.
>>>>
>>>> Thanks,
>>>> Pavan
>>>
>>> I think downstream we had various attempts of moving the signal
>>> around trying to avoid this, but hit scenarios like the one Pavan
>>> described.
>>>
>>> We eventually settled on removing the txn->lock and treating the
>>> qmi->txn_lock as a big lock. This remedied the issue where the
>>> txn->lock goes out of scope since qmi->txn_lock is tied to the qmi
>>> handle.
>>>
>>
>> ok agree. Using qmi->txn_lock looks a better approach.
>> That said, this race between mutex lock/unlock looks odd though.
>> If i remember we saw the issue only with CONFIG_DEBUG_LOCK_ALLOC.
>> Was that the same case for you guys as well ?
>>
>> Otherwise, ideally handling all members of the object inside lock
>> should be the right solution (ie moving the wait_for_complete(txn)
>> inside the mutex_lock in qmi_txn_wait. That should take care of the
>> scenario that Pavan described too.
>>
>
> No, we saw the issue even without CONFIG_DEBUG_LOCK_ALLOC. The
> callstacks always ended up showing that the mutex could be acquired
> before mutex_unlock() completely finished.
>
> It didn't seem wise to poke at the mutex implementation so we went with
> the txn_lock.
OK, that's strange. That would effectively mean mutex_lock()/mutex_unlock()
are not protecting the critical section? Then qmi->txn_lock would run
into a similar issue too? My guess is that, when the issue hit, the txn
(which holds the lock) was going out of scope while another thread was
still using it, and that is what shows up as a mutex issue. While the
downstream change to use qmi->txn_lock would fix the mutex issue, we
will still have to check whether the txn object itself is protected
correctly.
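To make the lifetime question concrete, here is a minimal userspace
sketch of the "big lock" idea using pthreads. It is only an analogy
under stated assumptions, not the downstream change and not kernel
code: struct demo_handle, struct demo_txn, demo_responder(),
demo_txn_wait() and run_demo() are all hypothetical names. The point it
illustrates is that both the lock and the wakeup live in the handle, so
the responder never touches memory that the waiter may already have
released.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's QMI types. */
struct demo_txn {
	int result;
	bool done;
};

struct demo_handle {
	pthread_mutex_t txn_lock;	/* lives in the handle, outlives any txn */
	pthread_cond_t txn_done;	/* wakeup object also lives in the handle */
	struct demo_txn *pending;	/* NULL once the waiter has unpublished it */
};

/* Responder: only touches the txn while holding the handle-level lock. */
static void *demo_responder(void *arg)
{
	struct demo_handle *h = arg;

	pthread_mutex_lock(&h->txn_lock);
	if (h->pending) {
		h->pending->result = 42;
		h->pending->done = true;
		pthread_cond_broadcast(&h->txn_done);
	}
	/* The unlock touches only handle memory, never the txn. */
	pthread_mutex_unlock(&h->txn_lock);
	return NULL;
}

/* Waiter: unpublishes the txn under the lock before it goes out of scope. */
static int demo_txn_wait(struct demo_handle *h, struct demo_txn *txn)
{
	int result;

	pthread_mutex_lock(&h->txn_lock);
	assert(h->pending == txn);
	while (!txn->done)
		pthread_cond_wait(&h->txn_done, &h->txn_lock);
	result = txn->result;
	h->pending = NULL;
	pthread_mutex_unlock(&h->txn_lock);
	return result;
}

/* The txn is on this function's stack, like "txn is in #1 stack" above. */
int run_demo(void)
{
	struct demo_handle h;
	struct demo_txn txn = { .result = -1, .done = false };
	pthread_t responder;
	int result;

	pthread_mutex_init(&h.txn_lock, NULL);
	pthread_cond_init(&h.txn_done, NULL);
	h.pending = &txn;

	if (pthread_create(&responder, NULL, demo_responder, &h) != 0)
		return -1;

	result = demo_txn_wait(&h, &txn);

	/* The handle is torn down only after the responder has finished. */
	pthread_join(responder, NULL);
	pthread_cond_destroy(&h.txn_done);
	pthread_mutex_destroy(&h.txn_lock);
	return result;
}
```

Calling run_demo() from a small main() and building with -pthread is
enough to try it. Whether the real txn object is covered the same way
under qmi->txn_lock is exactly what would still need checking.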
Regards,
Sricharan