Message-ID: <73f25c8f-6193-6001-d3ff-b7fd060cce83@quicinc.com>
Date: Tue, 1 Aug 2023 16:41:16 -0700
From: Chris Lew <quic_clew@...cinc.com>
To: Sricharan Ramabadhran <quic_srichara@...cinc.com>,
Pavan Kondeti <quic_pkondeti@...cinc.com>,
Praveenkumar I <quic_ipkumar@...cinc.com>
CC: <agross@...nel.org>, <andersson@...nel.org>,
<konrad.dybcio@...aro.org>, <linux-arm-msm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <quic_varada@...cinc.com>
Subject: Re: [PATCH] soc: qcom: qmi: Signal the txn completion after releasing
the mutex
On 8/1/2023 4:13 AM, Sricharan Ramabadhran wrote:
> Hi,
>
> On 8/1/2023 6:06 AM, Chris Lew wrote:
>>
>>
>> On 7/31/2023 8:19 AM, Pavan Kondeti wrote:
>>> On Mon, Jul 31, 2023 at 06:37:55PM +0530, Praveenkumar I wrote:
>>>> txn is on #1's stack
>>>>
>>>> Worker #1                                 Worker #2
>>>> ********                                  *********
>>>>
>>>> qmi_txn_wait(txn)                         qmi_handle_message
>>>>         |                                         |
>>>>         |                                         |
>>>> wait_for_complete(txn->complete)                 ....
>>>>         |                                 mutex_lock(txn->lock)
>>>>         |                                         |
>>>> mutex_lock(txn->lock)                             |
>>>>     .....                                 complete(txn->completion)
>>>>         |                                 mutex_unlock(txn->lock)
>>>>         |
>>>> mutex_unlock(txn->lock)
>>>>
>>>> In the case above, while #2 is doing the mutex_unlock(txn->lock), it
>>>> gets scheduled out in between releasing the lock and doing the rest
>>>> of the lock-related wakeup. As a result #1 acquires the lock, unlocks
>>>> it and then also frees the txn (where the lock resides).
>>>>
>>>> Now #2 gets scheduled again and tries to do the rest of the
>>>> lock-related wakeup, but the lock itself is invalid because the txn
>>>> itself is gone.
>>>>
>>>> Fix this by doing the mutex_unlock(txn->lock) first and then
>>>> complete(txn->completion) in #2.
>>>>
>>>> Fixes: 3830d0771ef6 ("soc: qcom: Introduce QMI helpers")
>>>> Cc: stable@...r.kernel.org
>>>> Signed-off-by: Sricharan Ramabadhran <quic_srichara@...cinc.com>
>>>> Signed-off-by: Praveenkumar I <quic_ipkumar@...cinc.com>
>>>> ---
>>>> drivers/soc/qcom/qmi_interface.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/soc/qcom/qmi_interface.c b/drivers/soc/qcom/qmi_interface.c
>>>> index 78d7361fdcf2..92e29db97359 100644
>>>> --- a/drivers/soc/qcom/qmi_interface.c
>>>> +++ b/drivers/soc/qcom/qmi_interface.c
>>>> @@ -505,12 +505,13 @@ static void qmi_handle_message(struct qmi_handle *qmi,
>>>>                                  pr_err("failed to decode incoming message\n");
>>>>
>>>>                          txn->result = ret;
>>>> -                        complete(&txn->completion);
>>>>                  } else {
>>>>                          qmi_invoke_handler(qmi, sq, txn, buf, len);
>>>>                  }
>>>>
>>>>                  mutex_unlock(&txn->lock);
>>>> +                if (txn->dest && txn->ei)
>>>> +                        complete(&txn->completion);
>>>>          } else {
>>>>                  /* Create a txn based on the txn_id of the incoming message */
>>>>                  memset(&tmp_txn, 0, sizeof(tmp_txn));
>>>
>>> What happens in a remote scenario where the waiter gets timed out at the
>>> very same time you are releasing the mutex but before calling
>>> complete()? The caller might end up freeing txn structure and it results
>>> in the same issue you are currently facing.
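>>>
>>> Roughly, the timed-out waiter path looks like this (a simplified
>>> sketch of qmi_txn_wait() from memory, not the exact upstream code):
>>>
>>>         /* waiter: wake up, either via complete() or via timeout */
>>>         ret = wait_for_completion_timeout(&txn->completion, timeout);
>>>
>>>         /* drop the txn from the idr so no new lookups find it */
>>>         mutex_lock(&qmi->txn_lock);
>>>         mutex_lock(&txn->lock);
>>>         idr_remove(&qmi->txns, txn->id);
>>>         mutex_unlock(&txn->lock);
>>>         mutex_unlock(&qmi->txn_lock);
>>>
>>>         if (ret == 0)
>>>                 return -ETIMEDOUT;
>>>
>>> If the timeout fires in the window after your mutex_unlock() but
>>> before complete(), nothing stops this path from returning and the
>>> caller from freeing the txn, so the later complete(&txn->completion)
>>> would touch freed memory.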
>>>
>>> Thanks,
>>> Pavan
>>
>> I think downstream we had various attempts at moving the signal around
>> trying to avoid this, but hit scenarios like the one Pavan described.
>>
>> We eventually settled on removing the txn->lock and treating the
>> qmi->txn_lock as a big lock. This remedied the issue of the txn->lock
>> going out of scope, since the qmi->txn_lock is tied to the qmi
>> handle.
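>>
>> Something roughly along these lines (an untested sketch of the idea,
>> not the actual downstream patch; the idr lookup details are from my
>> recollection of qmi_handle_message()):
>>
>>         mutex_lock(&qmi->txn_lock);
>>         txn = idr_find(&qmi->txns, txn_id);
>>         if (!txn) {
>>                 mutex_unlock(&qmi->txn_lock);
>>                 return;
>>         }
>>
>>         if (txn->dest && txn->ei) {
>>                 ret = qmi_decode_message(buf, len, txn->ei, txn->dest);
>>                 if (ret < 0)
>>                         pr_err("failed to decode incoming message\n");
>>
>>                 txn->result = ret;
>>                 complete(&txn->completion);
>>         } else {
>>                 qmi_invoke_handler(qmi, sq, txn, buf, len);
>>         }
>>         mutex_unlock(&qmi->txn_lock);
>>
>> Since the waiter has to take qmi->txn_lock to remove the txn before
>> its caller can free it, holding qmi->txn_lock from the lookup through
>> the complete() keeps the txn from disappearing underneath the handler,
>> and the lock itself lives in the qmi handle rather than in the txn.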
>>
>
> Ok, agreed. Using qmi->txn_lock looks like a better approach.
> That said, this race between mutex lock/unlock looks odd though.
> If I remember correctly, we saw the issue only with
> CONFIG_DEBUG_LOCK_ALLOC. Was that the case for you guys as well?
>
> Otherwise, ideally handling all members of the object inside the lock
> should be the right solution (i.e. moving the wait_for_complete(txn)
> inside the mutex_lock in qmi_txn_wait). That should take care of the
> scenario that Pavan described too.
>
No, we saw the issue even without CONFIG_DEBUG_LOCK_ALLOC. The call
stacks always ended up showing that the mutex could be acquired before
mutex_unlock() had completely finished.

It didn't seem wise to poke at the mutex implementation, so we went
with the qmi->txn_lock.
> Regards,
> Sricharan
>