linux-kernel - Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f7f7923e-c5ee-4f47-8eba-972bf7f08c9d@linaro.org>
Date: Tue, 4 Mar 2025 13:30:46 +0000
From: Bryan O'Donoghue <bryan.odonoghue@...aro.org>
To: Vikash Garodia <quic_vgarodia@...cinc.com>,
 Vedang Nagar <quic_vnagar@...cinc.com>,
 Stanimir Varbanov <stanimir.k.varbanov@...il.com>,
 Mauro Carvalho Chehab <mchehab@...nel.org>
Cc: linux-media@...r.kernel.org, linux-arm-msm@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read
 access

On 03/03/2025 15:01, Vikash Garodia wrote:
> 
> On 3/3/2025 8:26 PM, Bryan O'Donoghue wrote:
>> On 03/03/2025 13:12, Vikash Garodia wrote:
>>>
>>> On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote:
>>>> On 02/03/2025 11:58, Vedang Nagar wrote:
>>>>>>
>>>>>> The basic question : what is the lifetime of the data from RX interrupt to
>>>>>> consumption by another system agent, DSP, userspace, whatever ?
>>>>> As mentioned in [1], With the regular firmware, after RX interrupt the data
>>>>> can be considered as valid until next interrupt is raised, but with the rouge
>>>>> firmware, data can get invalid during the second read and our intention is to
>>>>> avoid out of bound access read because of such issues.
>>>>
>>>> This is definitely the part I don't compute.
>>>>
>>>> 1. RX interrupt
>>>> 2. Frame#0 Some amount of time data is always valid
>>> This is not correct. Its not the amount of time which determines the validity of
>>> the data, its the possibility of rogue firmware which, if incase, puts up the
>>> date in shared queue, would always be invalid, irrespective of time.
>>>
>>>> 3. RX interrupt - new data
>>>> 4. Frame#1 new data delivered into a buffer
>>>>
>>>> Are you describing a case between RX interrupts 1-3 or a case after 1-4?
>>>>
>>>> Why do we need to write code for rouge firmware anyway ?
>>> It is a way to prevent any possibility of OOB, similar to how any API does check
>>> for validity of any arguments passed to it, prior to processing.
>>>>
>>>> And the real question - if the data can be invalidated in the 1-3 window above
>>>> when is the safe time to snapshot that data ?
>>>>
>>>> We seem to have alot of submissions to deal with 'rouge' firmware without I
>>>> think properly describing the problem of the _expected_ data lifetime.
>>>>
>>>> So
>>>>
>>>> a) What is the expected data lifetime of an RX buffer between one
>>>>      RX IRQ and the next ?
>>>>      I hope the answer to this is - APSS owns the buffer.
>>>>      This is BTW usually the case in these types of asymmetric setups
>>>>      with a flag or some other kind of semaphore that indicates which
>>>>      side of the data-exchange owns the buffer.
>>>>
>>>> b) In this rouge - buggy - firmware case what is the scope of the
>>>>      potential race condition ?
>>>>
>>>>      What I'd really like to know here is why we have to seemingly
>>>>      memcpy() again and again in seemingly incongrous and not
>>>>      immediately obvious places in the code.
>>>>
>>>>      Would we not be better advised to do a memcpy() of the entire
>>>>      RX frame in the RX IRQ handler path if as you appear to me
>>>>      suggesting - the firmware can "race" with the APSS
>>>>      i.e. the data-buffer ownership flag either doesn't work
>>>>      or isn't respected by one side in the data-exchange.
>>>>
>>>> Can we please have a detailed description of the race condition here ?
>>> Below is the report which the reporter reported leading to OOB, let me know if
>>> you are unable to deduce the trail leading to OOB here.
>>>
>>> OOB read issue is in function event_seq_changed, please reference below code
>>> snippet:
>>>
>>> Buggy code snippet:
>>>
>>> static void event_seq_changed(struct venus_core *core, struct venus_inst *inst,
>>>           struct hfi_msg_event_notify_pkt *pkt)
>>> ...
>>> num_properties_changed = pkt->event_data2; //num_properties_changed is from
>>> message and is not validated.
>>> ...
>>> data_ptr = (u8 *)&pkt->ext_event_data[0];
>>> do {
>>>    ptype = *((u32 *)data_ptr);
>>>    switch (ptype) {
>>>    case HFI_PROPERTY_PARAM_FRAME_SIZE:
>>>     data_ptr += sizeof(u32);
>>>     frame_sz = (struct hfi_framesize *)data_ptr;
>>>     event.width = frame_sz->width;
>>> ...
>>>    }
>>>    num_properties_changed--;
>>> } while (num_properties_changed > 0);
>>> ```
>>> There is no validation against `num_properties_changed = pkt->event_data2`, so
>>> OOB read occurs.
>>>>
>>>> I don't doubt the new memcpy() makes sense to you but without this detailed
>>>> understanding of the underlying problem its virtually impossible to debate the
>>>> appropriate remediation - perhaps this patch you've submitted - or some other
>>>> solution.
>>>>
>>>> Sorry to dig into my trench here but, way more detail is needed.
>>>>
>>>>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-
>>>>> b4a0aad1069e@...aro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>>>>>
>>>>>> Why is it in this small specific window that the data can change but not
>>>>>> later ? What is the mechanism the data can change and how do the changes you
>>>>>> propose here address the data lifetime problem ?
>>>>> Currently this issue has been discovered by external researchers at this
>>>>> point, but if any such OOB issue is discovered at later point as well then we
>>>>> shall fix them as well.
>>>>
>>>> Right but, I'm looking for a detailed description of the problem.
>>>>
>>>> Can you describe from RX interrupt again what the expected data lifetime of the
>>>> RX frame is, which I hope we agree is until the next RX interrupt associated
>>>> with a given buffer with an ownership flag shared between firmware and APSS -
>>>> and then under what circumstances that "software contract" is being violated.
>>>>
>>>>> Also, with rougue firmware we cannot fix the data lifetime problem in my
>>>>> opinion, but atleast we can fix the out of bound issues.
>>>>>>
>>>>>> Without that context, I don't believe it is really possible to validate an
>>>>>> additional memcpy() here and there in the code as fixing anything.
>>>>> There is no additional memcpy() now in the v2 patch, but as part of the fix,
>>>>> we are just trying to retain the length of the packet which was being read in
>>>>> the first memcpy() to avoid the OOB read access.
>>>>
>>>> I can't make a suggestion because - personally speaking I still don't quite
>>>> understand the data-race you are describing.
>>> Go through the reports from the reporter, it was quite evident in leading upto
>>> OOB case.
>>> Putting up the sequence for you to go over the interrupt handling and message
>>> queue parsing of the packets from firmware
>>> 1.
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082
>>> 2.
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816
>>> 3. event handling (this particular case)
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658
>>> 4.
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22
>>>
>>> the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from
>>> shared queue.
>>>
>>>>
>>>> I get that you say the firmware is breaking the contract but, without more
>>>> detail on _how_ it breaks that contract I don't think it's really possible to
>>>> validate your fix here, fixes anything.
>>>>
>>>> ---
>>>> bod
>>>
>>> Regards,
>>> Vikash
>>
>> I'll go through all of these links given here, thanks.
> I would request you to go through the description putup by the reporter of this
> OOB as well, i added in my earlier response. It provided a good background of
> how the firmware response can led to this particular OOB, atleast that was the
> source of OOB info for us.
> Regards,
> Vikash
>>
>> Whatever the result of the review, this detail needs to go into the commit log
>> so that a reviewer can reasonably read the problem description and evaluate
>> against submitted code as a fix.
>>
>> ---
>> bod

Replied to the wrong patch.

There is no memcpy() in this patch - there was in patch #1 which has 
subsequently been dropped.

Anyway I'll address my comment there.

---
bod