[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7b1c5e40-b11d-421b-8c8b-117a2a53298b@quicinc.com>
Date: Tue, 11 Mar 2025 16:29:42 +0800
From: Miaoqing Pan <quic_miaoqing@...cinc.com>
To: Johan Hovold <johan@...nel.org>
CC: <quic_jjohnson@...cinc.com>, <ath11k@...ts.infradead.org>,
<linux-wireless@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<johan+linaro@...nel.org>
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
length
On 3/10/2025 6:09 PM, Johan Hovold wrote:
> On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
>> A relatively unusual race condition occurs between host software
>> and hardware, where the host sees the updated destination ring head
>> pointer before the hardware updates the corresponding descriptor.
>> When this situation occurs, the length of the descriptor returns 0.
>
> I still think this description is too vague and it doesn't explain how
> this race is even possible. It sounds like there's a bug somewhere in
> the driver or firmware, but if this really is an indication the hardware
> is broken as your reply here seems to suggest:
>
> https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/
>
> then that too should be highlighted in the commit message (e.g. by
> describing this as "working around broken hardware").
>
Ok, will do.
>> The current error handling method is to increment descriptor tail
>> pointer by 1, but 'sw_index' is not updated, causing descriptor and
>> skb to not correspond one-to-one, resulting in the following error:
>>
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>>
>> To address this problem, temporarily skip processing the current
>> descriptor and handle it again next time. However, to prevent this
>> descriptor from continuously returning 0, use skb cb to set a flag.
>> If the length returns 0 again, this descriptor will be discarded.
>
> The ath12k ring-buffer handling looks very similar. Do you need a
> corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
> sure that this (hardware) bug only affects ath11k devices?
>
Yes, will submit a similar patch.
>> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>> - if (*nbytes == 0) {
>> - ret = -EIO;
>> - goto err;
>> + if (unlikely(*nbytes == 0)) {
>> + struct ath11k_skb_rxcb *rxcb =
>> + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
>> +
>> + /* A relatively unusual race condition occurs between host
>> + * software and hardware, where the host sees the updated
>> + * destination ring head pointer before the hardware updates
>> + * the corresponding descriptor.
>> + *
>> + * Temporarily skip processing the current descriptor and handle
>> + * it again next time. However, to prevent this descriptor from
>> + * continuously returning 0, set 'is_desc_len0' flag. If the
>> + * length returns 0 again, this descriptor will be discarded.
>> + */
>> + if (!rxcb->is_desc_len0) {
>> + rxcb->is_desc_len0 = true;
>> + ret = -EIO;
>> + goto err;
>> + }
>> }
>
> I'm still waiting for feedback from one user that can reproduce the
> ring-buffer corruption very easily, but another user mentioned seeing
> multiple zero-length descriptor warnings over the weekend when running
> with this patch:
>
> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>
> Are there ever any valid reasons for seeing a zero-length descriptor
> (i.e. unrelated to the race at hand)? IIUC the warning would only be
> printed when processing such descriptors a second time (i.e. when
> is_desc_len0 is set).
>
That's exactly the logic, only can see the warning in a second time. For
the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
Powered by blists - more mailing lists