linux-kernel - Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient length

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <7b1c5e40-b11d-421b-8c8b-117a2a53298b@quicinc.com>
Date: Tue, 11 Mar 2025 16:29:42 +0800
From: Miaoqing Pan <quic_miaoqing@...cinc.com>
To: Johan Hovold <johan@...nel.org>
CC: <quic_jjohnson@...cinc.com>, <ath11k@...ts.infradead.org>,
        <linux-wireless@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <johan+linaro@...nel.org>
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
 length



On 3/10/2025 6:09 PM, Johan Hovold wrote:
> On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
>> A relatively unusual race condition occurs between host software
>> and hardware, where the host sees the updated destination ring head
>> pointer before the hardware updates the corresponding descriptor.
>> When this situation occurs, the length of the descriptor returns 0.
> 
> I still think this description is too vague and it doesn't explain how
> this race is even possible. It sounds like there's a bug somewhere in
> the driver or firmware, but if this really is an indication the hardware
> is broken as your reply here seems to suggest:
> 
> 	https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/
> 
> then that too should be highlighted in the commit message (e.g. by
> describing this as "working around broken hardware").
> 
Ok, will do.

>> The current error handling method is to increment descriptor tail
>> pointer by 1, but 'sw_index' is not updated, causing descriptor and
>> skb to not correspond one-to-one, resulting in the following error:
>>
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>>
>> To address this problem, temporarily skip processing the current
>> descriptor and handle it again next time. However, to prevent this
>> descriptor from continuously returning 0, use skb cb to set a flag.
>> If the length returns 0 again, this descriptor will be discarded.
> 
> The ath12k ring-buffer handling looks very similar. Do you need a
> corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
> sure that this (hardware) bug only affects ath11k devices?
>   
Yes, will submit a similar patch.

>>   	*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>> -	if (*nbytes == 0) {
>> -		ret = -EIO;
>> -		goto err;
>> +	if (unlikely(*nbytes == 0)) {
>> +		struct ath11k_skb_rxcb *rxcb =
>> +			ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
>> +
>> +		/* A relatively unusual race condition occurs between host
>> +		 * software and hardware, where the host sees the updated
>> +		 * destination ring head pointer before the hardware updates
>> +		 * the corresponding descriptor.
>> +		 *
>> +		 * Temporarily skip processing the current descriptor and handle
>> +		 * it again next time. However, to prevent this descriptor from
>> +		 * continuously returning 0, set 'is_desc_len0' flag. If the
>> +		 * length returns 0 again, this descriptor will be discarded.
>> +		 */
>> +		if (!rxcb->is_desc_len0) {
>> +			rxcb->is_desc_len0 = true;
>> +			ret = -EIO;
>> +			goto err;
>> +		}
>>   	}
> 
> I'm still waiting for feedback from one user that can reproduce the
> ring-buffer corruption very easily, but another user mentioned seeing
> multiple zero-length descriptor warnings over the weekend when running
> with this patch:
> 
> 	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
> 
> Are there ever any valid reasons for seeing a zero-length descriptor
> (i.e. unrelated to the race at hand)? IIUC the warning would only be
> printed when processing such descriptors a second time (i.e. when
> is_desc_len0 is set).
> 

That's exactly the logic, only can see the warning in a second time. For 
the first time, ath12k_ce_completed_recv_next() returns '-EIO'.