[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8ea7fe7c-7b4d-4a6f-ae03-b9ca127c23f8@quicinc.com>
Date: Thu, 13 Mar 2025 21:31:56 +0800
From: Miaoqing Pan <quic_miaoqing@...cinc.com>
To: Johan Hovold <johan@...nel.org>
CC: Jeff Johnson <jeff.johnson@....qualcomm.com>, <ath11k@...ts.infradead.org>,
<linux-wireless@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<johan+linaro@...nel.org>
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
length
On 3/13/2025 12:43 AM, Johan Hovold wrote:
> On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
>> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
>>> On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
>>>> On 3/10/2025 6:09 PM, Johan Hovold wrote:
>>>>> I'm still waiting for feedback from one user that can reproduce the
>>>>> ring-buffer corruption very easily, but another user mentioned seeing
>>>>> multiple zero-length descriptor warnings over the weekend when running
>>>>> with this patch:
>>>>>
>>>>> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>>>>>
>>>>> Are there ever any valid reasons for seeing a zero-length descriptor
>>>>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
>>>>> printed when processing such descriptors a second time (i.e. when
>>>>> is_desc_len0 is set).
>>>>
>>>> That's exactly the logic, only can see the warning in a second time. For
>>>> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
>>>
>>> That didn't answer Johan's first question:
>>> Are there ever any valid reasons for seeing a zero-length descriptor?
>>
>> The events currently observed are all firmware logs. The discarded
>> packets will not affect normal operation. I will adjust the logging to
>> debug level.
>
> That still does not answer the question whether there are ever any valid
> reasons for seeing zero-length descriptors. I assume there are none?
>
>>> We have an issue that there is a race condition where hardware updates the
>>> pointer before it has filled all the data. The current solution is just to
>>> read the data a second time. But if that second read also occurs before
>>> hardware has updated the data, then the issue isn't fixed.
>>
>> Thanks for the addition.
>>
>>> So should there be some forced delay before we read a second time?
>>> Or should we attempt to read more times?
>>
>> The initial fix was to keep waiting for the data to be ready. The
>> observed phenomenon is that when the second read fails, subsequent reads
>> will continue to fail until the firmware's CE2 ring is full and triggers
>> an assert after timeout. However, this situation is relatively rare, and
>> in most cases, the second read will succeed. Therefore, adding a delay
>> or multiple read attempts is not useful.
>
> The proposed fix is broken since ath11k_hal_ce_dst_status_get_length()
> not just reads the length but also sets it to zero so that the updated
> length may indeed never be seen.
>
> I've taken a closer look at the driver and it seems like we're missing a
> read barrier to make sure that the updated descriptor is not read until
> after the head pointer.
>
> Miaoqing, could you try the below patch with your reproducer and see if
> it is enough to fix the corruption?
>
> If so I can resend with the warning removed and include a corresponding
> fix for ath12k (it looks like there are further places where barriers
> are missing too).
>
> Johan
>
>
> From 656dbd0894741445aeb16ee8357e6fef51b6084c Mon Sep 17 00:00:00 2001
> From: Johan Hovold <johan+linaro@...nel.org>
> Date: Wed, 12 Mar 2025 16:49:20 +0100
> Subject: [PATCH] wifi: ath11k: fix ring-buffer corruption
>
> Users of the Lenovo ThinkPad X13s have reported that Wi-Fi sometimes
> breaks and the log fills up with errors like:
>
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>
> which based on a quick look at the driver seemed to indicate some kind
> of ring-buffer corruption.
>
> Miaoqing Pan tracked it down to the host seeing the updated destination
> ring head pointer before the updated descriptor, and the error handling
> for that in turn leaves the ring buffer in an inconsistent state.
>
> Add the missing the read barrier to make sure that the descriptor is
> read after the head pointer to address the root cause of the corruption.
>
> The error handling can be fixed separately in case there can ever be
> actual zero-length descriptors.
>
> FIXME: remove WARN_ON_ONCE() added for verification purposes
>
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
>
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
> Link: https://lore.kernel.org/20250310010217.3845141-3-quic_miaoqing@quicinc.com
> Cc: Miaoqing Pan <quic_miaoqing@...cinc.com>
> Cc: stable@...r.kernel.org # 5.6
> Signed-off-by: Johan Hovold <johan+linaro@...nel.org>
> ---
> drivers/net/wireless/ath/ath11k/ce.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
> index e66e86bdec20..423b970e288c 100644
> --- a/drivers/net/wireless/ath/ath11k/ce.c
> +++ b/drivers/net/wireless/ath/ath11k/ce.c
> @@ -393,8 +393,12 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
> goto err;
> }
>
> + /* Make sure descriptor is read after the head pointer. */
> + dma_rmb();
> +
> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> if (*nbytes == 0) {
> + WARN_ON_ONCE(1); // FIXME: remove
> ret = -EIO;
> goto err;
> }
This issue can still be reproduced.
[ 3283.687469] WARNING: CPU: 0 PID: 0 at
/drivers/net/wireless/ath/ath11k/ce.c:405
ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
[ 3283.688685] Call trace:
[ 3283.688692] ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
[ 3283.688807] ath11k_pcic_ce_tasklet+0x34/0x54 [ath11k]
[ 3283.688920] tasklet_action_common.isra.0+0xec/0x338
[ 3283.688958] tasklet_action+0x28/0x34
[ 3283.688972] handle_softirqs+0x120/0x36c
Powered by blists - more mailing lists