lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <a8fc9f07-013c-4e31-9d9e-46e042d81dbf@quicinc.com>
Date: Wed, 19 Mar 2025 14:47:12 +0800
From: Miaoqing Pan <quic_miaoqing@...cinc.com>
To: Johan Hovold <johan@...nel.org>
CC: Jeff Johnson <jeff.johnson@....qualcomm.com>, <ath11k@...ts.infradead.org>,
        <linux-wireless@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <johan+linaro@...nel.org>
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
 length



On 3/19/2025 1:42 AM, Johan Hovold wrote:
> On Tue, Mar 18, 2025 at 03:53:39PM +0800, Miaoqing Pan wrote:
>> On 3/17/2025 9:04 PM, Johan Hovold wrote:
> 
>>> Then it seems we are looking at two separate root causes for the
>>> corruption as the memory barrier appears to be all that is needed to fix
>>> the X13s issue.
>>>
>>> A user who hit the corruption after 2 h without the fix has been running
>>> over the weekend with the memory barrier without any problems. I'll ask
>>> further users to test, but it certainly looks like it is working as
>>> intended.
>>>
>>> And the memory barrier is de-facto missing as the head pointer and
>>> descriptor are accessed through (two separate) coherent mappings so
>>> there are no ordering guarantees without explicit barriers.
>>
>> This situation should occur when there is only one descriptor in the
>> ring. If, as you mentioned, the CPU tries to load the descriptor first,
>> but the descriptor fetch fails before the HP load because the ring
>> returns empty, it won't trigger the current issue.
> 
> It could if the CPU observes the updates out of order due to the missing
> barrier. The driver could be processing an earlier interrupt when the
> new descriptor is added and head pointer updated. If for example the CPU
> speculatively fetches the descriptor before the head pointer is updated,
> then the descriptor length may be zero when the CPU sees the updated
> head pointer.
> 

Sorry, I still think this situation won't happen. Please see the 
following code.

ath11k_hal_srng_access_begin(ab, srng);
   => srng->u.dst_ring.cached_hp = *srng->u.dst_ring.hp_addr;
desc = ath11k_hal_srng_dst_get_next_entry(ab, srng);
   => if (srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp) return NULL;
//dma_rmb();
*nbytes = ath11k_hal_ce_dst_status_get_length(desc);

If the condition 'srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp' is 
true, the descriptor retrieval fails.

> This seems to be what is happening on the X13s since adding the memory
> barrier makes the zero-length descriptors go away.
> 

Hmm, it is indeed a bit strange. Could it be that dma_rmb() introduces 
some delay ?

>> The Copy Engine hardware module copies the metadata to the Status
>> Descriptor after the DMA is complete, then updates the HP to trigger an
>> interrupt. I think there might be some issues in this process, such as
>> the lack of a wmb instruction after the copy is complete, causing the HP
>> to be updated first.
> 
> Yeah, possibly. At least it seems there are more issues than the missing
> barrier on the machines you test.
>   
>>> Now obviously there are further issues in your system, which we should
>>> make sure we understand before adding workarounds to the driver.
>>>
>>> Do you have a pointer to the downstream kernel sources you are testing
>>> with? Or even better, can you reproduce the issue with mainline after
>>> adding the PCIe patches that were posted to the lists for these
>>> platforms?
>>>
>> https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb
> 
> Thanks for the pointer. That's a lot of out-of-tree patches on top of
> stable so not that easy to check the state of the resulting tree.
> 

Yes, but there are only a few patches for ath11k.

>>> Does it make any difference if you use a full rmb() barrier?
>>>
>> I've also tried rmb() and mb(), but they didn't work either.
> 
> Thanks for checking.
> 
> Just to be sure, you did add the barrier in the same place as my patch
> (i.e. just before the descriptor read)?
> 

Yes.

>>> And after modifying ath11k_hal_ce_dst_status_get_length() so that it
>>> does not clear the length, how many times you need to retry? Does it
>>> always work on the second try?
>>
>> Yes, the test has been running continuously for over 48 hours, always
>> work on the second try, updated in patch v4.
> 
> Good, at least the descriptor-length-sometimes-never-updated issue is
> solved.
> 
> Johan

Yeah, thanks for pointing out the issue.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ