lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z90yyrZcORhJJgNU@hovoldconsulting.com>
Date: Fri, 21 Mar 2025 10:35:06 +0100
From: Johan Hovold <johan@...nel.org>
To: Miaoqing Pan <quic_miaoqing@...cinc.com>
Cc: Jeff Johnson <jeff.johnson@....qualcomm.com>,
	ath11k@...ts.infradead.org, linux-wireless@...r.kernel.org,
	linux-kernel@...r.kernel.org, johan+linaro@...nel.org
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
 length

On Wed, Mar 19, 2025 at 02:47:12PM +0800, Miaoqing Pan wrote:
> On 3/19/2025 1:42 AM, Johan Hovold wrote:

> > It could if the CPU observes the updates out of order due to the missing
> > barrier. The driver could be processing an earlier interrupt when the
> > new descriptor is added and head pointer updated. If for example the CPU
> > speculatively fetches the descriptor before the head pointer is updated,
> > then the descriptor length may be zero when the CPU sees the updated
> > head pointer.
> 
> Sorry, I still think this situation won't happen. Please see the 
> following code.
> 
> ath11k_hal_srng_access_begin(ab, srng);
>    => srng->u.dst_ring.cached_hp = *srng->u.dst_ring.hp_addr;
> desc = ath11k_hal_srng_dst_get_next_entry(ab, srng);
>    => if (srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp) return NULL;
> //dma_rmb();
> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> 
> If the condition 'srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp' is 
> true, the descriptor retrieval fails.

The CPU can still speculate that this condition will be false and load
the descriptor.

If the speculation later turns out to be correct, then the descriptor
may have stale values from before the head pointer was updated.

> > This seems to be what is happening on the X13s since adding the memory
> > barrier makes the zero-length descriptors go away.
> 
> Hmm, it is indeed a bit strange. Could it be that dma_rmb() introduces 
> some delay ?

It's only expected since you must use memory barriers on weakly ordered
architectures like aarch64 to guarantee the ordering.
 
> >> The Copy Engine hardware module copies the metadata to the Status
> >> Descriptor after the DMA is complete, then updates the HP to trigger an
> >> interrupt. I think there might be some issues in this process, such as
> >> the lack of a wmb instruction after the copy is complete, causing the HP
> >> to be updated first.
> > 
> > Yeah, possibly. At least it seems there are more issues than the missing
> > barrier on the machines you test.
> >   
> >>> Now obviously there are further issues in your system, which we should
> >>> make sure we understand before adding workarounds to the driver.
> >>>
> >>> Do you have a pointer to the downstream kernel sources you are testing
> >>> with? Or even better, can you reproduce the issue with mainline after
> >>> adding the PCIe patches that were posted to the lists for these
> >>> platforms?
> >>>
> >> https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb
> > 
> > Thanks for the pointer. That's a lot of out-of-tree patches on top of
> > stable so not that easy to check the state of the resulting tree.
> 
> Yes, but there are only a few patches for ath11k.

Sure, but there are other components that come into play here such as
the PCIe controller driver.

A colleague of yours recently submitted an updated patch that overrides
the no_snoop bit for qcs8300:

	https://lore.kernel.org/lkml/20250318053836.tievnd5ohzl7bmox@thinkpad/

but that flag appears not to be set in your downstream tree:

	https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base-6.6/drivers/qcs8300/0004-PCI-qcom-Add-QCS8300-PCIe-support.patch

Something like that may prevent a cached descriptor from being
invalidated when the controller updates it.

Similarly, the PCIe controllers are marked as dma-coherent in your
devicetrees. A misconfiguration there could also cause problems.

I suggest we merge my fix that adds the missing memory barrier, and
which users have now been testing for a week without hitting the
corruption (which they used to see several times a day).

Then we can continue to track down why you are having coherency issues
on qcs615 and qcs8300. You really want to make sure that that is fixed
properly as it may lead to subtle bugs elsewhere too.

Johan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ