lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z9mwn3GzpPPZSiTG@hovoldconsulting.com>
Date: Tue, 18 Mar 2025 18:42:55 +0100
From: Johan Hovold <johan@...nel.org>
To: Miaoqing Pan <quic_miaoqing@...cinc.com>
Cc: Jeff Johnson <jeff.johnson@....qualcomm.com>,
	ath11k@...ts.infradead.org, linux-wireless@...r.kernel.org,
	linux-kernel@...r.kernel.org, johan+linaro@...nel.org
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
 length

On Tue, Mar 18, 2025 at 03:53:39PM +0800, Miaoqing Pan wrote:
> On 3/17/2025 9:04 PM, Johan Hovold wrote:

> > Then it seems we are looking at two separate root causes for the
> > corruption as the memory barrier appears to be all that is needed to fix
> > the X13s issue.
> > 
> > A user who hit the corruption after 2 h without the fix has been running
> > over the weekend with the memory barrier without any problems. I'll ask
> > further users to test, but it certainly looks like it is working as
> > intended.
> > 
> > And the memory barrier is de-facto missing as the head pointer and
> > descriptor are accessed through (two separate) coherent mappings so
> > there are no ordering guarantees without explicit barriers.
> 
> This situation should occur when there is only one descriptor in the 
> ring. If, as you mentioned, the CPU tries to load the descriptor first, 
> but the descriptor fetch fails before the HP load because the ring 
> returns empty, it won't trigger the current issue.

It could if the CPU observes the updates out of order due to the missing
barrier. The driver could be processing an earlier interrupt when the
new descriptor is added and head pointer updated. If for example the CPU
speculatively fetches the descriptor before the head pointer is updated,
then the descriptor length may be zero when the CPU sees the updated
head pointer.

This seems to be what is happening on the X13s since adding the memory
barrier makes the zero-length descriptors go away.
 
> The Copy Engine hardware module copies the metadata to the Status 
> Descriptor after the DMA is complete, then updates the HP to trigger an 
> interrupt. I think there might be some issues in this process, such as 
> the lack of a wmb instruction after the copy is complete, causing the HP 
> to be updated first.

Yeah, possibly. At least it seems there are more issues than the missing
barrier on the machines you test.
 
> > Now obviously there are further issues in your system, which we should
> > make sure we understand before adding workarounds to the driver.
> > 
> > Do you have a pointer to the downstream kernel sources you are testing
> > with? Or even better, can you reproduce the issue with mainline after
> > adding the PCIe patches that were posted to the lists for these
> > platforms?
> > 
> https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb

Thanks for the pointer. That's a lot of out-of-tree patches on top of
stable so not that easy to check the state of the resulting tree.

> > Does it make any difference if you use a full rmb() barrier?
> > 
> I've also tried rmb() and mb(), but they didn't work either.

Thanks for checking.

Just to be sure, you did add the barrier in the same place as my patch
(i.e. just before the descriptor read)?

> > And after modifying ath11k_hal_ce_dst_status_get_length() so that it
> > does not clear the length, how many times you need to retry? Does it
> > always work on the second try?
> 
> Yes, the test has been running continuously for over 48 hours, always 
> work on the second try, updated in patch v4.

Good, at least the descriptor-length-sometimes-never-updated issue is
solved.

Johan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ