lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z9gd9Aw5Bug8IKSV@hovoldconsulting.com>
Date: Mon, 17 Mar 2025 14:04:52 +0100
From: Johan Hovold <johan@...nel.org>
To: Miaoqing Pan <quic_miaoqing@...cinc.com>
Cc: Jeff Johnson <jeff.johnson@....qualcomm.com>,
	ath11k@...ts.infradead.org, linux-wireless@...r.kernel.org,
	linux-kernel@...r.kernel.org, johan+linaro@...nel.org
Subject: Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient
 length

On Mon, Mar 17, 2025 at 01:52:15PM +0800, Miaoqing Pan wrote:
> On 3/14/2025 4:06 PM, Johan Hovold wrote:
> > On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote:
 
> >> I think the hardware has already ensured synchronization between
> >> descriptor and head pointer, which isn't difficult to achieve. The issue
> >> is likely caused by something else and requires further debugging.
> > 
> > Yeah, but you still need memory barriers on the kernel side.
> > 
> > It could be that we are looking at two different causes for those
> > zero-length descriptors.
> > 
> > The error handling for that obviously needs to be fixed either way, but
> > I haven't heard anyone hitting the corruption with the memory barriers
> > in place on the X13s yet (even if we'd need some more time to test
> > this).

> After multiple and prolonged verifications, adding dma_rmb() did not 
> improve the issue at all. I think this Status Descriptor is updated by 
> hardware (Copy Engine) controlled by another system, not involving DMA 
> or out-of-order CPU access within the same system, so memory barriers do 
> not take effect.

Then it seems we are looking at two separate root causes for the
corruption as the memory barrier appears to be all that is needed to fix
the X13s issue.

A user who hit the corruption after 2 h without the fix has been running
over the weekend with the memory barrier without any problems. I'll ask
further users to test, but it certainly looks like it is working as
intended.

And the memory barrier is de-facto missing as the head pointer and
descriptor are accessed through (two separate) coherent mappings so
there are no ordering guarantees without explicit barriers.

Now obviously there are further issues in your system, which we should
make sure we understand before adding workarounds to the driver.

Do you have a pointer to the downstream kernel sources you are testing
with? Or even better, can you reproduce the issue with mainline after
adding the PCIe patches that were posted to the lists for these
platforms?

Apparently the descriptors can also be passed in non-coherent memory
for some devices (.alloc_cacheable_memory / HAL_SRNG_FLAGS_CACHED). That
implementation looks suspicious and could possibly result in similar
problems. Are you using .alloc_cacheable_memory in your setup?

Does it make any difference if you use a full rmb() barrier?

And after modifying ath11k_hal_ce_dst_status_get_length() so that it
does not clear the length, how many times you need to retry? Does it
always work on the second try?

Johan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ