linux-kernel - Re: Excessive page cache occupies DMA32 memory

Message-ID: <91fc0c41-6d25-4f60-9de3-23d440fc8e00@collabora.com>
Date: Tue, 22 Jul 2025 11:05:11 +0500
From: Muhammad Usama Anjum <usama.anjum@...labora.com>
To: Greg KH <gregkh@...uxfoundation.org>, Matthew Wilcox
 <willy@...radead.org>, Baochen Qiang <baochen.qiang@....qualcomm.com>,
 Jeff Hugo <jeff.hugo@....qualcomm.com>,
 Manivannan Sadhasivam <mani@...nel.org>, Jeff Johnson <jjohnson@...nel.org>,
 Marek Szyprowski <m.szyprowski@...sung.com>
Cc: linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, kernel@...labora.com,
 Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 iommu@...ts.linux.dev, Robin Murphy <robin.murphy@....com>
Subject: Re: Excessive page cache occupies DMA32 memory

Adding ath/mhi and dma API developers to the discussion.

On 7/22/25 10:32 AM, Greg KH wrote:
> On Mon, Jul 21, 2025 at 06:13:10PM +0100, Matthew Wilcox wrote:
>> On Mon, Jul 21, 2025 at 08:03:12PM +0500, Muhammad Usama Anjum wrote:
>>> Hello,
>>>
>>> When 10-12GB out of total 16GB RAM is being used as page cache
>>> (active_file + inactive_file) at suspend time, the drivers fail to allocate
>>> dma memory at resume as dma memory is either occupied by the page cache or
>>> fragmented. Example:
>>>
>>> kworker/u33:5: page allocation failure: order:7, mode:0xc04(GFP_NOIO|GFP_DMA32), nodemask=(null),cpuset=/,mems_allowed=0
>>
>> Just to be clear, this is not a page cache problem.  The driver is asking
>> us to do a 512kB allocation without doing I/O!  This is a ridiculous
>> request that should be expected to fail.
>>
>> The solution, whatever it may be, is not related to the page cache.
>> I reject your diagnosis.  Almost all of the page cache is clean and
>> could be dropped (as far as I can tell from the output below).
>>
>> Now, I'm not too familiar with how the page allocator chooses to fail
>> this request.  Maybe it should be trying harder to drop bits of the page
>> cache.  Maybe it should be doing some compaction. 
That's very thoughtful. I'll look into why the page allocator isn't dropping
cache or doing compaction.

>> I am not inclined to
>> go digging on your behalf, because frankly I'm offended by the suggestion
>> that the page cache is at fault.
I apologize—that wasn't my intention.

>>
>> Perhaps somebody else will help you, or you can dig into this yourself.
> 
> I'm with Matthew, this really looks like a driver bug somehow.  If there
> is page cache memory that is "clean", the driver should be able to
> access it just fine if really required.
> 
> What exact driver(s) is having this problem?  What is the exact error,
> and on what lines of code?
The issue occurs in both the ath11k and mhi drivers during resume, when
dma_alloc_coherent(GFP_KERNEL) fails, leading to -ENOMEM. This failure has
been observed at multiple points in these drivers.

For example, in the mhi driver, the failure is triggered when MHI's
st_worker gets scheduled in at resume.

mhi_pm_st_worker()
-> mhi_fw_load_handler()
   -> mhi_load_image_bhi()
      -> mhi_alloc_bhi_buffer()
         -> dma_alloc_coherent(GFP_KERNEL) returns -ENOMEM
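
For reference, the numbers in the allocation failure line quoted earlier
(order:7, mode:0xc04) can be worked out with a small sketch. It assumes
4 KiB pages, and the flag bit values in the table are assumptions taken
from recent kernels that may differ on other versions; the helper names
are hypothetical and only for illustration.

```python
# Sketch only: decode "order:7, mode:0xc04" from the failure log.
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Assumed bit values; they can change between kernel versions.
GFP_FLAG_BITS = {
    0x004: "__GFP_DMA32",
    0x040: "__GFP_IO",
    0x080: "__GFP_FS",
    0x400: "__GFP_DIRECT_RECLAIM",
    0x800: "__GFP_KSWAPD_RECLAIM",
}


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of a physically contiguous allocation of this order."""
    return (1 << order) * page_size


def decode_gfp(mode):
    """Names of the known flag bits that are set in a gfp mask."""
    return sorted(name for bit, name in GFP_FLAG_BITS.items() if mode & bit)


if __name__ == "__main__":
    print(order_to_bytes(7) // 1024, "KiB")  # 128 pages * 4 KiB
    print(decode_gfp(0xC04))
```

Under these assumptions, order 7 is 128 contiguous pages (512 KiB), and
0xc04 has the reclaim and DMA32 bits set but neither __GFP_IO nor __GFP_FS,
which matches the GFP_NOIO|GFP_DMA32 shown in the log.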


Thank you,
- Usama