Message-ID: <914dfa88-d36c-44c2-a7d6-22f6fbd2b86f@oracle.com>
Date: Wed, 23 Apr 2025 18:49:30 -0700
From: jane.chu@...cle.com
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: logane@...tatee.com, hch@....de, gregkh@...uxfoundation.org,
willy@...radead.org, kch@...dia.com, axboe@...nel.dk,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-pci@...r.kernel.org, linux-nvme@...ts.infradead.org,
linux-block@...r.kernel.org
Subject: Re: Report: Performance regression from ib_umem_get on zone device
pages
On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 12:21:15PM -0700, jane.chu@...cle.com wrote:
>
>> So this looks like a case of CPU cache thrashing, but I don't know how
>> to fix it. Could someone help address the issue? I'd be happy to help
>> verify.
>
> I don't know that we can even really fix it if that is the cause.. But
> it seems suspect, if you are only doing 2M at a time per CPU core then
> that is only 512 struct pages or 32k of data. The GUP process will
> have touched all of that if device-dax is not creating folios. So why
> did it fall out of the cache?
>
> If it is creating folios then maybe we can improve things by
> recovering the folios before adding the pages.
>
> Or is something weird going on like the device-dax is using 1G folios
> and all of these pins and checks are sharing and bouncing the same
> struct page cache lines?
I used ndctl to create 12 device-dax instances with the default 2M
alignment, and mmap'ed the device-dax memory at 2M alignment and in
2M-multiple sizes, which should lead to the default 2MB hugepage mapping.
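To be concrete, each instance is mapped roughly along these lines
(simplified from the test harness; the device path, MAP_FIXED into the
reserved range, and error handling are illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>

    /*
     * Map one device-dax instance at a 2M-aligned address inside the
     * pre-reserved range.  Device-dax requires MAP_SHARED, and the
     * offset/length must be multiples of the device alignment (2M here).
     */
    static void *map_devdax(const char *path, void *fixed_addr, size_t len)
    {
            int fd = open(path, O_RDWR);    /* e.g. "/dev/dax0.0" */

            return mmap(fixed_addr, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
    }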
>
> Can the device-dax implement memfd_pin_folios()?
Could you elaborate? Or perhaps Dan Williams could comment?
>
>> The flow of a single test run:
>> 1. reserve virtual address space for (61440 * 2MB) via mmap with PROT_NONE
>> and MAP_ANONYMOUS | MAP_NORESERVE| MAP_PRIVATE
>> 2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax to the
>> reserved virtual address space sequentially to form a continual VA
>> space
>
> Like is there any chance that each of these 61440 VMA's is a single
> 2MB folio from device-dax, or could it be?
That's 61440 MRs of 2MB each, backed by the 12 device-dax instances.
The test process mmaps them into its pre-reserved VMA, so the entire VMA
range is 61440 * 2MB = 122880MB, or about 31 million 4K pages.
When it comes to MR registration via ibv_reg_mr(), there will be about
31 million ->pgmap dereferences from "a->pgmap == b->pgmap"; given the
small L1 D-cache, that is how I see the cache thrashing happening.
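For reference, the reservation and registration steps boil down to
something like this (simplified; the 2MB MR size and MR count match the
test, while the access flags and error handling are illustrative):

    #include <infiniband/verbs.h>
    #include <sys/mman.h>

    #define MR_SIZE  (2UL << 20)                /* 2MB per MR */
    #define NR_MRS   61440UL
    #define TOTAL    (NR_MRS * MR_SIZE)         /* 122880MB of VA */

    static struct ibv_mr *mrs[NR_MRS];

    static void reserve_and_register(struct ibv_pd *pd)
    {
            /* Step 1: reserve the whole VA range up front. */
            char *base = mmap(NULL, TOTAL, PROT_NONE,
                              MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE,
                              -1, 0);

            /*
             * Step 2: each device-dax instance is mmap'ed over a slice
             * of this range (see the earlier sketch), then every 2MB
             * piece is registered as its own MR.
             */
            for (unsigned long i = 0; i < NR_MRS; i++)
                    mrs[i] = ibv_reg_mr(pd, base + i * MR_SIZE, MR_SIZE,
                                        IBV_ACCESS_LOCAL_WRITE);
    }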
>
> IIRC device-dax could not use folios until 6.15 so I'm assuming it is
> not folios even if it is a pmd mapping?
Probably not, there has been very little change to device-dax, but Dan
can correct me.
In theory, the problem could be observed with any kind of zone device
pages backing the MRs. Have you seen anything like this?
thanks,
-jane
>
> Jason
>