linux-kernel - Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

Message-ID: <ZdjAZQRVmP9gnfsJ@MiWiFi-R3L-srv>
Date: Fri, 23 Feb 2024 23:57:25 +0800
From: Baoquan He <bhe@...hat.com>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Pedro Falcato <pedro.falcato@...il.com>,
	Matthew Wilcox <willy@...radead.org>, Mel Gorman <mgorman@...e.de>,
	kirill.shutemov@...ux.intel.com,
	Vishal Moola <vishal.moola@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Lorenzo Stoakes <lstoakes@...il.com>,
	Christoph Hellwig <hch@...radead.org>,
	"Liam R . Howlett" <Liam.Howlett@...cle.com>,
	Dave Chinner <david@...morbit.com>,
	"Paul E . McKenney" <paulmck@...nel.org>,
	Joel Fernandes <joel@...lfernandes.org>,
	Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
	linux-mm@...ck.org
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > Hi,
> > > > 
> > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@...il.com> wrote:
> > > > >
> > > > > Hello, Folk!
> > > > >
> > > > >[...]
> > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > > increasing number of workers. Running same number of jobs on a next run
> > > > > does not increase it and stays on same level as on previous.
> > > > >
> > > > > /**
> > > > >  * pagetable_alloc - Allocate pagetables
> > > > >  * @gfp:    GFP flags
> > > > >  * @order:  desired pagetable order
> > > > >  *
> > > > >  * pagetable_alloc allocates memory for page tables as well as a page table
> > > > >  * descriptor to describe that memory.
> > > > >  *
> > > > >  * Return: The ptdesc describing the allocated page tables.
> > > > >  */
> > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > {
> > > > >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > >
> > > > >         return page_ptdesc(page);
> > > > > }
> > > > >
> > > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > > Is a page-table ever shrink?
> > > > 
> > > > It's my understanding that the vunmap_range helpers don't actively
> > > > free page tables, they just clear PTEs. munmap does free them in
> > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > too.
> > > >
> > > Right. I see that for a user space, pgtables are removed. There was a
> > > work on it.
> > > 
> > > >
> > > > I would not be surprised if the memory increase you're seeing is more
> > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > whole test.
> > > > 
> > > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > > map lot of memory.
> > 
> > The 'nr_threads=256' testing may be too radical. I took the test on
> > a bare metal machine as below, it's still running and hang there after
> > 30 minutes. I did this after system boot. I am looking for other
> > machines with more processors.
> > 
> > [root@...l-r640-068 ~]# nproc 
> > 64
> > [root@...l-r640-068 ~]# free -h
> >                total        used        free      shared  buff/cache   available
> > Mem:           187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
> > Swap:          4.0Gi          0B       4.0Gi
> > [root@...l-r640-068 ~]# 
> > 
> > [root@...l-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > Run the test with following parameters: run_test_mask=127 nr_threads=256
> > 
> Agree, nr_threads=256 is a way radical :) Mine took 50 minutes to
> complete. So wait more :)

Right, mine could take a similar time to finish. I got a machine with
288 cpus, so I will see if I can get some clues. While going through the
code flow, I suddenly realized that drain_vmap_area_work could be the
bottleneck and the cause of the tremendous page table page cost.

On your system, there are 64 cpus, so

nr_lazy_max = lazy_max_pages() = fls(64)*32M = 7*32M = 224M;

So with nr_threads=128 or 256, it is very easy to reach nr_lazy_max and
trigger drain_vmap_work(). When cpu resources are very limited, the lazy
vmap purging will be very slow, while the alloc/free in lib/test_vmalloc.c
goes far faster and more easily than the vmap reclaiming. If the old va is
not reused, new va is allocated and keeps extending, so new page tables
surely need to be created to cover it.
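
To make the nr_lazy_max arithmetic above easy to redo for other cpu counts,
here is a minimal userspace sketch. It assumes lazy_max_pages() is
fls(num_online_cpus()) times 32M worth of pages, and it assumes 4 KiB pages.
The sketch_* names and SKETCH_PAGE_SIZE are made up for illustration and are
not kernel symbols.

```c
#include <assert.h>

/* Assumption: 4 KiB pages; other configurations use a different PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Userspace stand-in for the kernel's fls(): 1-based index of the
 * highest set bit, 0 when no bit is set.
 */
static unsigned int sketch_fls(unsigned int x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/*
 * Lazy-purge threshold in pages for a given number of online cpus,
 * under the assumed formula fls(cpus) * (32M / page size).
 */
static unsigned long sketch_lazy_max_pages(unsigned int online_cpus)
{
	assert(online_cpus > 0);
	return sketch_fls(online_cpus) * (32UL * 1024 * 1024 / SKETCH_PAGE_SIZE);
}

/* Same threshold expressed in MiB, to compare with the 224M figure. */
static unsigned long sketch_lazy_max_mib(unsigned int online_cpus)
{
	return sketch_lazy_max_pages(online_cpus) * SKETCH_PAGE_SIZE / (1024UL * 1024);
}
```

With 64 cpus, sketch_lazy_max_mib(64) gives 224, matching the number above.
The threshold grows only with the log of the cpu count, so it stays small
compared with what hundreds of test threads can free in a short time.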

I will run the test on the system with 288 cpus and will update when
testing is done.