linux-kernel - Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting


Date:	Thu, 11 Feb 2016 10:20:44 -0800
From:	Tim Chen <tim.c.chen@...ux.intel.com>
To:	Andrey Ryabinin <aryabinin@...tuozzo.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
	Mel Gorman <mgorman@...hsingularity.net>,
	Vladimir Davydov <vdavydov@...tuozzo.com>,
	Konstantin Khlebnikov <koct9i@...il.com>,
	Dave Hansen <dave@...1.net>
Subject: Re: [RFC PATCH 3/3] mm: increase scalability of global memory
 commitment accounting

On Thu, 2016-02-11 at 16:54 +0300, Andrey Ryabinin wrote:
> 
> On 02/11/2016 03:24 AM, Tim Chen wrote:
> > On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:
> > 
> >>
> >> If a process is unmapping 4MB then it's pretty crazy for us to be
> >> hitting the percpu_counter 32 separate times for that single operation.
> >>
> >> Is there some way in which we can batch up the modifications within the
> >> caller and update the counter less frequently?  Perhaps even in a
> >> single hit?
> > 
> > I think the problem is the batch size is too small and we overflow
> > the local counter into the global counter for 4M allocations.
> > The reason for the small batch size was because we use
> > percpu_counter_read_positive in __vm_enough_memory and it is not precise
> > and the error could grow with large batch size.
> > 
> > Let's switch to the precise __percpu_counter_compare that is 
> > unaffected by batch size.  It will do precise comparison and only add up
> > the local per cpu counters when the global count is not precise
> > enough.  
> > 
> 
> I'm not certain about this. for_each_online_cpu() under spinlock somewhat doubtful.
> And if we are close to limit we will be hitting slowpath all the time.
> 

Yes, it is a trade-off between faster allocation in the general case and
being on the slow path when we are within 3% of the memory limit. When
we are that close to the memory limit, I think page reclaim probably
takes more time, so this slow path would be a secondary effect.  Still,
it will be better than the original proposal, which strictly uses per-cpu
variables, since there we would need to sum the variables up all the time.
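
To make the trade-off concrete, here is a rough userspace model of a
batched counter with a precise compare. It is only a sketch, not the
kernel code: the names (pc_model, pc_add, pc_sum, pc_compare), the fixed
NR_CPUS and the slow_sums statistic are made up for illustration, and
there is no locking. The compare only walks the per-cpu deltas when the
global count is within batch * NR_CPUS of the limit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS 4 /* hypothetical fixed CPU count for this model */

struct pc_model {
	int64_t count;            /* global count, may lag behind */
	int32_t local[NR_CPUS];   /* per-cpu deltas not yet folded in */
	int32_t batch;            /* fold threshold per cpu */
	unsigned long slow_sums;  /* how often the slow path ran */
};

static void pc_init(struct pc_model *c, int32_t batch)
{
	assert(batch > 0);
	c->count = 0;
	c->batch = batch;
	c->slow_sums = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		c->local[cpu] = 0;
}

/* Add to the local delta; fold into the global count once it reaches batch. */
static void pc_add(struct pc_model *c, int cpu, int64_t amount)
{
	int64_t v = c->local[cpu] + amount;

	if (v >= c->batch || v <= -c->batch) {
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = (int32_t)v;
	}
}

/* Slow path: precise value = global count plus every per-cpu delta. */
static int64_t pc_sum(struct pc_model *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	c->slow_sums++;
	return sum;
}

/* Returns -1, 0 or 1 for counter <, ==, > rhs. */
static int pc_compare(struct pc_model *c, int64_t rhs)
{
	int64_t count = c->count;

	/* Far from rhs: the worst-case error cannot change the answer. */
	if (llabs(count - rhs) > (int64_t)c->batch * NR_CPUS)
		return count > rhs ? 1 : -1;

	/* Close to rhs: pay for the precise sum. */
	count = pc_sum(c);
	if (count > rhs)
		return 1;
	if (count < rhs)
		return -1;
	return 0;
}
```

With a larger batch, pc_add touches the global count less often, but the
window in which pc_compare has to take the slow path grows with it.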

The brk1 test is also somewhat pathological.  It does nothing but brk,
which is unlikely for a real workload.  So we have to be careful when
we are tuning our system behavior for brk1 throughput. We'll need to
make sure whatever changes we make don't adversely impact other, more
useful workloads.

Tim