Message-ID: <20150225000809.GA6468@blaptop>
Date:	Wed, 25 Feb 2015 09:08:09 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Mel Gorman <mgorman@...e.de>, Shaohua Li <shli@...nel.org>,
	Yalin.Wang@...ymobile.com
Subject: Re: [PATCH RFC 1/4] mm: throttle MADV_FREE

Hi Michal,

On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > Recently, Shaohua reported that MADV_FREE is much slower than
> > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > applications stalled in direct reclaim because kswapd's reclaim
> > speed is not faster than the applications' allocation speed,
> > which causes a lot of stalls and lock contention.
> 
> I am not sure I understand this correctly. So the issue is that there is
> a huge number of MADV_FREE pages on the LRU and they are not close to the
> tail of the list, so reclaim has to do a lot of work before it starts
> dropping them?

No, Shaohua already tested deactivating the hinted pages to the head/tail
of the inactive anon LRU, and he said it didn't solve his problem.
I thought the main culprit was the scanning/rotating/throttling in the
direct reclaim path.

> 
> > This patch throttles MADV_FREE so that it takes effect only when
> > the system has enough free pages that background or direct reclaim
> > will not be triggered. Otherwise, MADV_FREE falls back to
> > MADV_DONTNEED, because there is no point in delaying the freeing
> > if we know the system is under memory pressure.
> 
> Hmm, this is still conforming to the documentation because the kernel is
> free to free pages at its convenience. I am not sure this is a good
> idea, though. Why should some MADV_FREE calls be treated differently?

It's a hint for the VM to free pages, so I think it's okay to free them
immediately sometimes if that avoids something worse, like a system stall.
IOW, madvise is just a hint, not a strict rule.

> Wouldn't that lead to hard-to-predict behavior? E.g. LIFO reused blocks
> would work without long stalls most of the time - except when there is
> memory pressure.

True.

> 
> Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> two calls is different.

I agree it's not an apples-to-apples comparison.

Actually, MADV_FREE moves the cost from the hot path (i.e., the system call
path) to the slow path (i.e., reclaim context), so it will be slower under
continuous heavy memory pressure because of the high overhead of freeing
pages in reclaim context. So it would be good if the kernel detected that
situation and prevented it. This patch aims to do that.
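
To make the intended policy concrete, here is a minimal userspace model of
the decision the patch makes. It is only a sketch: the enum and function
names are hypothetical, and the three inputs stand in for what the kernel
reads via get_nr_swap_pages(), nr_free_pages() and the computed threshold.

```c
/* Hypothetical names: a userspace model of the check, not kernel code. */
enum madv_action {
	ACTION_LAZY_FREE,	/* defer the freeing to reclaim (MADV_FREE) */
	ACTION_FREE_NOW		/* free immediately (MADV_DONTNEED behaviour) */
};

static enum madv_action pick_madv_action(long nr_swap_pages,
					 unsigned long nr_free_pages,
					 unsigned long threshold_pages)
{
	/* No free swap: MADV_FREE already behaves like MADV_DONTNEED. */
	if (nr_swap_pages <= 0)
		return ACTION_FREE_NOW;

	/* Plenty of free memory: lazy freeing is cheap, so allow it. */
	if (nr_free_pages > threshold_pages)
		return ACTION_LAZY_FREE;

	/* At or below the threshold: avoid pushing work into reclaim. */
	return ACTION_FREE_NOW;
}
```

The point of the model is that the cost only moves to reclaim context when
reclaim is unlikely to run soon anyway.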

> 
> > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > test: 12 processes
> > 
> > loop = 5;
> > mmap(512M);
> 
> Who is eating the rest of the memory?

As I wrote, there are 12 processes, each running the test below.
IOW, 512M * 12 = 6G, but system RAM is just 3G.
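
For reference, the test quoted in this part of the thread could be written
roughly as follows. This is a sketch, not the program that was actually
run: the function names, the fill byte passed to memset() and the error
handling are assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One worker: map `len` bytes, then dirty and release them `loops` times. */
static int run_worker(size_t len, int loops, int advice)
{
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1;

	while (loops--) {
		memset(buf, 1, len);		/* touch every page */
		if (madvise(buf, len, advice)) {	/* MADV_FREE or MADV_DONTNEED */
			munmap(buf, len);
			return -1;
		}
	}
	return munmap(buf, len);
}

/* Fork `nproc` workers and wait; returns 0 only if every worker succeeded. */
static int run_test(int nproc, size_t len, int loops, int advice)
{
	int started = 0, failed = 0;

	for (int i = 0; i < nproc; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			failed = 1;
			break;
		}
		if (pid == 0)
			_exit(run_worker(len, loops, advice) == 0 ? 0 : 1);
		started++;
	}

	while (started-- > 0) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed = 1;
	}
	return failed ? -1 : 0;
}
```

With the parameters from the thread this would be called as
run_test(12, 512UL << 20, 5, advice), with advice set to MADV_DONTNEED or,
on a kernel and libc that provide it, MADV_FREE.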

> 
> > while (loop--) {
> > 	memset(512M);
> > 	madvise(MADV_FREE or MADV_DONTNEED);
> > }
> > 
> > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > 3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed
> > 
> > It's clearly a win.
> > 
> > Reported-by: Shaohua Li <shli@...nel.org>
> > Signed-off-by: Minchan Kim <minchan@...nel.org>
> 
> I don't know. This looks like a hack with hard-to-predict consequences
> which might trigger pathological corner cases.

Yep, it might be. That's why I tagged it as RFC, in the hope that others
will suggest a better idea.

> 
> > ---
> >  mm/madvise.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8921c2..81bb26ecf064 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		 * XXX: In this implementation, MADV_FREE works like
> >  		 * MADV_DONTNEED on swapless system or full swap.
> >  		 */
> > -		if (get_nr_swap_pages() > 0)
> > -			return madvise_free(vma, prev, start, end);
> > +		if (get_nr_swap_pages() > 0) {
> > +			unsigned long threshold;
> > +			/*
> > +			 * If we have trouble with memory pressure (i.e.,
> > +			 * under high watermark), free pages instantly.
> > +			 */
> > +			threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> > +			threshold = threshold + (threshold >> 1);
> 
> Why threshold += threshold >> 1 ?

I wanted this logic to trigger when free pages are below the high watermark.
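
To spell out the arithmetic in the two threshold lines: min_free_kbytes is
in KiB, so shifting right by (PAGE_SHIFT - 10) converts it to pages, and
adding half of that again gives 1.5 times the minimum. The sketch below
models this in userspace. The function name is hypothetical, and reading
the result as "the high watermark" assumes the high watermark is min plus
half of min, which is only an approximation of the per-zone values.

```c
#include <assert.h>

/*
 * Userspace model of the threshold arithmetic in the patch.
 * A page is (1 << page_shift) bytes, i.e. (1 << (page_shift - 10)) KiB.
 */
static unsigned long madv_free_threshold(unsigned long min_free_kbytes,
					 unsigned int page_shift)
{
	unsigned long threshold;

	assert(page_shift >= 10);	/* pages smaller than 1 KiB make no sense here */

	threshold = min_free_kbytes >> (page_shift - 10);	/* KiB -> pages */
	return threshold + (threshold >> 1);			/* min + min/2 */
}
```

As an example with made-up values, min_free_kbytes = 4096 and 4 KiB pages
(page_shift = 12) give 1024 pages, so the threshold is 1024 + 512 = 1536
pages, i.e. 6 MiB of free memory.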

> 
> > +			if (nr_free_pages() > threshold)
> > +				return madvise_free(vma, prev, start, end);
> > +		}
> >  		/* passthrough */
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed(vma, prev, start, end);
> > -- 
> > 1.9.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Kind regards,
Minchan Kim