linux-kernel - Re: [PATCH] mm: vmscan: fix not scanning anonymous pages when detecting file refaults

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190628065138.GA251482@google.com>
Date:   Fri, 28 Jun 2019 15:51:38 +0900
From:   Minchan Kim <minchan@...nel.org>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Kuo-Hsin Yang <vovoy@...omium.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>,
        Sonny Rao <sonnyrao@...omium.org>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH] mm: vmscan: fix not scanning anonymous pages when
 detecting file refaults

Hi Johannes,

On Thu, Jun 27, 2019 at 02:41:23PM -0400, Johannes Weiner wrote:
> On Wed, Jun 19, 2019 at 04:08:35PM +0800, Kuo-Hsin Yang wrote:
> > When file refaults are detected and there are many inactive file pages,
> > the system never reclaim anonymous pages, the file pages are dropped
> > aggressively when there are still a lot of cold anonymous pages and
> > system thrashes. This issue impacts the performance of applications with
> > large executable, e.g. chrome.
> > 
> > When file refaults are detected. inactive_list_is_low() may return
> > different values depends on the actual_reclaim parameter, the following
> > 2 conditions could be satisfied at the same time.
> > 
> > 1) inactive_list_is_low() returns false in get_scan_count() to trigger
> >    scanning file lists only.
> > 2) inactive_list_is_low() returns true in shrink_list() to allow
> >    scanning active file list.
> > 
> > In that case vmscan would only scan file lists, and as active file list
> > is also scanned, inactive_list_is_low() may keep returning false in
> > get_scan_count() until file cache is very low.
> > 
> > Before commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in
> > cache workingset transition"), inactive_list_is_low() never returns
> > different value in get_scan_count() and shrink_list() in one
> > shrink_node_memcg() run. The original design should be that when
> > inactive_list_is_low() returns false for file lists, vmscan only scan
> > inactive file list. As only inactive file list is scanned,
> > inactive_list_is_low() would soon return true.
> > 
> > This patch makes the return value of inactive_list_is_low() independent
> > of actual_reclaim.
> > 
> > The problem can be reproduced by the following test program.
> > 
> > ---8<---
> > void fallocate_file(const char *filename, off_t size)
> > {
> > 	struct stat st;
> > 	int fd;
> > 
> > 	if (!stat(filename, &st) && st.st_size >= size)
> > 		return;
> > 
> > 	fd = open(filename, O_WRONLY | O_CREAT, 0600);
> > 	if (fd < 0) {
> > 		perror("create file");
> > 		exit(1);
> > 	}
> > 	if (posix_fallocate(fd, 0, size)) {
> > 		perror("fallocate");
> > 		exit(1);
> > 	}
> > 	close(fd);
> > }
> > 
> > long *alloc_anon(long size)
> > {
> > 	long *start = malloc(size);
> > 	memset(start, 1, size);
> > 	return start;
> > }
> > 
> > long access_file(const char *filename, long size, long rounds)
> > {
> > 	int fd, i;
> > 	volatile char *start1, *end1, *start2;
> > 	const int page_size = getpagesize();
> > 	long sum = 0;
> > 
> > 	fd = open(filename, O_RDONLY);
> > 	if (fd == -1) {
> > 		perror("open");
> > 		exit(1);
> > 	}
> > 
> > 	/*
> > 	 * Some applications, e.g. chrome, use a lot of executable file
> > 	 * pages, map some of the pages with PROT_EXEC flag to simulate
> > 	 * the behavior.
> > 	 */
> > 	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
> > 		      fd, 0);
> > 	if (start1 == MAP_FAILED) {
> > 		perror("mmap");
> > 		exit(1);
> > 	}
> > 	end1 = start1 + size / 2;
> > 
> > 	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
> > 	if (start2 == MAP_FAILED) {
> > 		perror("mmap");
> > 		exit(1);
> > 	}
> > 
> > 	for (i = 0; i < rounds; ++i) {
> > 		struct timeval before, after;
> > 		volatile char *ptr1 = start1, *ptr2 = start2;
> > 		gettimeofday(&before, NULL);
> > 		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
> > 			sum += *ptr1 + *ptr2;
> > 		gettimeofday(&after, NULL);
> > 		printf("File access time, round %d: %f (sec)\n", i,
> > 		       (after.tv_sec - before.tv_sec) +
> > 		       (after.tv_usec - before.tv_usec) / 1000000.0);
> > 	}
> > 	return sum;
> > }
> > 
> > int main(int argc, char *argv[])
> > {
> > 	const long MB = 1024 * 1024;
> > 	long anon_mb, file_mb, file_rounds;
> > 	const char filename[] = "large";
> > 	long *ret1;
> > 	long ret2;
> > 
> > 	if (argc != 4) {
> > 		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
> > 		exit(0);
> > 	}
> > 	anon_mb = atoi(argv[1]);
> > 	file_mb = atoi(argv[2]);
> > 	file_rounds = atoi(argv[3]);
> > 
> > 	fallocate_file(filename, file_mb * MB);
> > 	printf("Allocate %ld MB anonymous pages\n", anon_mb);
> > 	ret1 = alloc_anon(anon_mb * MB);
> > 	printf("Access %ld MB file pages\n", file_mb);
> > 	ret2 = access_file(filename, file_mb * MB, file_rounds);
> > 	printf("Print result to prevent optimization: %ld\n",
> > 	       *ret1 + ret2);
> > 	return 0;
> > }
> > ---8<---
> > 
> > Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the
> > program fills ram with 2048 MB memory, access a 200 MB file for 10
> > times. Without this patch, the file cache is dropped aggresively and
> > every access to the file is from disk.
> > 
> >   $ ./thrash 2048 200 10
> >   Allocate 2048 MB anonymous pages
> >   Access 200 MB file pages
> >   File access time, round 0: 2.489316 (sec)
> >   File access time, round 1: 2.581277 (sec)
> >   File access time, round 2: 2.487624 (sec)
> >   File access time, round 3: 2.449100 (sec)
> >   File access time, round 4: 2.420423 (sec)
> >   File access time, round 5: 2.343411 (sec)
> >   File access time, round 6: 2.454833 (sec)
> >   File access time, round 7: 2.483398 (sec)
> >   File access time, round 8: 2.572701 (sec)
> >   File access time, round 9: 2.493014 (sec)
> > 
> > With this patch, these file pages can be cached.
> > 
> >   $ ./thrash 2048 200 10
> >   Allocate 2048 MB anonymous pages
> >   Access 200 MB file pages
> >   File access time, round 0: 2.475189 (sec)
> >   File access time, round 1: 2.440777 (sec)
> >   File access time, round 2: 2.411671 (sec)
> >   File access time, round 3: 1.955267 (sec)
> >   File access time, round 4: 0.029924 (sec)
> >   File access time, round 5: 0.000808 (sec)
> >   File access time, round 6: 0.000771 (sec)
> >   File access time, round 7: 0.000746 (sec)
> >   File access time, round 8: 0.000738 (sec)
> >   File access time, round 9: 0.000747 (sec)
> > 
> > Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> > Signed-off-by: Kuo-Hsin Yang <vovoy@...omium.org>
> 
> Acked-by: Johannes Weiner <hannes@...xchg.org>
> 
> Your change makes sense - we should indeed not force cache trimming
> only while the page cache is experiencing refaults.
> 
> I can't say I fully understand the changelog, though. The problem of

I guess the point of the patch is "actual_reclaim" paramter made divergency
to balance file vs. anon LRU in get_scan_count. Thus, it ends up scanning
file LRU active/inactive list at file thrashing state.

So, Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
would make sense to me since it introduces the parameter.

> forcing cache trimming while there is enough page cache is older than
> the commit you refer to. It could be argued that this commit is
> incomplete - it could have added refault detection not just to
> inactive:active file balancing, but also the file:anon balancing; but
> it didn't *cause* this problem.
> 
> Shouldn't this be
> 
> Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
> Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")

That would affect, too but it would be trouble to have stable backport
since we don't have refault machinery in there.