lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20180528100248.991011729@linuxfoundation.org>
Date:   Mon, 28 May 2018 12:00:03 +0200
From:   Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To:     linux-kernel@...r.kernel.org
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        stable@...r.kernel.org, "shidao.ytt" <shidao.ytt@...baba-inc.com>,
        Caspar Zhang <jinli.zjl@...baba-inc.com>,
        Oliver Yang <zhiche.yy@...baba-inc.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Sasha Levin <alexander.levin@...rosoft.com>
Subject: [PATCH 4.9 074/329] mm/fadvise: discard partial page if endbyte is also EOF

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: "shidao.ytt" <shidao.ytt@...baba-inc.com>

[ Upstream commit a7ab400d6fe73d0119fdc234e9982a6f80faea9f ]

During our recent testing with fadvise(FADV_DONTNEED), we find that if
given offset/length is not page-aligned, the last page will not be
discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/),
we map a 10KB-sized file into memory and then try to run this tool to
evict the whole file mapping, but the last single page always remains
staying in the memory:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 2.1e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 1/3  4K/12K  33.3%
         Elapsed: 5.5e-05 seconds

However when we test with an older kernel, say 3.10, this problem is
gone.  So we wonder if this is a regression:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 8.2e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
         Elapsed: 5e-05 seconds

After digging a little bit into this problem, we find it seems not a
regression.  Not discarding partial page is likely to be on purpose
according to commit 441c228f817f ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
Gorman.  He explained why partial pages should be preserved instead of
being discarded when using fadvise(FADV_DONTNEED).

However, the interesting part is that the actual code did NOT work as
the same as it was described, the partial page was still discarded
anyway, due to a calculation mistake of `end_index' passed to
invalidate_mapping_pages().  This mistake has not been fixed until
recently, that's why we fail to reproduce our problem in old kernels.
The fix is done in commit 18aba41cbf ("mm/fadvise.c: do not discard
partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.

Back to the original testing, our problem becomes that there is a
special case that, if the page-unaligned `endbyte' is also the end of
file, it is not necessary at all to preserve the last partial page, as
we all know no one else will use the rest of it.  It should be safe
enough if we just discard the whole page.  So we add an EOF check in
this patch.

We also find a poosbile real world issue in mainline kernel.  Assume
such scenario: A userspace backup application want to backup a huge
amount of small files (<4k) at once, the developer might (I guess) want
to use fadvise(FADV_DONTNEED) to save memory.  However, FADV_DONTNEED
won't really happen since the only page mapped is a partial page, and
kernel will preserve it.  Our patch also fixes this problem, since we
know the endbyte is EOF, so we discard it.

Here is a simple reproducer to reproduce and verify each scenario we
described above:

  test_fadvise.c
  ==============================
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
  	int i, fd, ret, len;
  	struct stat buf;
  	void *addr;
  	unsigned char *vec;
  	char *strbuf;
  	ssize_t pagesize = getpagesize();
  	ssize_t filesize;

  	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
  	if (fd < 0)
  		return -1;
  	filesize = strtoul(argv[2], NULL, 10);

  	strbuf = malloc(filesize);
  	memset(strbuf, 42, filesize);
  	write(fd, strbuf, filesize);
  	free(strbuf);
  	fsync(fd);

  	len = (filesize + pagesize - 1) / pagesize;
  	printf("length of pages: %d\n", len);

  	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
  	if (addr == MAP_FAILED)
  		return -1;

  	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
  	if (ret < 0)
  		return -1;

  	vec = malloc(len);
  	ret = mincore(addr, filesize, (void *)vec);
  	if (ret < 0)
  		return -1;

  	for (i = 0; i < len; i++)
  		printf("pages[%d]: %x\n", i, vec[i] & 0x1);

  	free(vec);
  	close(fd);

  	return 0;
  }
  ==============================

Test 1: running on kernel with commit 18aba41cbf reverted:

  [root@...par ~]# uname -r
  4.15.0-rc6.revert+
  [root@...par ~]# ./test_fadvise file1 1024
  length of pages: 1
  pages[0]: 0    # <-- partial page discarded
  [root@...par ~]# ./test_fadvise file2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@...par ~]# ./test_fadvise file3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 0    # <-- partial page discarded

Test 2: running on mainline kernel:

  [root@...par ~]# uname -r
  4.15.0-rc6+
  [root@...par ~]# ./test_fadvise test1 1024
  length of pages: 1
  pages[0]: 1    # <-- partial and the only page not discarded
  [root@...par ~]# ./test_fadvise test2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@...par ~]# ./test_fadvise test3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 1    # <-- partial page not discarded

Test 3: running on kernel with this patch:

  [root@...par ~]# uname -r
  4.15.0-rc6.patched+
  [root@...par ~]# ./test_fadvise test1 1024
  length of pages: 1
  pages[0]: 0    # <-- partial page and EOF, discarded
  [root@...par ~]# ./test_fadvise test2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@...par ~]# ./test_fadvise test3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 0    # <-- partial page and EOF, discarded

[akpm@...ux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.com
Signed-off-by: shidao.ytt <shidao.ytt@...baba-inc.com>
Signed-off-by: Caspar Zhang <jinli.zjl@...baba-inc.com>
Reviewed-by: Oliver Yang <zhiche.yy@...baba-inc.com>
Cc: Mel Gorman <mgorman@...hsingularity.net>
Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@...rosoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
---
 mm/fadvise.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -126,7 +126,15 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, l
 		 */
 		start_index = (offset+(PAGE_SIZE-1)) >> PAGE_SHIFT;
 		end_index = (endbyte >> PAGE_SHIFT);
-		if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK) {
+		/*
+		 * The page at end_index will be inclusively discarded according
+		 * by invalidate_mapping_pages(), so subtracting 1 from
+		 * end_index means we will skip the last page.  But if endbyte
+		 * is page aligned or is at the end of file, we should not skip
+		 * that page - discarding the last page is safe enough.
+		 */
+		if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK &&
+				endbyte != inode->i_size - 1) {
 			/* First page is tricky as 0 - 1 = -1, but pgoff_t
 			 * is unsigned, so the end_index >= start_index
 			 * check below would be true and we'll discard the whole


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ