Message-Id: <B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com>
Date:   Fri, 28 Oct 2022 16:57:32 -0700
From:   Nadav Amit <nadav.amit@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Jann Horn <jannh@...gle.com>, John Hubbard <jhubbard@...dia.com>,
        X86 ML <x86@...nel.org>, Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        jroedel@...e.de, ubizjak@...il.com,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

On Oct 27, 2022, at 2:44 PM, Nadav Amit <nadav.amit@...il.com> wrote:

> Anyhow, admittedly, I need to give it more thought. For instance, with
> respect to the code that you mentioned (in zap_pte_range), after reading it
> again, it seems strange: how is it ok to defer the TLB flush until after the
> rmap was removed, even if it is done while the PTL is held?
> folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page
> might be later re-dirtied using a stale cached PTE. Oh well.

Peter,

So flushing the TLB only after removing the rmap does appear to be a problem.
I attach a PoC that demonstrates it.

The problem is in the following code of zap_pte_range():

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
		...
	}
	page_remove_rmap(page, vma, false);

Once we remove the rmap, rmap_walk() will not acquire the page-table lock
anymore. As a result, nothing prevents the kernel from performing writeback
and cleaning the page-struct dirty bit before the TLB is actually flushed.
So if another thread still has the dirty PTE cached in its TLB, it can keep
writing to the page, and nothing (neither the PTE nor the page-struct) will
record that the page has been dirtied.
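
To spell out the interleaving I have in mind (a reconstruction based on the
snippet above, not a trace):

	CPU 0 (zap_pte_range)			CPU 1 (user thread)
	---------------------			-------------------
	clear PTE, see pte_dirty(ptent)
	set_page_dirty(page)
	page_remove_rmap(page, vma, false)
						write through the stale
						dirty TLB entry
	[writeback cleans the page; no PTL
	 sync, since the rmap is gone]
						write again; neither the PTE
						nor the page-struct records it
	TLB flush only happens here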

In other words, writes to the memory-mapped file that happen after
munmap()/MADV_DONTNEED has started can be lost.

This affects both munmap() and MADV_DONTNEED. One might argue that if you
keep writing after munmap()/MADV_DONTNEED has started, it is your own fault
(questionable); but if that were the case, why would we bother with
force_flush at all?

If we want correct behavior, i.e., for writes after munmap()/MADV_DONTNEED
to propagate to the file, we need to collect the dirty pages and remove their
rmap only after we flush the TLB (and before we release the PTL). mmu_gather
would probably need to be changed for this.
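
Roughly along these lines, where tlb_defer_rmap() and
tlb_flush_deferred_rmaps() are made-up names and the flush call is schematic;
the point is only the ordering (TLB flush first, rmap removal second, with
the PTL held throughout):

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
		...
		/* defer instead of calling page_remove_rmap() here */
		tlb_defer_rmap(tlb, page, vma);
	} else {
		page_remove_rmap(page, vma, false);
	}

	/* later, with the PTL still held: */
	flush_tlb_range(...);		/* flush the TLB first... */
	tlb_flush_deferred_rmaps(tlb);	/* ...then drop the rmaps */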

Thoughts?

---

Some details about the PoC (below):

The PoC uses three threads that operate on a 2MB range:

1. Maintains a counter in the first 4KB of the 2MB range and checks that
   its updates actually reach memory.

2. Dirties the pages in the 2MB range (just to make the race more likely).

3. Either (i) maps the same file over the first 4KB again (to exercise
   munmap() indirectly); or (ii) runs MADV_DONTNEED on the first 4KB.

In addition, a child process runs msync() and
posix_fadvise(POSIX_FADV_DONTNEED) to write back the first 4KB and drop it
from the page cache.

The PoC should be run against a file on a RAM block device. It triggers the
issue relatively reliably (within 60 seconds) with munmap() and slightly less
reliably with MADV_DONTNEED. I have no idea whether it works in a VM, with
deep C-states, etc.

-- >8 --

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define handle_error(msg) \
   do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment out MUNMAP_TEST to run the MADV_DONTNEED test instead.
#define MUNMAP_TEST

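// Thread (2) from the description above: keeps dirtying the tail of the
// 2MB range to make the race more likely.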
void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

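// Thread (1) from the description above: maintains a counter in the first
// 4KB and checks that its updates actually stick.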
void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long*)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %lu, got %lu\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

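// Thread (3) from the description above: repeatedly re-maps the first 4KB
// (MUNMAP_TEST) or runs MADV_DONTNEED on it.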
void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

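// Child process: repeatedly writes back the first 4KB and drops it from
// the page cache.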
void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and to allow dropping the
	// page cache.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void*) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};	
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = calloc(1, PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;
	
	if (argc < 2) {
		fprintf(stderr, "usage: %s <filename>\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}
	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
                   -1, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap anon");

	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");
	
	p = ptr;

	// Run the flushing process in a different process, so msync() would
	// not serialize on our mmap_lock. Fork before creating the threads so
	// that flusher_pid is set before checking_thread can try to kill it.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	sleep(60);

	stop = true;
	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);
	kill(flusher_pid, SIGTERM);
	printf("Finished without an error\n");

	exit(0);
}
