linux-ext4 - Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5d0cd660-251c-423a-8828-5b836a5130f9@gmail.com>
Date: Sun, 8 Dec 2024 17:58:42 +0100
From: Klara Modin <klarasmodin@...il.com>
To: Josef Bacik <josef@...icpanda.com>, kernel-team@...com,
 linux-fsdevel@...r.kernel.org, jack@...e.cz, amir73il@...il.com,
 brauner@...nel.org, torvalds@...ux-foundation.org, viro@...iv.linux.org.uk,
 linux-xfs@...r.kernel.org, linux-btrfs@...r.kernel.org, linux-mm@...ck.org,
 linux-ext4@...r.kernel.org
Subject: Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event
 on page fault

Hi,

On 2024-11-15 16:30, Josef Bacik wrote:
> FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> on the faulting method.
> 
> This pre-content event is meant to be used by hierarchical storage
> managers that want to fill in the file content on first read access.
> 
> Export a simple helper that file systems that have their own ->fault()
> will use, and have a more complicated helper to be do fancy things with
> in filemap_fault.
> 

This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206) 
interacts poorly with some programs which hang and are stuck at 100 % 
sys cpu usage (examples of programs are logrotate and atop with root 
privileges).

I also retested the new version on Jan Kara's for_next branch and it 
behaves the same way.

> Signed-off-by: Josef Bacik <josef@...icpanda.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/filemap.c       | 78 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 79 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 01c5e7a4489f..90155ef8599a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3406,6 +3406,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
>  extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  		pgoff_t start_pgoff, pgoff_t end_pgoff);
>  extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
> +extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
>  
>  extern unsigned long stack_guard_gap;
>  /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 68ea596f6905..0bf7d645dec5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -47,6 +47,7 @@
>  #include <linux/splice.h>
>  #include <linux/rcupdate_wait.h>
>  #include <linux/sched/mm.h>
> +#include <linux/fsnotify.h>
>  #include <asm/pgalloc.h>
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -3289,6 +3290,52 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
>  	return ret;
>  }
>  
> +/**
> + * filemap_fsnotify_fault - maybe emit a pre-content event.
> + * @vmf:	struct vm_fault containing details of the fault.
> + * @folio:	the folio we're faulting in.
> + *
> + * If we have a pre-content watch on this file we will emit an event for this
> + * range.  If we return anything the fault caller should return immediately, we
> + * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
> + * fault again and then the fault handler will run the second time through.
> + *
> + * This is meant to be called with the folio that we will be filling in to make
> + * sure the event is emitted for the correct range.
> + *
> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
> + */
> +vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)

The parameters mentioned above do not seem to match with the function.

> +{
> +	struct file *fpin = NULL;
> +	int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
> +	loff_t pos = vmf->pgoff >> PAGE_SHIFT;
> +	size_t count = PAGE_SIZE;
> +	vm_fault_t ret;
> +
> +	/*
> +	 * We already did this and now we're retrying with everything locked,
> +	 * don't emit the event and continue.
> +	 */
> +	if (vmf->flags & FAULT_FLAG_TRIED)
> +		return 0;
> +
> +	/* No watches, we're done. */
> +	if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
> +		return 0;
> +
> +	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +	if (!fpin)
> +		return VM_FAULT_SIGBUS;
> +
> +	ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
> +	fput(fpin);
> +	if (ret)
> +		return VM_FAULT_SIGBUS;
> +	return VM_FAULT_RETRY;
> +}
> +EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
> +
>  /**
>   * filemap_fault - read in file data for page fault handling
>   * @vmf:	struct vm_fault containing details of the fault



> @@ -3392,6 +3439,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	 * or because readahead was otherwise unable to retrieve it.
>  	 */
>  	if (unlikely(!folio_test_uptodate(folio))) {
> +		/*
> +		 * If this is a precontent file we have can now emit an event to
> +		 * try and populate the folio.
> +		 */
> +		if (!(vmf->flags & FAULT_FLAG_TRIED) &&
> +		    fsnotify_file_has_pre_content_watches(file)) {
> +			loff_t pos = folio_pos(folio);
> +			size_t count = folio_size(folio);
> +
> +			/* We're NOWAIT, we have to retry. */
> +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
> +				folio_unlock(folio);
> +				goto out_retry;
> +			}
> +
> +			if (mapping_locked)
> +				filemap_invalidate_unlock_shared(mapping);
> +			mapping_locked = false;
> +
> +			folio_unlock(folio);
> +			fpin = maybe_unlock_mmap_for_io(vmf, fpin);

When I look at it with GDB it seems to get here, but then always jumps 
to out_retry, which keeps happening when it reenters, and never seems to 
progress beyond from what I could tell.

For logrotate, strace stops at "mmap(NULL, 909, PROT_READ, 
MAP_PRIVATE|MAP_POPULATE, 3, 0".
For atop, strace stops at "mlockall(MCL_CURRENT|MCL_FUTURE".

If I remove this entire patch snippet everything seems to be normal.

> +			if (!fpin)
> +				goto out_retry;
> +
> +			error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
> +							count);
> +			if (error)
> +				ret = VM_FAULT_SIGBUS;
> +			goto out_retry;
> +		}
> +
>  		/*
>  		 * If the invalidate lock is not held, the folio was in cache
>  		 * and uptodate and now it is not. Strange but possible since we

Please let me know if there's anything else you need.

Regards,
Klara Modin
Download attachment "config.gz" of type "application/gzip" (44308 bytes)

Download attachment "gdb-atop-bt.log.gz" of type "application/gzip" (792 bytes)

Download attachment "gdb-logrotate-bt.log.gz" of type "application/gzip" (848 bytes)

View attachment "hang-bisect" of type "text/plain" (2547 bytes)

Download attachment "strace-atop.log.gz" of type "application/gzip" (1846 bytes)

Download attachment "strace-logrotate.log.gz" of type "application/gzip" (1202 bytes)