Message-ID: <e9f65f75-7f0c-423f-9fd4-b29dd006852b@amd.com>
Date: Mon, 9 Dec 2024 16:15:32 +0530
From: "Aithal, Srikanth" <sraithal@....com>
To: Klara Modin <klarasmodin@...il.com>, Josef Bacik <josef@...icpanda.com>,
kernel-team@...com, linux-fsdevel@...r.kernel.org, jack@...e.cz,
amir73il@...il.com, brauner@...nel.org, torvalds@...ux-foundation.org,
viro@...iv.linux.org.uk, linux-xfs@...r.kernel.org,
linux-btrfs@...r.kernel.org, linux-mm@...ck.org, linux-ext4@...r.kernel.org,
Linux-Next Mailing List <linux-next@...r.kernel.org>
Subject: Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event
on page fault
On 12/8/2024 10:28 PM, Klara Modin wrote:
> Hi,
>
> On 2024-11-15 16:30, Josef Bacik wrote:
>> FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
>> on the faulting method.
>>
>> This pre-content event is meant to be used by hierarchical storage
>> managers that want to fill in the file content on first read access.
>>
>> Export a simple helper that file systems with their own ->fault()
>> will use, and add a more complicated helper that does fancier things
>> in filemap_fault.
>>
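For context, a minimal sketch of how the exported helper is meant to be
used from a filesystem's own ->fault() handler (myfs_fault() below is
hypothetical; per the helper's kernel-doc further down, any non-zero
return must be propagated immediately):

static vm_fault_t myfs_fault(struct vm_fault *vmf)
{
	vm_fault_t ret;

	/* Emit the pre-content event first; a non-zero return
	 * (typically VM_FAULT_RETRY once an event was emitted)
	 * must go straight back to the fault core. */
	ret = filemap_fsnotify_fault(vmf);
	if (unlikely(ret))
		return ret;

	/* On the retry, FAULT_FLAG_TRIED is set, the helper
	 * returns 0, and the normal fault path runs. */
	return filemap_fault(vmf);
}
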
>
> This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
> interacts poorly with some programs, which hang stuck at 100% sys
> CPU usage (examples are logrotate and atop run with root
> privileges).
>
> I also retested the new version on Jan Kara's for_next branch and it
> behaves the same way.

From next-20241206 onward we started hitting an issue where KVM guests
running such a kernel on AMD platforms fail to shut down; they hang
forever with the errors below:
[ OK ] Reached target Late Shutdown Services.
[ OK ] Finished System Power Off.
[ OK ] Reached target System Power Off.
[ 128.946271] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Connection refused
[ 198.945362] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 298.945402] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 378.945345] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 488.945402] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 558.945904] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 632.945409] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 738.945403] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 848.945342] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
..
..
Bisecting the issue pointed to this patch.
commit 0790303ec869d0fd658a548551972b51ced7390c
Author: Josef Bacik <josef@...icpanda.com>
Date:   Fri Nov 15 10:30:29 2024 -0500

    fsnotify: generate pre-content permission event on page fault

The same issue exists with today's linux-next build as well.
Adding the configs below to the guest kernel config fixes the shutdown
hang:
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
Regards,
Srikanth Aithal
>
>> Signed-off-by: Josef Bacik <josef@...icpanda.com>
>> ---
>> include/linux/mm.h | 1 +
>> mm/filemap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 79 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 01c5e7a4489f..90155ef8599a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3406,6 +3406,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
>> extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>> pgoff_t start_pgoff, pgoff_t end_pgoff);
>> extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
>> +extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
>>
>> extern unsigned long stack_guard_gap;
>> /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 68ea596f6905..0bf7d645dec5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -47,6 +47,7 @@
>> #include <linux/splice.h>
>> #include <linux/rcupdate_wait.h>
>> #include <linux/sched/mm.h>
>> +#include <linux/fsnotify.h>
>> #include <asm/pgalloc.h>
>> #include <asm/tlbflush.h>
>> #include "internal.h"
>> @@ -3289,6 +3290,52 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
>> return ret;
>> }
>>
>> +/**
>> + * filemap_fsnotify_fault - maybe emit a pre-content event.
>> + * @vmf: struct vm_fault containing details of the fault.
>> + * @folio: the folio we're faulting in.
>> + *
>> + * If we have a pre-content watch on this file we will emit an event for this
>> + * range. If we return anything the fault caller should return immediately, we
>> + * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
>> + * fault again and then the fault handler will run the second time through.
>> + *
>> + * This is meant to be called with the folio that we will be filling in to make
>> + * sure the event is emitted for the correct range.
>> + *
>> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
>> + */
>> +vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
>
> The parameters mentioned above do not seem to match with the function.
>
>> +{
>> + struct file *fpin = NULL;
>> + int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
>> + loff_t pos = vmf->pgoff >> PAGE_SHIFT;
>> + size_t count = PAGE_SIZE;
>> + vm_fault_t ret;
>> +
>> + /*
>> + * We already did this and now we're retrying with everything locked,
>> + * don't emit the event and continue.
>> + */
>> + if (vmf->flags & FAULT_FLAG_TRIED)
>> + return 0;
>> +
>> + /* No watches, we're done. */
>> + if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
>> + return 0;
>> +
>> + fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> + if (!fpin)
>> + return VM_FAULT_SIGBUS;
>> +
>> + ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
>> + fput(fpin);
>> + if (ret)
>> + return VM_FAULT_SIGBUS;
>> + return VM_FAULT_RETRY;
>> +}
>> +EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
>> +
>> /**
>> * filemap_fault - read in file data for page fault handling
>> * @vmf: struct vm_fault containing details of the fault
>
>
>
>> @@ -3392,6 +3439,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>> * or because readahead was otherwise unable to retrieve it.
>> */
>> if (unlikely(!folio_test_uptodate(folio))) {
>> + /*
>> + * If this is a precontent file we can now emit an event to
>> + * try and populate the folio.
>> + */
>> + if (!(vmf->flags & FAULT_FLAG_TRIED) &&
>> + fsnotify_file_has_pre_content_watches(file)) {
>> + loff_t pos = folio_pos(folio);
>> + size_t count = folio_size(folio);
>> +
>> + /* We're NOWAIT, we have to retry. */
>> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
>> + folio_unlock(folio);
>> + goto out_retry;
>> + }
>> +
>> + if (mapping_locked)
>> + filemap_invalidate_unlock_shared(mapping);
>> + mapping_locked = false;
>> +
>> + folio_unlock(folio);
>> + fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>
> When I look at it with GDB it seems to get here, but then always jumps
> to out_retry, which keeps happening when it reenters, and never seems to
> progress beyond from what I could tell.
>
> For logrotate, strace stops at "mmap(NULL, 909, PROT_READ, MAP_PRIVATE|MAP_POPULATE, 3, 0".
> For atop, strace stops at "mlockall(MCL_CURRENT|MCL_FUTURE".
>
> If I remove this entire patch snippet everything seems to be normal.
>
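As an aside, the logrotate strace above suggests a minimal reproducer
along these lines (untested sketch; the file path and the 909-byte
length are illustrative, any file-backed MAP_POPULATE mapping should
exercise the same fault path):

/* repro.c - prefault a file-backed mapping the way logrotate does.
 * With the retry loop described above, the MAP_POPULATE mmap()
 * would never complete. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/etc/logrotate.conf", O_RDONLY); /* illustrative path */
	if (fd < 0) {
		perror("open");
		return 1;
	}
	void *p = mmap(NULL, 909, PROT_READ,
		       MAP_PRIVATE | MAP_POPULATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	munmap(p, 909);
	close(fd);
	return 0;
}
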
>> + if (!fpin)
>> + goto out_retry;
>> +
>> + error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
>> + count);
>> + if (error)
>> + ret = VM_FAULT_SIGBUS;
>> + goto out_retry;
>> + }
>> +
>> /*
>> * If the invalidate lock is not held, the folio was in cache
>> * and uptodate and now it is not. Strange but possible since we
>
> Please let me know if there's anything else you need.
>
> Regards,
> Klara Modin