linux-kernel - Re: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHbLzkrxFrG9ncaFMVZhnXut0VmON0MP1bM=4DqFgwqXGRtoJg@mail.gmail.com>
Date:   Tue, 21 Sep 2021 12:34:44 -0700
From:   Yang Shi <shy828301@...il.com>
To:     Naoya Horiguchi <naoya.horiguchi@...ux.dev>
Cc:     HORIGUCHI NAOYA(堀口 直也) 
        <naoya.horiguchi@....com>, Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Oscar Salvador <osalvador@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>,
        Linux FS-devel Mailing List <linux-fsdevel@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens

On Tue, Sep 21, 2021 at 2:49 AM Naoya Horiguchi
<naoya.horiguchi@...ux.dev> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:17AM -0700, Yang Shi wrote:
> > The current behavior of memory failure is to truncate the page cache
> > regardless of dirty or clean.  If the page is dirty the later access
> > will get the obsolete data from disk without any notification to the
> > users.  This may cause silent data loss.  It is even worse for shmem
> > since shmem is in-memory filesystem, truncating page cache means
> > discarding data blocks.  The later read would return all zero.
> >
> > The right approach is to keep the corrupted page in page cache, any
> > later access would return error for syscalls or SIGBUS for page fault,
> > until the file is truncated, hole punched or removed.  The regular
> > storage backed filesystems would be more complicated so this patch
> > is focused on shmem.  This also unblock the support for soft
> > offlining shmem THP.
> >
> > Signed-off-by: Yang Shi <shy828301@...il.com>
> > ---
> >  mm/memory-failure.c |  3 ++-
> >  mm/shmem.c          | 25 +++++++++++++++++++++++--
> >  2 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 54879c339024..3e06cb9d5121 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
> >       result = ps->action(p, pfn);
> >
> >       count = page_count(p) - 1;
> > -     if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
> > +     if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
> > +         (ps->action == me_pagecache_dirty && result == MF_FAILED))
>
> This new line seems to affect the cases of dirty page cache
> on other filesystems, whose result is to miss "still referenced"
> messages for some unmap failure cases (although it's not so critical).
> So checking filesystem type (for example with shmem_mapping())
> might be helpful?
>
> And I think that if we might want to have some refactoring to pass
> *ps to each ps->action() callback, then move this refcount check to
> the needed places.
> I don't think that we always need the refcount check, for example in
> MF_MSG_KERNEL and MF_MSG_UNKNOWN cases (because no one knows the
> expected values for these cases).

Yeah, seems make sense to me. How's about doing the below (totally untested):

static inline bool check_refcount(struct *page, bool dec)
{
    int count = page_count(page) - 1;

    if (dec || shmem_mapping(page->mapping))
        count -= 1;

    if (count > 0) {
         pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
                       pfn, action_page_types[ps->type], count);
         return false;
    }

    return true;
}

Then call this in the needed me_* functions and return right value per
the return value of it. I think me_swapcache_dirty() is the only place
need pass in true for dec parameter.

>
>
> >               count--;
> >       if (count > 0) {
> >               pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 88742953532c..ec33f4f7173d 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >       struct inode *inode = mapping->host;
> >       struct shmem_inode_info *info = SHMEM_I(inode);
> >       pgoff_t index = pos >> PAGE_SHIFT;
> > +     int ret = 0;
> >
> >       /* i_rwsem is held by caller */
> >       if (unlikely(info->seals & (F_SEAL_GROW |
> > @@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >                       return -EPERM;
> >       }
> >
> > -     return shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +     ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +
> > +     if (!ret) {
>
> Maybe this "!ret" check is not necessary because *pagep is set
> non-NULL only when ret is 0.  It could save one indent level.

Yes, sure.

>
> > +             if (*pagep) {
> > +                     if (PageHWPoison(*pagep)) {
> > +                             unlock_page(*pagep);
> > +                             put_page(*pagep);
> > +                             ret = -EIO;
> > +                     }
> > +             }
> > +     }
> > +
> > +     return ret;
> >  }
> >
> >  static int
> > @@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >                       unlock_page(page);
> >               }
> >
> > +             if (page && PageHWPoison(page)) {
> > +                     error = -EIO;
> > +                     break;
> > +             }
> > +
> >               /*
> >                * We must evaluate after, since reads (unlike writes)
> >                * are called without i_rwsem protection against truncate
> > @@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
> >  #ifdef CONFIG_MIGRATION
> >       .migratepage    = migrate_page,
> >  #endif
> > -     .error_remove_page = generic_error_remove_page,
>
> This change makes truncate_error_page() calls invalidate_inode_page(),
> and in my testing it fails with "Failed to invalidate" message.
> So as a result memory_failure() finally returns with -EBUSY. I'm not
> sure it's expected because this patchset changes to keep error pages
> in page cache as a proper error handling.
> Maybe you can avoid this by defining .error_remove_page in shmem_aops
> which simply returns 0.

Yes, the "Failed to invalidate" message seems confusing. I agree a
shmem specific callback is better.

>
> >  };
> >  EXPORT_SYMBOL(shmem_aops);
> >
> > @@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> >               page = ERR_PTR(error);
> >       else
> >               unlock_page(page);
> > +
> > +     if (PageHWPoison(page))
> > +             page = NULL;
> > +
> >       return page;
>
> One more comment ...
>
>   - I guess that you add PageHWPoison() checks after some call sites
>     of shmem_getpage() and shmem_getpage_gfp(), but seems not cover all.
>     For example, mcontinue_atomic_pte() in mm/userfaultfd.c can properly
>     handle PageHWPoison?

No, I didn't touch anything outside shmem.c. I could add this in the
next version.

BTW, I just found another problem for the change in
shmem_read_mapping_page_gfp(), it should return ERR_PTR(-EIO) instead
of NULL since the callers may not handle NULL. Will fix in the next
version too.

>
> I'm trying to test more detail, but in my current understanding,
> this patch looks promising to me.  Thank you for your effort.

Thank a lot for taking time do the review.

>
> Thanks,
> Naoya Horiguchi