linux-kernel - Re: [RESEND RFC] zram: Allow rw_page when page isn't written back.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADyq12yo+iREA_4FP1K-b-PekZQdLUDBT_Yd1v2vFyEbtq9ZOA@mail.gmail.com>
Date:   Fri, 30 Sep 2022 15:33:11 -0400
From:   Brian Geffon <bgeffon@...gle.com>
To:     Minchan Kim <minchan@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Nitin Gupta <ngupta@...are.org>,
        Sergey Senozhatsky <senozhatsky@...omium.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Rom Lemarchand <romlem@...gle.com>,
        linux-mm <linux-mm@...ck.org>
Subject: Re: [RESEND RFC] zram: Allow rw_page when page isn't written back.

Hi Minchan,
Thank you for taking a look.

On Fri, Sep 23, 2022 at 3:31 PM Minchan Kim <minchan@...nel.org> wrote:
>
> On Thu, Sep 08, 2022 at 08:50:37AM -0400, Brian Geffon wrote:
> > Today when a zram device has a backing device we change the ops to
> > a new set which does not expose a rw_page method. This prevents the
> > upper layers from trying to issue a synchronous rw. This has the
> > downside that we penalize every rw even when it could possibly
>
> Do you mean addiontal bio alloc/free?
> Please specify something more detail.

Yes, the additional overhead is primarily this bio_alloc/free.

>
> > still be performed as a synchronous rw. By the very nature of
>
> Even though zram go though the block layer in the case, it's still
> synchronous operation against on in-memory compressed data. Only
> asynchrnous IO happens for the data in backing device.
>
> > zram all writes are synchronous so it's unfortunate to have to
> > accept this limitation.
> >
> > This change will always expose a rw_page function and if the page
> > has been written back it will return -EOPNOTSUPP which will force the
> > upper layers to try again with bio.
>
> Sounds a good idea.
>
> >
> > To safely allow a synchronous read to proceed for pages which have not
> > yet written back we introduce a new flag ZRAM_NO_WB. On the first
> > synchronous read if the page is not written back we will set the
> > ZRAM_NO_WB flag. This flag, which is never cleared, prevents writeback
> > from ever happening to that page.
>
> Why do we need a addtional flag?
> Why couldn't we do?
>
> 1. expose the rw_page all the time.
> 2. If the page was written back, just return an error in rw_page to make
> upper layer retry it with bio.

Yes this approach is much simpler, I'll send a new patch.

>
> >
> > This approach works because in the case of zram as a swap backing device
> > the page is going to be removed from zram shortly thereafter so
> > preventing writeback is fine. However, if zram is being used as a
> > generic block device then this might prevent writeback of the page.
> >
> > This proposal is still very much RFC, feedback would be appreciated.
> >
> > Signed-off-by: Brian Geffon <bgeffon@...gle.com>
> > ---
> >  drivers/block/zram/zram_drv.c | 68 +++++++++++++++++++++--------------
> >  drivers/block/zram/zram_drv.h |  1 +
> >  2 files changed, 43 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 92cb929a45b7..22b69e8b6042 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -52,9 +52,6 @@ static unsigned int num_devices = 1;
> >  static size_t huge_class_size;
> >
> >  static const struct block_device_operations zram_devops;
> > -#ifdef CONFIG_ZRAM_WRITEBACK
> > -static const struct block_device_operations zram_wb_devops;
> > -#endif
> >
> >  static void zram_free_page(struct zram *zram, size_t index);
> >  static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
> > @@ -309,7 +306,8 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
> >                */
> >               zram_slot_lock(zram, index);
> >               if (zram_allocated(zram, index) &&
> > -                             !zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
> > +                             !zram_test_flag(zram, index, ZRAM_UNDER_WB) &&
> > +                             !zram_test_flag(zram, index, ZRAM_NO_WB)) {
> >  #ifdef CONFIG_ZRAM_MEMORY_TRACKING
> >                       is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
> >  #endif
> > @@ -439,7 +437,6 @@ static void reset_bdev(struct zram *zram)
> >       filp_close(zram->backing_dev, NULL);
> >       zram->backing_dev = NULL;
> >       zram->bdev = NULL;
> > -     zram->disk->fops = &zram_devops;
> >       kvfree(zram->bitmap);
> >       zram->bitmap = NULL;
> >  }
> > @@ -543,17 +540,6 @@ static ssize_t backing_dev_store(struct device *dev,
> >       zram->backing_dev = backing_dev;
> >       zram->bitmap = bitmap;
> >       zram->nr_pages = nr_pages;
> > -     /*
> > -      * With writeback feature, zram does asynchronous IO so it's no longer
> > -      * synchronous device so let's remove synchronous io flag. Othewise,
> > -      * upper layer(e.g., swap) could wait IO completion rather than
> > -      * (submit and return), which will cause system sluggish.
> > -      * Furthermore, when the IO function returns(e.g., swap_readpage),
> > -      * upper layer expects IO was done so it could deallocate the page
> > -      * freely but in fact, IO is going on so finally could cause
> > -      * use-after-free when the IO is really done.
> > -      */
> > -     zram->disk->fops = &zram_wb_devops;
> >       up_write(&zram->init_lock);
> >
> >       pr_info("setup backing device %s\n", file_name);
> > @@ -722,7 +708,8 @@ static ssize_t writeback_store(struct device *dev,
> >
> >               if (zram_test_flag(zram, index, ZRAM_WB) ||
> >                               zram_test_flag(zram, index, ZRAM_SAME) ||
> > -                             zram_test_flag(zram, index, ZRAM_UNDER_WB))
> > +                             zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
> > +                             zram_test_flag(zram, index, ZRAM_NO_WB))
> >                       goto next;
> >
> >               if (mode & IDLE_WRITEBACK &&
> > @@ -1226,6 +1213,10 @@ static void zram_free_page(struct zram *zram, size_t index)
> >               goto out;
> >       }
> >
> > +     if (zram_test_flag(zram, index, ZRAM_NO_WB)) {
> > +             zram_clear_flag(zram, index, ZRAM_NO_WB);
> > +     }
> > +
> >       /*
> >        * No memory is allocated for same element filled pages.
> >        * Simply clear same page flag.
> > @@ -1654,6 +1645,40 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
> >       index = sector >> SECTORS_PER_PAGE_SHIFT;
> >       offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
> >
> > +#ifdef CONFIG_ZRAM_WRITEBACK
> > +     /*
> > +      * With writeback feature, zram does asynchronous IO so it's no longer
> > +      * synchronous device so let's remove synchronous io flag. Othewise,
> > +      * upper layer(e.g., swap) could wait IO completion rather than
> > +      * (submit and return), which will cause system sluggish.
> > +      * Furthermore, when the IO function returns(e.g., swap_readpage),
> > +      * upper layer expects IO was done so it could deallocate the page
> > +      * freely but in fact, IO is going on so finally could cause
> > +      * use-after-free when the IO is really done.
> > +      *
> > +      * If the page is not currently written back then we may proceed to
> > +      * read the page synchronously, otherwise, we must fail with
> > +      * -EOPNOTSUPP to force the upper layers to use a normal bio.
> > +      */
> > +     zram_slot_lock(zram, index);
> > +     if (zram_test_flag(zram, index, ZRAM_WB) ||
> > +                     zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
> > +             zram_slot_unlock(zram, index);
> > +             /* We cannot proceed with synchronous read */
> > +             return -EOPNOTSUPP;
> > +     }
> > +
> > +     /*
> > +      * Don't allow the page to be written back while we read it,
> > +      * this flag is never cleared. It shouldn't be a problem that
> > +      * we don't clear this flag because in the case of swap this
> > +      * page will be removed shortly after this read anyway.
> > +      */
> > +     if (op == REQ_OP_READ)
> > +             zram_set_flag(zram, index, ZRAM_NO_WB);
> > +     zram_slot_unlock(zram, index);
> > +#endif
> > +
> >       bv.bv_page = page;
> >       bv.bv_len = PAGE_SIZE;
> >       bv.bv_offset = 0;
> > @@ -1827,15 +1852,6 @@ static const struct block_device_operations zram_devops = {
> >       .owner = THIS_MODULE
> >  };
> >
> > -#ifdef CONFIG_ZRAM_WRITEBACK
> > -static const struct block_device_operations zram_wb_devops = {
> > -     .open = zram_open,
> > -     .submit_bio = zram_submit_bio,
> > -     .swap_slot_free_notify = zram_slot_free_notify,
> > -     .owner = THIS_MODULE
> > -};
> > -#endif
> > -
> >  static DEVICE_ATTR_WO(compact);
> >  static DEVICE_ATTR_RW(disksize);
> >  static DEVICE_ATTR_RO(initstate);
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index 158c91e54850..20e4c6a579e0 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -50,6 +50,7 @@ enum zram_pageflags {
> >       ZRAM_UNDER_WB,  /* page is under writeback */
> >       ZRAM_HUGE,      /* Incompressible page */
> >       ZRAM_IDLE,      /* not accessed page since last idle marking */
> > +     ZRAM_NO_WB,     /* Do not allow page to be written back */
> >
> >       __NR_ZRAM_PAGEFLAGS,
> >  };
> > --
> > 2.37.2.789.g6183377224-goog
> >
> >