Message-ID: <20190325121628.zxlogz52go6k36on@wfg-t540p.sh.intel.com>
Date: Mon, 25 Mar 2019 20:16:28 +0800
From: Fengguang Wu <fengguang.wu@...el.com>
To: Martin Liu <liumartin@...gle.com>
Cc: akpm@...ux-foundation.org, axboe@...nel.dk, dchinner@...hat.com,
jenhaochen@...gle.com, salyzyn@...gle.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-block@...r.kernel.org
Subject: Re: [RFC PATCH] mm: readahead: add readahead_shift into backing
device
Martin,
On Fri, Mar 22, 2019 at 11:46:11PM +0800, Martin Liu wrote:
>As in the discussion at https://lore.kernel.org/patchwork/patch/334982/,
>we know an open file's ra_pages can fall out of sync with bdi.ra_pages
>after sequential reads, random reads or read errors. In the current
>design we have to ask users to reopen the file or issue an fadvise()
>system call to bring it back in sync. However, there are cases where
>we want to change the system-wide file ra_pages to enhance system
>performance, such as improving boot time by increasing ra_pages, or
>decreasing it to
Do you have examples of distros making use of larger ra_pages for
boot time optimization?
There are fadvise() based readahead/preload tools that can be more
efficient for that purpose.
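For example, a boot-time preloader can be as simple as the sketch
below (a hypothetical standalone tool, not any particular distro's
implementation):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Ask the kernel to read each named file's pages in the background. */
int main(int argc, char **argv)
{
	for (int i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);
		int err;

		if (fd < 0) {
			perror(argv[i]);
			continue;
		}
		/* offset=0, len=0 means the whole file */
		err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
		if (err)
			fprintf(stderr, "%s: %s\n", argv[i], strerror(err));
		close(fd);
	}
	return 0;
}

Unlike raising ra_pages globally, this only touches the files the
tool is pointed at.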
>mitigate thrashing under memory pressure. Those are the cases where
Suppose N read streams with equal read speed. The thrash-free memory
requirement would be (N * 2 * ra_pages).
If N=1000 and ra_pages=1MB, it'd require 2GB of memory, which looks
affordable in mainstream servers.
Also we have try_context_readahead(), which is good at estimating the
thrash-free readahead size -- as long as the system load and read
speed remain stable, and node/zone reclaims are balanced.
For normal systems without a huge number of slow read streams, e.g.
desktops, reducing ra_pages may save a little memory by reducing
readahead misses, especially on mmap() read-around, though at the
risk of issuing more IOs. So I'd imagine only desktops with 1GB or
less memory really need to worry about a conservative readahead size.
>we need to change this value globally and immediately. However,
Such a requirement should be rare enough?
OTOH, the current Linux behavior of "changing ra_pages only affects
future file opens" does not seem to be what a typical end user would
expect and understand.
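That is because the per-file value is copied from the bdi only once
at open time; mm/readahead.c does (roughly):

void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
	/* One-time snapshot of the bdi-wide default */
	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
	ra->prev_pos = -1;
}

After that, nothing propagates later changes of bdi->ra_pages into
existing file_ra_state instances.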
>from the discussion, we know the bdi readahead value tends to be
>optimal as a physical property of the storage and doesn't tend to
>change dynamically. Thus we'd like to propose adding readahead_shift
>to the backing device. This value is used as a shift bit count
>applied to the bdi readahead value to compute the effective readahead
>size, with a negative value meaning division in the fixup function.
>The fixup only takes effect when the shift value is non-zero;
>otherwise, it returns the file's ra_pages directly. For example, a
>shift of -1 halves the effective readahead. Thus, an administrator
>could use this value to control the amount of readahead for a given
>backing device to achieve their goal.
Sorry, but this sounds like an unnecessarily twisted new interface.
I'm afraid it fixes the pain for 0.001% of users while bringing more
puzzlement to the majority of others.
If we make up our minds to fix the problem, it'd be more acceptable
to limit the change to an internal per-file data structure, like
this:
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -884,7 +884,7 @@ struct file_ra_state {
unsigned int async_size; /* do asynchronous readahead when
there are only # of pages ahead */
- unsigned int ra_pages; /* Maximum readahead window */
+ unsigned int ra_pages_shift;
unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
loff_t prev_pos; /* Cache last read() position */
};
Then let fadvise() and shrink_readahead_size_eio() adjust that
per-file ra_pages_shift.
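A rough sketch of what I mean (helper names and usage hypothetical,
just to illustrate the direction):

static inline unsigned long ra_max_pages(struct file *file)
{
	struct file_ra_state *ra = &file->f_ra;
	struct backing_dev_info *bdi = inode_to_bdi(file->f_mapping->host);

	/*
	 * Derive the effective window from the bdi-wide default at
	 * each use, so writes to read_ahead_kb reach already-open
	 * files too. The per-file state only remembers how far below
	 * the default this file has been scaled back.
	 */
	return bdi->ra_pages >> ra->ra_pages_shift;
}

static void shrink_readahead_size_eio(struct file_ra_state *ra)
{
	/* On IO errors, halve the window instead of clobbering ra_pages */
	if (ra->ra_pages_shift < BITS_PER_LONG - 1)
		ra->ra_pages_shift++;
}

That keeps the admin-visible tunable surface unchanged (read_ahead_kb)
while fixing the staleness for already-open files.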
Cheers,
Fengguang
>Signed-off-by: Martin Liu <liumartin@...gle.com>
>---
>I'd like to get some feedback on the patch. If you have any
>concerns, please let me know. Thanks.
>---
> block/blk-sysfs.c | 28 ++++++++++++++++++++++++++++
> include/linux/backing-dev-defs.h | 1 +
> include/linux/backing-dev.h | 10 ++++++++++
> mm/backing-dev.c | 27 +++++++++++++++++++++++++++
> mm/filemap.c | 8 +++++---
> mm/readahead.c | 2 +-
> 6 files changed, 72 insertions(+), 4 deletions(-)
>
>diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>index 59685918167e..f48b68e05d0e 100644
>--- a/block/blk-sysfs.c
>+++ b/block/blk-sysfs.c
>@@ -107,6 +107,27 @@ queue_ra_store(struct request_queue *q, const char *page, size_t count)
> return ret;
> }
>
>+static ssize_t queue_ra_shift_show(struct request_queue *q, char *page)
>+{
>+ int ra_shift = q->backing_dev_info->read_ahead_shift;
>+
>+ return scnprintf(page, PAGE_SIZE - 1, "%d\n", ra_shift);
>+}
>+
>+static ssize_t queue_ra_shift_store(struct request_queue *q, const char *page,
>+ size_t count)
>+{
>+ s8 ra_shift;
>+ ssize_t ret = kstrtos8(page, 10, &ra_shift);
>+
>+ if (ret)
>+ return ret;
>+
>+ q->backing_dev_info->read_ahead_shift = ra_shift;
>+
>+ return count;
>+}
>+
> static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
> {
> int max_sectors_kb = queue_max_sectors(q) >> 1;
>@@ -540,6 +561,12 @@ static struct queue_sysfs_entry queue_ra_entry = {
> .store = queue_ra_store,
> };
>
>+static struct queue_sysfs_entry queue_ra_shift_entry = {
>+ .attr = {.name = "read_ahead_shift", .mode = 0644 },
>+ .show = queue_ra_shift_show,
>+ .store = queue_ra_shift_store,
>+};
>+
> static struct queue_sysfs_entry queue_max_sectors_entry = {
> .attr = {.name = "max_sectors_kb", .mode = 0644 },
> .show = queue_max_sectors_show,
>@@ -729,6 +756,7 @@ static struct queue_sysfs_entry throtl_sample_time_entry = {
> static struct attribute *default_attrs[] = {
> &queue_requests_entry.attr,
> &queue_ra_entry.attr,
>+ &queue_ra_shift_entry.attr,
> &queue_max_hw_sectors_entry.attr,
> &queue_max_sectors_entry.attr,
> &queue_max_segments_entry.attr,
>diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
>index 07e02d6df5ad..452002df5232 100644
>--- a/include/linux/backing-dev-defs.h
>+++ b/include/linux/backing-dev-defs.h
>@@ -167,6 +167,7 @@ struct bdi_writeback {
> struct backing_dev_info {
> struct list_head bdi_list;
> unsigned long ra_pages; /* max readahead in PAGE_SIZE units */
>+ s8 read_ahead_shift; /* shift to ra_pages */
> unsigned long io_pages; /* max allowed IO size */
> congested_fn *congested_fn; /* Function pointer if device is md/dm */
> void *congested_data; /* Pointer to aux data for congested func */
>diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
>index f9b029180241..23f755fe386e 100644
>--- a/include/linux/backing-dev.h
>+++ b/include/linux/backing-dev.h
>@@ -221,6 +221,16 @@ static inline int bdi_sched_wait(void *word)
> return 0;
> }
>
>+static inline unsigned long read_ahead_fixup(struct file *file)
>+{
>+ unsigned long ra = file->f_ra.ra_pages;
>+ s8 shift = (inode_to_bdi(file->f_mapping->host))->read_ahead_shift;
>+
>+ ra = (shift >= 0) ? ra << shift : ra >> abs(shift);
>+
>+ return ra;
>+}
>+
> #ifdef CONFIG_CGROUP_WRITEBACK
>
> struct bdi_writeback_congested *
>diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>index 72e6d0c55cfa..1b5163a5ead5 100644
>--- a/mm/backing-dev.c
>+++ b/mm/backing-dev.c
>@@ -172,6 +172,32 @@ static DEVICE_ATTR_RW(name);
>
> BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
>
>+static ssize_t read_ahead_shift_store(struct device *dev,
>+ struct device_attribute *attr,
>+ const char *buf, size_t count)
>+{
>+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
>+ s8 read_ahead_shift;
>+ ssize_t ret = kstrtos8(buf, 10, &read_ahead_shift);
>+
>+ if (ret)
>+ return ret;
>+
>+ bdi->read_ahead_shift = read_ahead_shift;
>+
>+ return count;
>+}
>+
>+static ssize_t read_ahead_shift_show(struct device *dev,
>+ struct device_attribute *attr, char *page)
>+{
>+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
>+
>+ return scnprintf(page, PAGE_SIZE - 1, "%d\n", bdi->read_ahead_shift);
>+}
>+
>+static DEVICE_ATTR_RW(read_ahead_shift);
>+
> static ssize_t min_ratio_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t count)
> {
>@@ -223,6 +249,7 @@ static DEVICE_ATTR_RO(stable_pages_required);
>
> static struct attribute *bdi_dev_attrs[] = {
> &dev_attr_read_ahead_kb.attr,
>+ &dev_attr_read_ahead_shift.attr,
> &dev_attr_min_ratio.attr,
> &dev_attr_max_ratio.attr,
> &dev_attr_stable_pages_required.attr,
>diff --git a/mm/filemap.c b/mm/filemap.c
>index d78f577baef2..cd8d0e00ff93 100644
>--- a/mm/filemap.c
>+++ b/mm/filemap.c
>@@ -2469,6 +2469,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> struct address_space *mapping = file->f_mapping;
> struct file *fpin = NULL;
> pgoff_t offset = vmf->pgoff;
>+ unsigned long max;
>
> /* If we don't want any read-ahead, don't bother */
> if (vmf->vma->vm_flags & VM_RAND_READ)
>@@ -2498,9 +2499,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> * mmap read-around
> */
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>- ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
>- ra->size = ra->ra_pages;
>- ra->async_size = ra->ra_pages / 4;
>+ max = read_ahead_fixup(file);
>+ ra->start = max_t(long, 0, offset - max / 2);
>+ ra->size = max;
>+ ra->async_size = max / 4;
> ra_submit(ra, mapping, file);
> return fpin;
> }
>diff --git a/mm/readahead.c b/mm/readahead.c
>index a4593654a26c..546bf344fc4d 100644
>--- a/mm/readahead.c
>+++ b/mm/readahead.c
>@@ -384,7 +384,7 @@ ondemand_readahead(struct address_space *mapping,
> unsigned long req_size)
> {
> struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>- unsigned long max_pages = ra->ra_pages;
>+ unsigned long max_pages = read_ahead_fixup(filp);
> unsigned long add_pages;
> pgoff_t prev_offset;
>
>--
>2.21.0.392.gf8f6787159e-goog
>