Message-ID: <1300413340.2337.129.camel@sli10-conroe>
Date:	Fri, 18 Mar 2011 09:55:40 +0800
From:	Shaohua Li <shaohua.li@...el.com>
To:	Jens Axboe <jaxboe@...ionio.com>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"hch@...radead.org" <hch@...radead.org>,
	"jmoyer@...hat.com" <jmoyer@...hat.com>,
	Vivek Goyal <vgoyal@...hat.com>
Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task
 plugging

On Thu, 2011-03-17 at 17:44 +0800, Jens Axboe wrote:
> On 2011-03-17 04:19, Shaohua Li wrote:
> > On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>> 2011/1/22 Jens Axboe <jaxboe@...ionio.com>:
> >>>>> Signed-off-by: Jens Axboe <jaxboe@...ionio.com>
> >>>>> ---
> >>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >>>>>  block/elevator.c          |    6 +-
> >>>>>  include/linux/blk_types.h |    2 +
> >>>>>  include/linux/blkdev.h    |   30 ++++
> >>>>>  include/linux/elevator.h  |    1 +
> >>>>>  include/linux/sched.h     |    6 +
> >>>>>  kernel/exit.c             |    1 +
> >>>>>  kernel/fork.c             |    3 +
> >>>>>  kernel/sched.c            |   11 ++-
> >>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>
> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>> index 960f12c..42dbfcc 100644
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -27,6 +27,7 @@
> >>>>>  #include <linux/writeback.h>
> >>>>>  #include <linux/task_io_accounting_ops.h>
> >>>>>  #include <linux/fault-inject.h>
> >>>>> +#include <linux/list_sort.h>
> >>>>>
> >>>>>  #define CREATE_TRACE_POINTS
> >>>>>  #include <trace/events/block.h>
> >>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>
> >>>>>        q = container_of(work, struct request_queue, delay_work.work);
> >>>>>        spin_lock_irq(q->queue_lock);
> >>>>> -       q->request_fn(q);
> >>>>> +       __blk_run_queue(q);
> >>>>>        spin_unlock_irq(q->queue_lock);
> >>>>>  }
> >>>> Hi Jens,
> >>>> I have some questions about the per-task plugging. The request list
> >>>> is per-task, and each task delivers its requests when it flushes the
> >>>> plug or is scheduled out. But when one CPU delivers requests to the
> >>>> global queue, other CPUs don't know about it. This seems problematic.
> >>>> For example:
> >>>> 1. get_request_wait() can only flush the current task's request list;
> >>>> other cpus/tasks might still have a lot of requests that haven't been
> >>>> sent to the request_queue.
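
For context, the usage pattern this series introduces looks roughly like
the sketch below. blk_start_plug()/blk_finish_plug() are the interfaces
added by the patch; the plug is also flushed implicitly when the task
schedules out, and details of the final interface may differ:

	struct blk_plug plug;

	blk_start_plug(&plug);	/* I/O now collects on the task's plug */
	submit_bio(rw, bio);	/* queued on the per-task list, not yet
				 * on the device's request_queue */
	blk_finish_plug(&plug);	/* sort (note the list_sort.h include in
				 * the diff above) and dispatch to the
				 * request_queue */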
> >>>
> >>> But these requests will be sent to the request queue very soon, as
> >>> soon as the task is either scheduled out or explicitly flushes the
> >>> plug. So we might wait a bit longer, but that might not matter in
> >>> general, I guess.
> >> Yes, I understand it is just a bit of delay. I don't know how severe
> >> it is, but it could still be a problem, especially for fast storage or
> >> random I/O. My current tests show a slight regression (3% or so) with
> >> Jens's for-2.6.39/core branch. I'm still checking whether it's caused
> >> by the per-task plug, but the per-task plug is the prime suspect.
> >>
> >>>> Your ioc-rq-alloc branch is for this, right? Will it be pushed to
> >>>> 2.6.39 too? I'm wondering if we should limit the per-task queue
> >>>> length: if enough requests accumulate there, we force a plug flush.
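
A sketch of what such a cap might look like in the submit path; the
threshold, the counter field, and the flush helper named here are
hypothetical, not taken from the posted patches:

	/* hypothetical: bound the depth of the on-stack plug list */
	if (plug->nr_queued >= PLUG_FLUSH_THRESHOLD) {
		blk_flush_plug(current);	/* force the plugged I/O out */
		plug->nr_queued = 0;
	}
	list_add_tail(&rq->queuelist, &plug->list);
	plug->nr_queued++;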
> >>>
> >>> That's the idea Jens had. But then came the question of maintaining
> >>> data structures per task per disk, which makes it complicated.
> >>>
> >>> Even if we move the accounting out of the request queue and do it,
> >>> say, at the bdi, ideally we would have to do per-task per-bdi
> >>> accounting.
> >>>
> >>> Jens seemed to be suggesting that flusher threads are generally the
> >>> main culprit for submitting large amounts of IO. They are already per
> >>> bdi. So probably just maintain a per-task limit for flusher threads.
> >> Yep, the flusher is the main spot in my mind. We need to flush the
> >> plug more often for the flusher thread.
> >>
> >>> I am not sure what happens to the direct reclaim path, AIO deep
> >>> queue paths, etc.
> >> The direct reclaim path could build a deep write queue too. It uses
> >> .writepage, and currently there is no plug flush there. Maybe we need
> >> to add a plug flush in shrink_inactive_list too (see the sketch
> >> below).
> >>
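A rough sketch of that idea, reusing the blk_start_plug()/
blk_finish_plug() interface from this series; the placement inside
shrink_inactive_list() is illustrative, not an actual merged change:

	/* mm/vmscan.c: hypothetical plug around the reclaim writeout */
	struct blk_plug plug;

	blk_start_plug(&plug);
	/* .writepage calls issued from shrink_page_list() now queue on
	 * the on-stack plug instead of trickling out one by one */
	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
	blk_finish_plug(&plug);	/* dispatch the batched writeout */
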
> >>>> 2. Some APIs like blk_delay_work, which calls __blk_run_queue(),
> >>>> might not work, because other CPUs might not have dispatched their
> >>>> requests to the request queue. So __blk_run_queue() will eventually
> >>>> find no requests, which might stall the device.
> >>>> Since one CPU doesn't know other CPUs' request lists, I'm wondering
> >>>> if there are other similar issues.
> >>>
> >>> So again, in this case, if the queue is empty at the time of
> >>> __blk_run_queue(), then we will probably just experience a little
> >>> more delay than intended until some task flushes. But it should not
> >>> stall the system?
> >> It does not stall the system, but the device stalls for a little
> >> while.
> > Jens,
> > I need the patch below to recover an ffsb fsync workload, which has
> > about a 30% regression with stack plugging.
> > I guess the reason is that WRITE_SYNC_PLUG doesn't work anymore, so if
> > a context doesn't have a blk_plug, we lose the plugging (request
> > merging) we had before. This suggests that all the places where we
> > used WRITE_SYNC_PLUG before (for example, kjournald) should have a
> > blk_plug context.
> 
> Good point, those should be auto-converted. I'll take this patch and
> double-check the others. Thanks!
> 
> Does it remove that performance regression completely?
Yes, it removes the regression completely on my side.
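
The patch itself is not included in this excerpt. As a hedged
illustration of the kind of conversion being discussed, wrapping a
former WRITE_SYNC_PLUG user such as kjournald in a blk_plug context
would look roughly like this (function and placement illustrative):

	/* e.g. around the journal commit writeout that used to rely
	 * on WRITE_SYNC_PLUG for request merging */
	struct blk_plug plug;

	blk_start_plug(&plug);
	/* journal block writes submitted here now merge on the
	 * per-task plug list, as WRITE_SYNC_PLUG used to allow */
	blk_finish_plug(&plug);	/* flush, restoring the lost merging */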
