lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Fri, 18 Mar 2011 09:55:40 +0800 From: Shaohua Li <shaohua.li@...el.com> To: Jens Axboe <jaxboe@...ionio.com> Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "hch@...radead.org" <hch@...radead.org>, "jmoyer@...hat.com" <jmoyer@...hat.com>, Vivek Goyal <vgoyal@...hat.com> Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging On Thu, 2011-03-17 at 17:44 +0800, Jens Axboe wrote: > On 2011-03-17 04:19, Shaohua Li wrote: > > On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote: > >> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote: > >>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote: > >>>> 2011/1/22 Jens Axboe <jaxboe@...ionio.com>: > >>>>> Signed-off-by: Jens Axboe <jaxboe@...ionio.com> > >>>>> --- > >>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------ > >>>>> block/elevator.c | 6 +- > >>>>> include/linux/blk_types.h | 2 + > >>>>> include/linux/blkdev.h | 30 ++++ > >>>>> include/linux/elevator.h | 1 + > >>>>> include/linux/sched.h | 6 + > >>>>> kernel/exit.c | 1 + > >>>>> kernel/fork.c | 3 + > >>>>> kernel/sched.c | 11 ++- > >>>>> 9 files changed, 317 insertions(+), 100 deletions(-) > >>>>> > >>>>> diff --git a/block/blk-core.c b/block/blk-core.c > >>>>> index 960f12c..42dbfcc 100644 > >>>>> --- a/block/blk-core.c > >>>>> +++ b/block/blk-core.c > >>>>> @@ -27,6 +27,7 @@ > >>>>> #include <linux/writeback.h> > >>>>> #include <linux/task_io_accounting_ops.h> > >>>>> #include <linux/fault-inject.h> > >>>>> +#include <linux/list_sort.h> > >>>>> > >>>>> #define CREATE_TRACE_POINTS > >>>>> #include <trace/events/block.h> > >>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work) > >>>>> > >>>>> q = container_of(work, struct request_queue, delay_work.work); > >>>>> spin_lock_irq(q->queue_lock); > >>>>> - q->request_fn(q); > >>>>> + __blk_run_queue(q); > >>>>> spin_unlock_irq(q->queue_lock); > >>>>> } > >>>> Hi Jens, > >>>> I have some questions about the per-task plugging. Since the request > >>>> list is per-task, and each task delivers its requests at finish flush > >>>> or schedule. But when one cpu delivers requests to global queue, other > >>>> cpus don't know. This seems to have problem. For example: > >>>> 1. get_request_wait() can only flush current task's request list, > >>>> other cpus/tasks might still have a lot of requests, which aren't sent > >>>> to request_queue. > >>> > >>> But very soon these requests will be sent to request queue as soon task > >>> is either scheduled out or task explicitly flushes the plug? So we might > >>> wait a bit longer but that might not matter in general, i guess. > >> Yes, I understand there is just a bit delay. I don't know how severe it > >> is, but this still could be a problem, especially for fast storage or > >> random I/O. My current tests show slight regression (3% or so) with > >> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the > >> per-task plug, but the per-task plug is highly suspected. > >> > >>>> your ioc-rq-alloc branch is for this, right? Will it > >>>> be pushed to 2.6.39 too? I'm wondering if we should limit per-task > >>>> queue length. If there are enough requests there, we force a flush > >>>> plug. > >>> > >>> That's the idea jens had. But then came the question of maintaining > >>> data structures per task per disk. That makes it complicated. > >>> > >>> Even if we move the accounting out of request queue and do it say at > >>> bdi, ideally we shall to do per task per bdi accounting. > >>> > >>> Jens seemed to be suggesting that generally fluser threads are the > >>> main cluprit for submitting large amount of IO. They are already per > >>> bdi. So probably just maintain a per task limit for flusher threads. > >> Yep, flusher is the main spot in my mind. We need call more flush plug > >> for flusher thread. > >> > >>> I am not sure what happens to direct reclaim path, AIO deep queue > >>> paths etc. > >> direct reclaim path could build deep write queue too. It > >> uses .writepage, currently there is no flush plug there. Maybe we need > >> add flush plug in shrink_inactive_list too. > >> > >>>> 2. some APIs like blk_delay_work, which call __blk_run_queue() might > >>>> not work. because other CPUs might not dispatch their requests to > >>>> request queue. So __blk_run_queue will eventually find no requests, > >>>> which might stall devices. > >>>> Since one cpu doesn't know other cpus' request list, I'm wondering if > >>>> there are other similar issues. > >>> > >>> So again in this case if queue is empty at the time of __blk_run_queue(), > >>> then we will probably just experinece little more delay then intended > >>> till some task flushes. But should not stall the system? > >> not stall the system, but device stalls a little time. > > Jens, > > I need below patch to recover a ffsb fsync workload, which has about 30% > > regression with stack plug. > > I guess the reason is WRITE_SYNC_PLUG doesn't work now, so if a context > > hasn't blk_plug, we lose previous plug (request merge). This suggests > > all places we use WRITE_SYNC_PLUG before (for example, kjournald) should > > have a blk_plug context. > > Good point, those should be auto-converted. I'll take this patch and > double check the others. Thanks! > > Does it remove that performance regression completely? Yes, it removes the regression completely at my side. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@...r.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists