Date:	Tue, 27 Sep 2011 14:33:40 +0800
From:	Shaohua Li <shaohua.li@...el.com>
To:	Corrado Zoccolo <czoccolo@...il.com>
Cc:	Vivek Goyal <vgoyal@...hat.com>,
	Maxim Patlasov <maxim.patlasov@...il.com>,
	Jens Axboe <jaxboe@...ionio.com>,
	lkml <linux-kernel@...r.kernel.org>
Subject: Re: [patch]cfq-iosched: delete deep seeky queue idle logic

On Tue, 2011-09-27 at 14:07 +0800, Corrado Zoccolo wrote:
> On Mon, Sep 26, 2011 at 2:55 AM, Shaohua Li <shaohua.li@...el.com> wrote:
> > On Fri, 2011-09-23 at 13:50 +0800, Corrado Zoccolo wrote:
> >> On 21 Sep 2011 13:16, "Shaohua Li" <shaohua.li@...el.com> wrote:
> >> >
> >> > On Sat, 2011-09-17 at 03:25 +0800, Corrado Zoccolo wrote:
> >> > > On Fri, Sep 16, 2011 at 8:40 AM, Shaohua Li <shaohua.li@...el.com> wrote:
> >> > > > On Fri, 2011-09-16 at 14:04 +0800, Corrado Zoccolo wrote:
> >> > > >> On Fri, Sep 16, 2011 at 5:09 AM, Shaohua Li <shaohua.li@...el.com> wrote:
> >> > > >> > Recently Maxim and I discussed why his aiostress workload performs
> >> > > >> > poorly. If you didn't follow the discussion, here are the issues we found:
> >> > > >> > 1. cfq seeky detection isn't good. Assume a task accesses sectors
> >> > > >> > A, B, C, D, A+1, B+1, C+1, D+1, A+2... Accessing A, B, C, D is random,
> >> > > >> > so cfq will detect the queue as seeky, but by the time A+1 is accessed
> >> > > >> > it is already in the disk cache, so the queue should really be detected
> >> > > >> > as sequential. Not sure if any real workload has such an access pattern,
> >> > > >> > and it doesn't seem easy to find a clean fix either. Any idea for this?
> >> > > >>
> >> > > >> Not all disks will cache 4 independent streams; we can't make that
> >> > > >> assumption in cfq.
> >> > > > Sure thing, we can't make such an assumption. I'm wondering if we should
> >> > > > move the seeky detection to request completion: if the time between two
> >> > > > request completions is short, we consider the queue sequential. This would
> >> > > > make the detection adaptive, but it seems the time measurement isn't easy.
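
For illustration, here is a minimal sketch of the completion-time idea above.
This is hypothetical, not cfq code: the struct, the field names (last_end_ns,
seek_history) and the 0.5 ms threshold are all invented, only to show the shape
of the heuristic (back-to-back completions count as sequential regardless of
the submitted sector distance):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch only, not from cfq-iosched.c: detect seekiness at
 * request completion time rather than by sector distance.
 */
#define SEQ_COMPLETION_THR_NS   (500 * 1000ULL)  /* 0.5 ms, arbitrary cutoff */

struct cfqq_sketch {
        uint64_t last_end_ns;   /* completion time of the previous request */
        uint8_t  seek_history;  /* 1 bit per recent request, 1 = looked seeky */
};

/* Call when a request from this queue completes, with the current time. */
static void sketch_update_on_completion(struct cfqq_sketch *q, uint64_t now_ns)
{
        bool fast = q->last_end_ns &&
                    (now_ns - q->last_end_ns) < SEQ_COMPLETION_THR_NS;

        /*
         * A back-to-back completion suggests the data was already in the
         * drive's cache (or physically close), so count it as sequential
         * even if the submitted sector distance looked random, e.g. the
         * A, B, C, D, A+1, B+1, ... pattern above.
         */
        q->seek_history = (uint8_t)((q->seek_history << 1) | !fast);
        q->last_end_ns = now_ns;
}

/* The queue is considered seeky if most recent completions were slow. */
static bool sketch_seeky(const struct cfqq_sketch *q)
{
        return __builtin_popcount(q->seek_history) > 4;
}

As noted above, the hard part is that "short" depends on the device, so the
fixed cutoff here is exactly the piece that would need to be made adaptive.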
> >> > > >
> >> > > >> The current behaviour of assuming it is seeky should work well enough;
> >> > > >> in fact it will be put in the seeky tree, and it can enjoy the seeky
> >> > > >> tree's quantum of time. If the second round takes a short time, it will
> >> > > >> be able to schedule a third round again after the idle time.
> >> > > >> If there are other seeky processes competing for the tree, the cache
> >> > > >> can be cleared by the time it gets back to your 4-stream process, so
> >> > > >> it will behave exactly as a seeky process from cfq's point of view.
> >> > > >> If the various accesses were submitted in parallel, the deep seeky
> >> > > >> queue logic should kick in and make sure the process gets a sequential
> >> > > >> quantum, rather than sharing it with other seeky processes, so
> >> > > >> depending on your disk, it could perform better.
> >> > > > Yes, the idle logic makes it OK, but it sounds like "make things wrong
> >> > > > first (in the seeky detection) and then fix it later (with the idle logic)".
> >> > > >
> >> > > >> > 2. deep seeky queue idle. This makes raid perform poorly. I think we
> >> > > >> > should revert the logic. A deep queue is more popular with high-end
> >> > > >> > hardware, and on such hardware we'd better not idle.
> >> > > >> > Note, currently we set a queue's slice after the first request is
> >> > > >> > finished. This means the drive already idles for a little time. If the
> >> > > >> > queue is truly deep, new requests should already have come in, so
> >> > > >> > idling isn't required.
> >> > > > What do you think about this? Assume a seeky request takes a long time,
> >> > > > so the queue is already idling for a little while.
> >> > > I don't think I understand. If cfq doesn't idle, it will dispatch
> >> > > another request from the same or another queue (if present)
> >> > > immediately, until all possible in-flight requests are sent. Now, you
> >> > > depend on NCQ for the order in which requests are handled, so you
> >> > > cannot guarantee fairness any more.
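
To restate the trade-off being argued here in one place, a highly simplified
illustration (invented function and parameter names, not the actual cfq idle
or service-tree code): marking a seeky queue "deep" keeps idling enabled for
it, which bounds it to its own time slice on a single spindle but leaves a
RAID/NCQ device partially idle.

#include <stdbool.h>

enum next_action {
        IDLE_WAIT_FOR_SAME_TASK,        /* arm the idle timer, keep the slice */
        DISPATCH_ANOTHER_QUEUE          /* expire and move on immediately */
};

static enum next_action after_queue_drains(bool queue_is_deep_and_seeky,
                                           bool device_is_raid_or_fast_ncq)
{
        if (queue_is_deep_and_seeky && !device_is_raid_or_fast_ncq)
                /*
                 * Single spindle: give the deep queue its own slice and idle
                 * on it, so its many outstanding requests cannot crowd out
                 * shallow seeky queues through NCQ reordering (the
                 * "both get a 100ms slice" behaviour mentioned later in the
                 * thread).
                 */
                return IDLE_WAIT_FOR_SAME_TASK;

        /*
         * RAID / fast NCQ hardware: idling leaves most of the device's
         * parallelism unused, which is the regression the patch at the top
         * of this thread is trying to avoid by dropping the idle.
         */
        return DISPATCH_ANOTHER_QUEUE;
}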
> >> > >
> >> > > >
> >> > > >> > Looks like Vivek once posted a patch to revert it, but it got ignored:
> >> > > >> > http://us.generation-nt.com/patch-cfq-iosched-revert-logic-deep-queues-help-198339681.html
> >> > > >> I get a 404 here. I think you are seeing only one side of the coin.
> >> > > >> That logic is there mainly to ensure fairness between deep seeky
> >> > > >> processes and normal seeky processes that want low latency.
> >> > > > I didn't understand that. The logic doesn't protect non-deep processes,
> >> > > > so how could it give the normal seeky process low latency? Or do you
> >> > > > have a test case for this, so I can analyze it?
> >> > > > I tried a workload with one task driving depth 4 and one task driving
> >> > > > depth 16. The behavior appears unchanged with or without the logic.
> >> > Sorry for the delay.
> >> >
> >> > > Try a workload with one shallow seeky queue and one deep (16) one, on
> >> > > a single-spindle NCQ disk.
> >> > > I think the behaviour when I submitted my patch was that both were
> >> > > getting a 100ms slice (if this is not happening, probably some
> >> > > subsequent patch broke it).
> >> > > If you remove idling, they will get disk time roughly in proportion
> >> > > 16:1, i.e. pretty unfair.
> >> > I thought you were talking about a workload with one thread at depth 4 and
> >> > the other thread at depth 16, so I ran some tests here. In an old kernel,
> >> > without the deep seeky idle logic, the threads get disk time in proportion
> >> > 1:5; with it, they get almost equal disk time, so this reaches your goal.
> >> > In the latest kernel there is no big difference with or without the logic
> >> > (the depth-16 thread gets about 5x more disk time). With the logic, the
> >> > depth-4 thread gets equal disk time in the first several slices, but after
> >> > an idle expiration (mostly because the current block plug holds requests in
> >> > the task's plug list and doesn't add them to the elevator), the queue never
> >> > gets detected as deep, because the queue dispatches requests one by one.
> >> > So the logic has already been broken for some time (maybe since block
> >> > plugging was added).
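
As a side note, the kind of two-queue depth test described above can be
reproduced with a minimal sketch like the following (hypothetical; the thread
does not say which tool produced the 1:5 / 5x numbers): each instance keeps a
fixed number of random 4KB O_DIRECT reads in flight with libaio, so running
one instance at depth 4 and one at depth 16 against the same disk approximates
the two queues being compared.

/*
 * Hypothetical depth test, build: gcc -D_GNU_SOURCE -o depthtest depthtest.c -laio
 * Usage: ./depthtest <device> <depth> <nr_requests>
 */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(int argc, char **argv)
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <device> <depth> <nr_requests>\n", argv[0]);
                return 1;
        }
        int depth = atoi(argv[2]);
        long total = atol(argv[3]);

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        off_t nr_blocks = lseek(fd, 0, SEEK_END) / BLK;

        io_context_t ctx = 0;
        int ret = io_setup(depth, &ctx);
        if (ret) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

        struct iocb *cbs = calloc(depth, sizeof(*cbs));
        void *buf;
        if (posix_memalign(&buf, BLK, (size_t)depth * BLK)) return 1;

        long submitted = 0, completed = 0;

        /* Fill the pipeline: keep `depth` random reads in flight at all times. */
        for (int i = 0; i < depth && submitted < total; i++, submitted++) {
                io_prep_pread(&cbs[i], fd, (char *)buf + i * BLK, BLK,
                              (long long)(random() % nr_blocks) * BLK);
                struct iocb *p = &cbs[i];
                io_submit(ctx, 1, &p);
        }

        while (completed < submitted) {
                struct io_event ev;
                if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                        break;
                completed++;
                if (submitted < total) {
                        /* Refill immediately so the in-flight count stays at `depth`. */
                        struct iocb *cb = ev.obj;
                        io_prep_pread(cb, fd, cb->u.c.buf, BLK,
                                      (long long)(random() % nr_blocks) * BLK);
                        io_submit(ctx, 1, &cb);
                        submitted++;
                }
        }
        printf("completed %ld random reads at depth %d\n", completed, depth);
        io_destroy(ctx);
        close(fd);
        free(cbs);
        free(buf);
        return 0;
}

Comparing the disk time each instance gets (e.g. with blktrace) with and
without the deep-queue idle logic is essentially the experiment whose numbers
are quoted above.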
> >> Could it be that dispatching requests one by one is harming the
> >> performance, then?
> > Not really. Say 4 requests are running; the task dispatches a new request
> > after a previous one completes. Requests are dispatched one by one, but
> > there are still 4 requests in flight at any time. Checking the in-flight
> > requests is more precise for the deep detection.
> >
> What happens if there are 4 tasks, each of which could dispatch 4 requests
> in parallel? Will we reach and sustain 16 in-flight requests, or will it
> bounce around 4 in flight? I think here we could see a big difference.
Ah, yes, we really should change

        if (cfqq->queued[0] + cfqq->queued[1] >= 4)
                cfq_mark_cfqq_deep(cfqq);

to

        if (cfqq->queued[0] + cfqq->queued[1] + cfqq->dispatched >= 4)
                cfq_mark_cfqq_deep(cfqq);

This is a bug, though it isn't related to the original raid issue.
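
For context, a toy userspace model (hypothetical, not kernel code) of why the
extra cfqq->dispatched term matters: once a request is dispatched to the
driver it is no longer counted in queued[], so a task that keeps 4 requests in
flight while queueing them one at a time never satisfies the old check, even
though the device always sees 4 outstanding requests.

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: "queued" stands in for queued[0]+queued[1], "dispatched"
 * for cfqq->dispatched (sent to the driver, not yet completed). */
struct toy_cfqq {
        int queued;
        int dispatched;
        bool deep;
};

static void mark_deep_old(struct toy_cfqq *q) { if (q->queued >= 4) q->deep = true; }
static void mark_deep_new(struct toy_cfqq *q) { if (q->queued + q->dispatched >= 4) q->deep = true; }

int main(void)
{
        struct toy_cfqq oldq = { 0 }, newq = { 0 };

        /* Ramp-up: the task fills its window of 4, one request at a time,
         * each being dispatched as soon as it is queued. */
        for (int i = 0; i < 4; i++) {
                oldq.queued++; newq.queued++;                   /* request enters the elevator */
                mark_deep_old(&oldq); mark_deep_new(&newq);     /* the check under discussion */
                oldq.queued--; oldq.dispatched++;               /* immediately dispatched */
                newq.queued--; newq.dispatched++;
        }
        /* Steady state: one request completes, the task sends the next one. */
        for (int i = 0; i < 4; i++) {
                oldq.dispatched--; newq.dispatched--;           /* completion */
                oldq.queued++; newq.queued++;                   /* next request queued */
                mark_deep_old(&oldq); mark_deep_new(&newq);
                oldq.queued--; oldq.dispatched++;               /* dispatched again */
                newq.queued--; newq.dispatched++;
        }
        printf("old check marks deep: %d, new check marks deep: %d\n",
               oldq.deep, newq.deep);
        return 0;
}

With the old check the queue is never marked deep even at a sustained depth of
4, matching the "never gets detected as deep" observation above; counting
->dispatched restores the intended meaning of the depth test.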

> Probably it is better to move the deep queue detection logic into the
> per-task queue?
It's already per-task queue currently.

> Then cfq will decide if it should dispatch a few requests from every
> task (shallow case) or all requests from a single task (deep), and
> then idle.
I don't get your point. Do you mean doing the deep detection considering all
tasks together?

Thanks,
Shaohua
