linux-kernel - Re: [RFD] I/O scheduling in blk-mq

Message-Id: <16E20428-7E4D-41AC-ADD9-738125713624@linaro.org>
Date:   Fri, 30 Sep 2016 08:18:27 +0200
From:   Paolo Valente <paolo.valente@...aro.org>
To:     Paolo Valente <paolo.valente@...aro.org>
Cc:     Omar Sandoval <osandov@...ndov.com>, Jens Axboe <axboe@...nel.dk>,
        Tejun Heo <tj@...nel.org>,
        Christoph Hellwig <hch@...radead.org>,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        Ulf Hansson <ulf.hansson@...aro.org>,
        Linus Walleij <linus.walleij@...aro.org>, broonie@...nel.org
Subject: Re: [RFD] I/O scheduling in blk-mq

Hi Omar,
have you had a chance to look at these last questions of mine?

Thanks,
Paolo

> On 31 Aug 2016, at 17:20, Paolo Valente <paolo.valente@...aro.org> wrote:
> 
> 
> On 8 Aug 2016, at 22:09, Omar Sandoval <osandov@...ndov.com> wrote:
> 
>> On Mon, Aug 08, 2016 at 04:09:56PM +0200, Paolo wrote:
>>> Hi Jens, Tejun, Christoph, all,
>>> AFAIK blk-mq does not yet feature I/O schedulers. In particular, there
>>> is no scheduler providing strong guarantees in terms of
>>> responsiveness, latency for time-sensitive applications and bandwidth
>>> distribution.
>>> 
>>> For this reason, I'm trying to port BFQ to blk-mq, or to develop
>>> something simpler if even a reduced version of BFQ proves to be too
>>> heavy (this project is supported by Linaro). If you are willing to
>>> provide some feedback in this respect, I would like to ask for
>>> opinions/suggestions on the following two matters, and possibly to
>>> open a more general discussion on I/O scheduling in blk-mq.
>>> 
>>> 1) My idea is to have an independent instance of BFQ, or in general of
>>> the I/O scheduler, executed for each software queue. Then there would
>>> be no global scheduling. The drawback of no global scheduling is that
>>> each process cannot get more than 1/M of the total throughput of the
>>> device, if M is the number of software queues. But, if I'm not
>>> mistaken, it is in any case infeasible to give a process more than 1/M of
>>> the total throughput, without lowering the throughput itself. In fact,
>>> giving a process more than 1/M of the total throughput implies serving
>>> its software queue, say Q, more than the others.  The only way to do
>>> it is periodically stopping the service of the other software queues
>>> and dispatching only the requests in Q. But this would reduce
>>> parallelism, which is the main way blk-mq achieves very high
>>> throughput. Are these considerations, and, in particular, one
>>> independent I/O scheduler per software queue, sensible?
>>> 
>>> 2) To provide per-process service guarantees, an I/O scheduler must
>>> create per-process internal queues. BFQ and CFQ use I/O contexts to
>>> achieve this goal. Is something like that (or exactly the same)
>>> also available in blk-mq? If so, do you have any suggestions, or links to
>>> documentation/code on how to use what is available in blk-mq?
>>> 
>>> Thanks,
>>> Paolo
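
The throughput argument in point 1 above can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not in the original message: each of the M software queues contributes exactly total/M when it is dispatching, and favoring queue Q means that for a fraction f of the time only Q dispatches. The function name and parameters are hypothetical.

```python
def throughput_when_favoring(total, m, f):
    """Toy model of favoring one software queue Q out of m.

    Assumptions (illustrative only): with all m queues dispatching in
    parallel the device delivers `total`; a single queue on its own
    delivers only total / m. For a fraction f of the time the other
    queues are stopped and only Q dispatches.
    Returns (aggregate throughput, Q's share of that aggregate).
    """
    per_queue = total / m
    # Fraction (1 - f): full parallelism. Fraction f: only Q is served.
    aggregate = (1 - f) * total + f * per_queue
    # Q dispatches at per_queue the whole time in this model.
    share_q = per_queue / aggregate
    return aggregate, share_q


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5):
        agg, share = throughput_when_favoring(400.0, 4, f)
        print(f"f={f}: aggregate={agg}, share of Q={share:.2f}")
```

In this model Q's share rises above 1/M only because the aggregate throughput falls, which is the trade-off the paragraph describes.
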
>> 
>> Hi, Paolo,
>> 
>> I've been working on I/O scheduling for blk-mq with Jens for the past
>> few months (splitting time with other small projects), and we're making
>> good progress. Like you noticed, the hard part isn't really grafting a
>> scheduler interface onto blk-mq, it's maintaining good scalability while
>> providing adequate fairness.
>> 
>> We're working towards a scheduler more like deadline and getting the
>> architectural issues worked out. The goal is some sort of fairness
>> across all queues.
> 
> If I'm not mistaken, the requests of a process (the bios after your
> patch) end up in a given software queue basically by chance, i.e.,
> because the process happens to be executed on the core which that
> queue is associated with. If this is true, then the scheduler cannot
> control which queue a request is sent to. So how exactly do you imagine
> the scheduler controlling the global request service order? By
> stopping the service of some queues and letting only the head-of-line
> request(s) of some other queue(s) be dispatched?
> 
> In this respect, I guess that, as of now, it is again chance that
> determines from which software queue the next request to dispatch is
> picked, i.e., it depends on which core the dispatch functions happen
> to be executed on. Is that correct?
> 
>> The scheduler-per-software-queue model won't hold up
>> so well if we have a slower device with an I/O-hungry process on one CPU
>> and an interactive process on another CPU.
>> 
> 
> So, the problem would be that the hungry process eats all the
> bandwidth, and the interactive one never gets served.
> 
> What about the case where both processes are on the same CPU, i.e.,
> where the requests of both processes are on the same software queue?
> How does the scheduler you envisage guarantee a good latency to the
> interactive process in this case? By properly reordering requests
> inside the software queue?
> 
> I'm sorry if my questions are quite silly, or do not make much sense.
> 
> Thanks,
> Paolo
> 
> 
>> The issue I'm working through now is that on blk-mq, we only have as
>> many `struct request`s as the hardware has tags, so on a device with a
>> limited queue depth, it's really hard to do any sort of intelligent
>> scheduling. The solution for that is switching over to working with
>> `struct bio`s in the software queues instead, which abstracts away the
>> hardware capabilities. I have some work in progress at
>> https://github.com/osandov/linux/tree/blk-mq-iosched, but it's not yet
>> at feature-parity.
>> 
>> After that, I'll be back to working on the scheduling itself. The vague
>> idea is to amortize global scheduling decisions, but I don't have much
>> concrete code behind that yet.
>> 
>> Thanks!
>> -- 
>> Omar
> 
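
The limitation Omar describes (only as many `struct request`s as the hardware has tags) can be illustrated with a toy model. This is a sketch under assumptions of my own, not kernel code: the scheduler picks the earliest-deadline item among the items it can see, and with a limited depth it can see only the oldest `visible_depth` pending items. The function name and the (name, deadline) tuples are hypothetical.

```python
def dispatch_order(arrivals, visible_depth=None):
    """Return the order in which items are dispatched.

    arrivals: list of (name, deadline) tuples in arrival order.
    visible_depth: how many of the oldest pending items the scheduler
    can reorder among (a stand-in for a limited number of tags);
    None means every pending item is visible (a stand-in for queuing
    bios in software, independent of the hardware depth).
    """
    pending = list(arrivals)
    order = []
    while pending:
        window = pending if visible_depth is None else pending[:visible_depth]
        # Deadline-like choice restricted to the visible window.
        chosen = min(window, key=lambda item: item[1])
        pending.remove(chosen)
        order.append(chosen[0])
    return order


if __name__ == "__main__":
    arrivals = [("a", 10), ("b", 11), ("c", 12),
                ("d", 13), ("e", 14), ("urgent", 1)]
    print(dispatch_order(arrivals, visible_depth=2))
    print(dispatch_order(arrivals))
```

With a depth of 2, the late-arriving urgent item cannot be chosen until it enters the visible window; when all pending items are visible it is dispatched first.
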