Message-ID: <661de9470904180619k34e7998ch755a2ad3bed9ce5e@mail.gmail.com>
Date:	Sat, 18 Apr 2009 18:49:33 +0530
From:	Balbir Singh <balbir@...ux.vnet.ibm.com>
To:	Vivek Goyal <vgoyal@...hat.com>
Cc:	Andrea Righi <righi.andrea@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>, nauman@...gle.com,
	dpshah@...gle.com, lizf@...fujitsu.com, mikew@...gle.com,
	fchecconi@...il.com, paolo.valente@...more.it,
	jens.axboe@...cle.com, ryov@...inux.co.jp,
	fernando@...ellilink.co.jp, s-uchida@...jp.nec.com,
	taka@...inux.co.jp, guijianfeng@...fujitsu.com,
	arozansk@...hat.com, jmoyer@...hat.com, oz-kernel@...hat.com,
	dhaval@...ux.vnet.ibm.com, linux-kernel@...r.kernel.org,
	containers@...ts.linux-foundation.org, menage@...gle.com,
	peterz@...radead.org
Subject: Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)

On Fri, Apr 17, 2009 at 7:43 PM, Vivek Goyal <vgoyal@...hat.com> wrote:
> On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
>> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
>> > > I think it would be possible to implement both proportional and limiting
>> > > rules at the same level (e.g., the IO scheduler), but we need also to
>> > > address the memory consumption problem (I still need to review your
>> > > patchset in details and I'm going to test it soon :), so I don't know if
>> > > you already addressed this issue).
>> > >
>> >
>> > Can you please elaborate a bit on this? Are you concerned that the data
>> > structures created to solve the problem consume a lot of memory?
>>
>> Sorry I was not very clear here. By memory consumption I mean wasting
>> memory on dirty pages that are hard or slow to reclaim, or on pending IO
>> requests.
>>
>> If there's only a global limit on dirty pages, any cgroup can exhaust
>> that limit and cause other cgroups/processes to block when they try to
>> write to disk.
>>
>> But, ok, the IO controller is probably not the best place to implement
>> such functionality. I should rework the per-cgroup dirty_ratio patchset:
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>>
>> Last time we focused too much on the best interfaces to define dirty
>> pages limit, and I never re-posted an updated version of this patchset.
>> Now I think we can simply provide the same dirty_ratio/dirty_bytes
>> interface that we provide globally, but per cgroup.
>>
>> >
>> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
>> > > in the cgroup that exceeds its limit, how do we avoid the waste of
>> > > memory due to the succeeding IO requests and the increasingly dirty
>> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
>> > > but I think we talked about this problem in a previous email... sorry I
>> > > don't find the discussion in my mail archives.
>> > >
>> > > IMHO a nice approach would be to measure IO consumption at the IO
>> > > scheduler level, and control IO applying proportional weights / absolute
>> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
>> > > time block the tasks from dirtying memory that will generate additional
>> > > IO requests.
>> > >
>> > > Anyway, there's no need to provide this with a single IO controller, we
>> > > could split the problem in two parts: 1) provide a proportional /
>> > > absolute IO controller in the IO schedulers and 2) allow to set, for
>> > > example, a maximum limit of dirty pages for each cgroup.
>> > >
>> >
>> > I think setting a maximum limit on dirty pages is an interesting thought.
>> > It sounds as if the memory controller could handle it?
>>
>> Exactly, same as above.
>
> Thinking more about it. Memory controller can probably enforce the higher
> limit but it would not easily translate into a fixed upper async write
> rate. Till the process hits the page cache limit or is slowed down by
> dirty page writeout, it can get a very high async write BW.
>
> So the memory controller page cache limit will help, but it would not directly
> translate into what the max bw limit patches are doing.
>
> Even if we do max bw control at the IO scheduler level, async writes are
> problematic again. The IO controller will not be able to throttle the process
> until it sees an actual write request. In big memory systems, writeout might
> not happen for some time and till then the process will see a high throughput.
>
> So doing async write throttling at a higher layer and not at the IO scheduler
> layer gives us the opportunity to produce more accurate results.
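
To put a rough number on the async write point above, here is a minimal
sketch. It assumes a simplified model that is not taken from any posted
patch: the page cache starts with no dirty pages, the application dirties
memory at a constant rate, and writeback drains it at a constant rate capped
by the IO scheduler. The function name and its parameters are hypothetical.

```python
def seconds_until_throttled(dirty_limit_mb, app_write_mbps, writeback_mbps):
    """Time the writer runs at full speed before it hits the dirty limit.

    Simplified model (an assumption): dirty pages accumulate at
    (app_write_mbps - writeback_mbps) until dirty_limit_mb is reached.
    """
    if app_write_mbps <= writeback_mbps:
        # Writeback keeps up, so the dirty limit is never reached.
        return float("inf")
    return dirty_limit_mb / (app_write_mbps - writeback_mbps)


# Hypothetical numbers: 1000 MB dirty limit, writer at 110 MB/s,
# device-level cap of 10 MB/s for this group.
window = seconds_until_throttled(1000, 110, 10)
print(f"writer sees full speed for about {window:.1f} s")
```

Under this model the application observes its own write rate, not the
device-level cap, for the whole window, and the window grows with the
amount of memory that may be dirtied.
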
>
> For sync requests, I think IO scheduler max bw control should work fine.
>
> BTW, Andrea, what is the use case of your patches? Andrew had mentioned
> that some people are already using them. I am curious to know whether a
> proportional BW controller will solve the issues/requirements of these
> people, or whether they have a specific requirement for traffic shaping and
> a max bw controller only.
>
> [..]
>> > > > Can you please give a little more detail here regarding how QoS requirements
>> > > > are not met with proportional weight?
>> > >
>> > > With proportional weights the whole bandwidth is allocated if no one
>> > > else is using it. When IO is submitted, other tasks with a higher weight
>> > > can be forced to sleep until the IO generated by the low weight tasks is
>> > > completely dispatched. Or any form of the priority inversion
>> > > problem.
>> >
>> > Hmm..., I am not very sure here. When the admin is allocating the weights, he
>> > has the whole picture. He knows how many groups are contending for the disk
>> > and what the worst case scenario could be. So if I have two groups
>> > A and B with weights 1 and 2 and both are contending, then as an
>> > admin one would expect to get 33% of BW for group A in the worst case (if
>> > group B is continuously backlogged). If B is not contending then A can get
>> > 100% of BW. So while configuring the system, will one not plan for the worst
>> > case (33% for A, and 66% for B)?
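
The worst-case arithmetic above can be sketched in a few lines. This is only
an illustration: the function name worst_case_shares and the flat
(non-hierarchical) group layout are assumptions, not anything taken from the
posted patches.

```python
from fractions import Fraction


def worst_case_shares(weights):
    """Share of disk time per group when every group is continuously backlogged.

    weights maps a group name to its integer weight (flat hierarchy assumed).
    """
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# The example from the mail: group A has weight 1, group B has weight 2.
shares = worst_case_shares({"A": 1, "B": 2})
for name, share in shares.items():
    print(f"group {name}: {float(share):.1%} of disk time in the worst case")

# If B is idle, A is the only contender and gets the whole disk.
alone = worst_case_shares({"A": 1})
```

With both groups backlogged, A gets 1/3 and B gets 2/3 of the disk time; a
group that is the only contender gets all of it.
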
>>
>> OK, I'm quite convinced.. :)
>>
>> To a large degree, if we want to provide a BW reservation strategy we
>> must provide an interface that allows cgroups to ask for time slices
>> such as max/min 5 IO requests every 50ms or something like that.
>> Probably the same functionality can be achieved translating time slices
>> from weights, percentages or absolute BW limits.
>
> Ok, I would like to split it in two parts.
>
> I think providing a minimum guarantee in absolute terms like 5 IO requests
> every 50ms will be very hard because the IO scheduler has no control over
> how many competitors there are. An easier thing will be to have minimum
> guarantees on a share basis. For a minimum BW (disk time slice) guarantee, the
> admin will have to create the right cgroup hierarchy and assign weights
> properly, and then the admin can calculate what % of disk slice a particular
> group will get as a minimum guarantee. (It is more complicated than this, as
> there are time slices which are not accounted to any group. During a queue
> switch cfq starts the time slice counting only after the first request has
> completed, to offset the impact of seeking and I guess also NCQ).
>
> I think it should be possible to give max bandwidth guarantees in absolute
> terms, like io/s or sectors/sec or MB/sec etc, because the only thing the IO
> scheduler has to do is to not allow dispatch from a particular queue if
> it has crossed its limit and then either let the disk idle or move on to the
> next eligible queue.
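
The dispatch rule described above can be sketched as follows. This is a toy
model under stated assumptions, not the CFQ code or any posted patch: limits
are counted in sectors per fixed accounting window, and the class
ThrottledQueue and the function dispatch_window are hypothetical names.

```python
from collections import deque


class ThrottledQueue:
    """A per-group queue with a cap on sectors dispatched per window."""

    def __init__(self, name, max_sectors_per_window):
        self.name = name
        self.limit = max_sectors_per_window
        self.used = 0            # sectors dispatched in the current window
        self.pending = deque()   # request sizes in sectors, FIFO order

    def eligible(self):
        # A queue may dispatch only if its next request fits under the limit.
        return bool(self.pending) and self.used + self.pending[0] <= self.limit


def dispatch_window(queues):
    """Dispatch round-robin for one window; return [(queue name, sectors)]."""
    for q in queues:
        q.used = 0
    dispatched = []
    progress = True
    while progress:
        progress = False
        for q in queues:
            if q.eligible():
                size = q.pending.popleft()
                q.used += size
                dispatched.append((q.name, size))
                progress = True
            # Otherwise skip to the next eligible queue; if none is
            # eligible the loop ends and the disk idles until the next window.
    return dispatched


a = ThrottledQueue("A", max_sectors_per_window=16)
a.pending.extend([8, 8, 8])
b = ThrottledQueue("B", max_sectors_per_window=8)
b.pending.extend([8, 8])
order = dispatch_window([a, b])
print(order)
```

Requests that do not fit stay queued for the next window, so a group over its
limit never blocks the others; it only loses its own dispatch slots.
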
>
> The only issue here will be async writes. A max bw guarantee for async writes
> at the IO scheduler level might not mean much to the application because of
> the page cache.

I see so much of the memory controller coming up. Since we've been
discussing so many of these design points on mail, I wonder if it
makes sense to summarize them somewhere (a wiki?). Would anyone like
to take a shot at it?

Balbir