linux-kernel - Re: [PATCH] io-controller: Add io group reference handling for request

Message-Id: <20090526.203424.39179999.ryov@valinux.co.jp>
Date:	Tue, 26 May 2009 20:34:24 +0900 (JST)
From:	Ryo Tsuruta <ryov@...inux.co.jp>
To:	righi.andrea@...il.com
Cc:	vgoyal@...hat.com, guijianfeng@...fujitsu.com, nauman@...gle.com,
	dpshah@...gle.com, lizf@...fujitsu.com, mikew@...gle.com,
	fchecconi@...il.com, paolo.valente@...more.it,
	jens.axboe@...cle.com, fernando@....ntt.co.jp,
	s-uchida@...jp.nec.com, taka@...inux.co.jp, jmoyer@...hat.com,
	dhaval@...ux.vnet.ibm.com, balbir@...ux.vnet.ibm.com,
	linux-kernel@...r.kernel.org,
	containers@...ts.linux-foundation.org, agk@...hat.com,
	dm-devel@...hat.com, snitzer@...hat.com, m-ikeda@...jp.nec.com,
	akpm@...ux-foundation.org
Subject: Re: [PATCH] io-controller: Add io group reference handling for
 request

Hi Andrea and Vivek,

From: Andrea Righi <righi.andrea@...il.com>
Subject: Re: [PATCH] io-controller: Add io group reference handling for request
Date: Mon, 18 May 2009 16:39:23 +0200

> On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> > On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > > Vivek Goyal wrote:
> > > > > > ...
> > > > > > >  }
> > > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > > >  /*
> > > > > > >   * Find the io group bio belongs to.
> > > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > > + * If "curr" is set, io group is information is searched for current
> > > > > > > + * task and not with the help of bio.
> > > > > > > + *
> > > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > > + * task and not create extra function parameter ?
> > > > > > >   *
> > > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > > - * Fix it.
> > > > > > >   */
> > > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > > -					int create)
> > > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > > +					int create, int curr)
> > > > > > 
> > > > > >   Hi Vivek,
> > > > > > 
> > > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > > >   get iog from bio, otherwise get it from current task.
> > > > > 
> > > > > > Consider also that get_cgroup_from_bio() is much slower than
> > > > > > task_cgroup() and needs to lock/unlock_page_cgroup() in
> > > > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > > 
> > > > 
> > > > True.
> > > > 
> > > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > > other cases IO always occurs in the same context of the current task,
> > > > > and you can use task_cgroup().
> > > > > 
> > > > 
> > > > > Yes, maybe in some cases we can avoid setting the page owner. I will
> > > > > get to it once I have got the functionality going well. In the
> > > > > meantime, if you have a patch for it, that would be great.
> > > > 
> > > > > However, this is true only for page cache pages, for IO generated by
> > > > > anonymous pages (swap) you still need the page tracking functionality
> > > > > both for reads and writes.
> > > > > 
> > > > 
> > > > Right now I am assuming that all the sync IO will belong to task
> > > > submitting the bio hence use task_cgroup() for that. Only for async
> > > > IO, I am trying to use page tracking functionality to determine the owner.
> > > > Look at elv_bio_sync(bio).
> > > > 
> > > > You seem to be saying that there are cases where even for sync IO, we
> > > > can't use submitting task's context and need to rely on page tracking
> > > > > functionality?

I think some kernel threads (e.g., those behind dm-crypt, LVM and md
devices) submit IOs on behalf of the tasks which originated them. When
an IO is submitted from such a kernel thread, we can't use the
submitting task's context to determine which cgroup the IO belongs to.
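
To make the point concrete, here is a minimal userspace sketch of that
decision. It is a model, not kernel code: every name in it (model_task,
model_bio, model_pick_cgroup, ROOT_CGROUP, NO_OWNER) is hypothetical,
the "sync" flag only stands in for elv_bio_sync(bio), and falling back
to the root group when no owner was recorded is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define ROOT_CGROUP 0
#define NO_OWNER    (-1)

struct model_task {
	int  cgroup_id;   /* cgroup the submitting task runs in */
	bool is_kthread;  /* true for a dm-crypt/md style worker thread */
};

struct model_bio {
	bool sync;           /* stands in for elv_bio_sync(bio) */
	int  page_owner_id;  /* cgroup recorded by page tracking, or NO_OWNER */
};

/* Decide which cgroup a bio should be charged to. */
int model_pick_cgroup(const struct model_bio *bio,
		      const struct model_task *submitter)
{
	assert(bio != NULL && submitter != NULL);

	/* Sync IO from an ordinary task: the submitter is the originator. */
	if (bio->sync && !submitter->is_kthread)
		return submitter->cgroup_id;

	/* Async IO, or IO relayed by a kernel thread: use page tracking. */
	if (bio->page_owner_id != NO_OWNER)
		return bio->page_owner_id;

	/* No owner recorded: fall back to the root group (an assumption). */
	return ROOT_CGROUP;
}
```

In this model a sync bio relayed by a kernel thread is charged to the
tracked page owner rather than to the thread itself, which is the case
that the submitting task's context alone gets wrong.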

> > > > In case of getting page (read) from swap, will it not happen
> > > > in the context of process who will take a page fault and initiate the
> > > > swap read?
> > > 
> > > No, for example in read_swap_cache_async():
> > > 
> > > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  		 */
> > >  		__set_page_locked(new_page);
> > >  		SetPageSwapBacked(new_page);
> > > +		blkio_cgroup_set_owner(new_page, current->mm);
> > >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > >  		if (likely(!err)) {
> > >  			/*
> > > 
> > > This is a read, but the current task is not always the owner of this
> > > swap cache page, because it's a readahead operation.
> > > 
> > 
> > But will this readahead not be initiated in the context of the task
> > taking the page fault?
> > 
> > handle_pte_fault()
> > 	do_swap_page()
> > 		swapin_readahead()
> > 			read_swap_cache_async()
> > 
> > If yes, then swap reads issued will still be in the context of process and
> > we should be fine?
> 
> Right. I was trying to say that the current task may swap-in also pages
> belonging to a different task, so from a certain point of view it's not
> so fair to charge the current task for the whole activity. But ok, I
> think it's a minor issue.
> 
> > 
> > > Anyway, this is a minor corner case I think. And probably it is safe to
> > > consider this like any other read IO and get rid of the
> > > blkio_cgroup_set_owner().
> > 
> > Agreed.
> > 
> > > 
> > > I wonder if it would be better to attach the blkio_cgroup to the
> > > anonymous page only when swap-out occurs.
> > 
> > Swap seems to be an interesting case in general. Somebody raised this
> > question on lwn io controller article also. A user process never asked
> > for swap activity. It is something enforced by kernel. So while doing
> > some swap outs, it does not seem too fair to charge the write out to
> > the process page belongs to and the fact of the matter may be that there
> > is some other memory hungry application which is forcing these swap outs.
> > 
> > Keeping this in mind, should swap activity be considered as system
> > activity and be charged to root group instead of to user tasks in other
> > cgroups?
> 
> In this case I assume the swap-in activity should be charged to the root
> cgroup as well.
> 
> Anyway, in the logic of the memory and swap control it would seem
> reasonable to provide IO separation also for the swap IO activity.
> 
> In the MEMHOG example, it would be unfair if the memory pressure is
> caused by a task in another cgroup, but with memory and swap isolation a
> memory pressure condition can only be caused by a memory hog that runs
> in the same cgroup. From this point of view it seems more fair to
> consider the swap activity as the particular cgroup IO activity, instead
> of charging always the root cgroup.
> 
> Otherwise, I suspect, memory pressure would be a simple way to blow away
> any kind of QoS guarantees provided by the IO controller.
> 
> >   
> > > I mean, just put the
> > > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > > the IO generated by direct reclaim of anon memory. For all the other
> > > cases we can simply use the submitting task's context.

I think that putting the hook only in try_to_unmap() doesn't work
correctly, because the IOs will then be charged to the reclaiming
processes or kswapd. These IOs should be charged to the processes which
cause the memory pressure.
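
A small userspace model of the difference, again only a sketch with
hypothetical names (model_anon_page, model_swap_out, the model_hook
enum, MODEL_NR_CGROUPS). It compares recording the owner at unmap time,
where "current" is the reclaimer, with recording it when the page was
first touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MODEL_NR_CGROUPS 4

enum model_hook {
	MODEL_HOOK_AT_UNMAP,  /* owner = task running reclaim (e.g. kswapd) */
	MODEL_HOOK_AT_ALLOC   /* owner = task that first touched the page */
};

struct model_anon_page {
	int alloc_cgroup;  /* cgroup of the task that first touched the page */
};

/* Charge one swap-out write per page to the cgroup the hook would record. */
void model_swap_out(const struct model_anon_page *pages, size_t n,
		    int reclaimer_cgroup, enum model_hook hook,
		    unsigned int charges[MODEL_NR_CGROUPS])
{
	assert(pages != NULL && charges != NULL);
	assert(reclaimer_cgroup >= 0 && reclaimer_cgroup < MODEL_NR_CGROUPS);

	for (size_t i = 0; i < n; i++) {
		int owner = (hook == MODEL_HOOK_AT_UNMAP)
				? reclaimer_cgroup
				: pages[i].alloc_cgroup;

		assert(owner >= 0 && owner < MODEL_NR_CGROUPS);
		charges[owner]++;
	}
}
```

With the unmap-only hook every write lands on the reclaimer's group,
whatever group the pages came from. Note that neither variant in this
model identifies the task which actually causes the memory pressure.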

> > > BTW, O_DIRECT is another case that is possible to optimize, because all
> > > the bios generated by direct IO occur in the same context of the current
> > > task.
> > 
> > Agreed about the direct IO optimization.
> > 
> > Ryo, what do you think? Would you like to include these optimizations
> > by Andrea in the next version of the IO tracking patches?
> >  
> > Thanks
> > Vivek
> 
> Thanks,
> -Andrea

Thanks,
Ryo Tsuruta