linux-kernel - Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts

Message-Id: <20080807.162512.22162413.taka@valinux.co.jp>
Date:	Thu, 07 Aug 2008 16:25:12 +0900 (JST)
From:	Hirokazu Takahashi <taka@...inux.co.jp>
To:	kamezawa.hiroyu@...fujitsu.com
Cc:	Balbir Singh <balbir@...ux.vnet.ibm.com>, ryov@...inux.co.jp,
	xen-devel@...ts.xensource.com,
	containers@...ts.linux-foundation.org,
	linux-kernel@...r.kernel.org,
	virtualization@...ts.linux-foundation.org, dm-devel@...hat.com,
	agk@...rceware.org
Subject: Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into
 two parts

Hi,

> >> > This patch splits the cgroup memory subsystem into two parts.
> >> > One is for tracking pages to find out the owners. The other is
> >> > for controlling how much amount of memory should be assigned to
> >> > each cgroup.
> >> > 
> >> > With this patch, you can use the page tracking mechanism even if
> >> > the memory subsystem is off.
> >> > 
> >> > Based on 2.6.27-rc1-mm1
> >> > Signed-off-by: Ryo Tsuruta <ryov@...inux.co.jp>
> >> > Signed-off-by: Hirokazu Takahashi <taka@...inux.co.jp>
> >> > 
> >> 
> >> Please CC me or Balbir or Pavel (See Maintainer list) when you try this ;)
> >> 
> >> After this patch, the total structure is
> >> 
> >>  page <-> page_cgroup <-> bio_cgroup.
> >>  (multiple bio_cgroup can be attached to page_cgroup)
> >>
> >> Will this pointer chain add
> >>   - a significant performance regression or
> >>   - new race conditions
> >> ?
> >
> >I don't think it will cause significant performance loss, because
> >the link between a page and a page_cgroup already exists; the memory
> >resource controller set it up. Bio_cgroup uses this link as it is
> >and does not change it.
> >
> >And the link between page_cgroup and bio_cgroup isn't protected
> >by any additional spin-locks, since the associated bio_cgroup is
> >guaranteed to exist as long as the bio_cgroup owns pages.
> >
> Hmm, I think page_cgroup's cost is visible when
> 1. a page changes to the in-use state (fault or radix-tree insert),
> 2. a page changes to the out-of-use state (fault or radix-tree removal),
> 3. memcg hits its limit or global LRU reclaim runs.
> "1" and "2" can be observed as a 5% loss of exec throughput.
> "3" is not measured (because the LRU walk itself is heavy).
> 
> What new accesses to page_cgroup will you add?
> I'll have to take them into account.

I haven't added any at this moment, but I think some people may want
to move some pages in the page cache from one cgroup to another.
When that time comes, I'll try to minimize the cost: I will probably
only update the link between a page_cgroup and a bio_cgroup and leave
the others untouched.
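
To illustrate the idea, here is a minimal user-space sketch of such a
move. The structures are simplified stand-ins, and the names
bio_cgroup_move_page(), biog, mem_owner and nr_pages are hypothetical,
not taken from the patch. It only shows that the page_cgroup -> bio_cgroup
link can be switched while the memory controller's state stays as it is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct bio_cgroup {
	int id;          /* illustrative identifier */
	long nr_pages;   /* pages currently tracked by this group */
};

struct page_cgroup {
	struct bio_cgroup *biog;  /* owner used for I/O tracking */
	void *mem_owner;          /* memory controller state, never touched here */
};

/* Re-home one page: only the page_cgroup -> bio_cgroup link changes. */
static void bio_cgroup_move_page(struct page_cgroup *pc, struct bio_cgroup *to)
{
	struct bio_cgroup *from = pc->biog;

	assert(to != NULL);
	if (from == to)
		return;              /* already owned by the target group */
	if (from)
		from->nr_pages--;    /* old owner gives up the page */
	to->nr_pages++;          /* new owner picks it up */
	pc->biog = to;
}
```

A real implementation would also need whatever locking or reference
counting protects the link; that is deliberately left out of this sketch.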

> >I've just noticed that most of the overhead comes from the spin-locks
> >when reclaiming the pages inside mem_cgroups and the spin-locks to
> >protect the links between pages and page_cgroups.
> The overhead of the page <-> page_cgroup lock cannot be caught by
> lock_stat now. Do you have numbers?
> But OK, there are too many locks ;(

The problem is that every time the lock is taken, the cache line
holding it has to be flushed from the other CPUs' caches.

> >The latter overhead comes from the policy your team has chosen,
> >that page_cgroup structures are allocated on demand. I still feel
> >this approach doesn't make any sense, because the Linux kernel tries
> >to make use of as many pages as it can, so most of them have to be
> >assigned their related page_cgroup anyway. It would make us happy
> >if page_cgroups were allocated at boot time.
> >
> Now, a multi-sized page cache has been discussed for a long time. If
> that's our direction, on-demand page_cgroup makes sense.

I don't think I can agree to this.
When multi-sized-page-cache is introduced, some data structures will be
allocated to manage multi-sized-pages. I think page_cgroups should be
allocated at the same time. This approach will make things simple.

It seems that the on-demand allocation approach leads not only to
overhead but also to complexity and a lot of race conditions.
If you allocate page_cgroups when allocating page structures,
you can get rid of most of the locks and you don't have to care about
allocation errors of page_cgroups anymore.

It will also give us the flexibility to refer to and update
memcg-related data inside critical sections.
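
A rough user-space sketch of what I mean by allocating up front: one
page_cgroup per page frame, created once, so that finding the entry
for a page is plain index arithmetic. The names mem_map_init(),
pfn_to_page_cgroup() and owner_id are assumptions made for this
illustration only; calloc() stands in for a boot-time allocator.

```c
#include <assert.h>
#include <stdlib.h>

struct page_cgroup {
	int owner_id;   /* hypothetical, simplified payload */
};

struct mem_map {
	size_t nr_pages;
	struct page_cgroup *pc;   /* one entry per page frame, same index */
};

/* Allocate every page_cgroup once, at "boot". Returns 0 or -1. */
static int mem_map_init(struct mem_map *m, size_t nr_pages)
{
	if (nr_pages == 0)
		return -1;
	m->pc = calloc(nr_pages, sizeof(*m->pc));
	if (!m->pc)
		return -1;           /* the only place allocation can fail */
	m->nr_pages = nr_pages;
	return 0;
}

/* Lookup cannot fail and needs no lock just to find the entry. */
static struct page_cgroup *pfn_to_page_cgroup(struct mem_map *m, size_t pfn)
{
	assert(pfn < m->nr_pages);
	return &m->pc[pfn];
}

static void mem_map_destroy(struct mem_map *m)
{
	free(m->pc);
	m->pc = NULL;
	m->nr_pages = 0;
}
```

The trade-off is memory: every page frame pays for its entry whether or
not any cgroup is using it.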

> >> For example, adding a simple function.
> >> ==
> >> int get_page_io_id(struct page *)
> >>  - returns an I/O cgroup ID for this page. If the ID is not found, -1 is
> >>    returned. The ID is not guaranteed to be a valid value. (It can be
> >>    obsolete.)
> >> ==
> >> And just store the cgroup ID in page_cgroup at page allocation.
> >> Then make bio_cgroup independent from page_cgroup,
> >> get the ID if available, and avoid too much pointer walking.
> >
> >I don't think there is any difference between a pointer and an ID.
> >I think this ID is just an encoded version of the pointer.
> >
> An ID can be obsolete; a pointer cannot. Does the memory cgroup have to
> take care of bio cgroup's race conditions? (About race conditions, it's
> already complicated enough.)

Bio-cgroup just expects that the callbacks it prepares are called
when the status of a page_cgroup changes.
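
As a sketch of that contract, assuming a hypothetical hook table (none
of the names page_cgroup_ops, charged, uncharged or bio_tracked_pages
come from the patch): the memory controller changes the state and
invokes the hooks, and bio-cgroup only reacts inside them.

```c
#include <assert.h>
#include <stddef.h>

struct page_cgroup;

/* Hypothetical hook table that a tracker such as bio-cgroup registers. */
struct page_cgroup_ops {
	void (*charged)(struct page_cgroup *pc);
	void (*uncharged)(struct page_cgroup *pc);
};

struct page_cgroup {
	int in_use;
	const struct page_cgroup_ops *ops;
};

/* Memory-controller side: change the state, then notify the tracker. */
static void page_cgroup_charge(struct page_cgroup *pc)
{
	pc->in_use = 1;
	if (pc->ops && pc->ops->charged)
		pc->ops->charged(pc);
}

static void page_cgroup_uncharge(struct page_cgroup *pc)
{
	pc->in_use = 0;
	if (pc->ops && pc->ops->uncharged)
		pc->ops->uncharged(pc);
}

/* Bio-cgroup side: it only keeps its own accounting in the callbacks. */
static int bio_tracked_pages;

static void bio_on_charged(struct page_cgroup *pc)
{
	(void)pc;
	bio_tracked_pages++;
}

static void bio_on_uncharged(struct page_cgroup *pc)
{
	(void)pc;
	bio_tracked_pages--;
}

static const struct page_cgroup_ops bio_cgroup_ops = {
	.charged = bio_on_charged,
	.uncharged = bio_on_uncharged,
};
```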

> To be honest, I think adding a new (4 or 8 bytes) member to struct page and
> recording the bio-control information there is a more straightforward
> approach. But, as you might think, "there is no room".

But only if everyone allows me to add some new members to "struct page".
I think the same thing goes for the memcg you're working on.
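
For reference, a user-space sketch of how such a member could be used,
following the get_page_io_id() description quoted above. The io_id
field, the io_groups table and io_group_lookup() are assumptions for
illustration; "struct page" here is a stand-in, not the kernel's
definition. Since a stored ID can be obsolete, the caller validates it
before use.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IO_GROUPS 4
#define IO_ID_NONE (-1)

/* Stand-in for struct page with the hypothetical extra member. */
struct page {
	int io_id;   /* I/O cgroup ID recorded at allocation, or IO_ID_NONE */
};

struct io_group {
	int live;    /* 0 once the group has been removed */
};

static struct io_group io_groups[MAX_IO_GROUPS];

/* Returns the recorded ID, or -1 if none. The ID may be obsolete. */
static int get_page_io_id(struct page *page)
{
	return page->io_id;
}

/* Resolve an ID; NULL if it is missing, out of range, or obsolete. */
static struct io_group *io_group_lookup(int id)
{
	if (id < 0 || id >= MAX_IO_GROUPS)
		return NULL;
	return io_groups[id].live ? &io_groups[id] : NULL;
}
```

With an ID the validity check moves to lookup time; with a pointer the
owner has to be kept alive for as long as any page refers to it.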


Thank you,
Hirokazu Takahashi.
