linux-kernel - Re: [PATCH 6/10] mm/memcg: take care over pc->mem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4F433418.3010401@openvz.org>
Date:	Tue, 21 Feb 2012 10:05:12 +0400
From:	Konstantin Khlebnikov <khlebnikov@...nvz.org>
To:	Hugh Dickins <hughd@...gle.com>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Ying Han <yinghan@...gle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup

Hugh Dickins wrote:
> page_relock_lruvec() is using lookup_page_cgroup(page)->mem_cgroup
> to find the memcg, and hence its per-zone lruvec for the page.  We
> therefore need to be careful to see the right pc->mem_cgroup: where
> is it updated?
>
> In __mem_cgroup_commit_charge(), under lruvec lock whenever lru
> care might be needed, lrucare holding the page off lru at that time.
>
> In mem_cgroup_reset_owner(), not under lruvec lock, but before the
> page can be visible to others - except compaction or lumpy reclaim,
> which ignore the page because it is not yet PageLRU.
>
> In mem_cgroup_split_huge_fixup(), always under lruvec lock.
>
> In mem_cgroup_move_account(), which holds several locks, but an
> lruvec lock not among them: yet it still appears to be safe, because
> the page has been taken off its old lru and not yet put on the new.
>
> Be particularly careful in compaction's isolate_migratepages() and
> vmscan's lumpy handling in isolate_lru_pages(): those approach the
> page by its physical location, and so can encounter pages which
> would not be found by any logical lookup.  For those cases we have
> to change __isolate_lru_page() slightly: it must leave ClearPageLRU
> to the caller, because compaction and lumpy cannot safely interfere
> with a page until they have first isolated it and then locked lruvec.

Yeah, this is most complicated part. I found one race here, see below.

>
> To the list above we have to add __mem_cgroup_uncharge_common(),
> and new function mem_cgroup_reset_uncharged_to_root(): the first
> resetting pc->mem_cgroup to root_mem_cgroup when a page off lru is
> uncharged, and the second when an uncharged page is taken off lru
> (which used to be achieved implicitly with the PageAcctLRU flag).
>
> That's because there's a remote risk that compaction or lumpy reclaim
> will spy a page while it has PageLRU set; then it's taken off LRU and
> freed, its mem_cgroup torn down and freed, the page reallocated (so
> get_page_unless_zero again succeeds); then compaction or lumpy reclaim
> reach their page_relock_lruvec, using the stale mem_cgroup for locking.
>
> So long as there's one charge on the mem_cgroup, or a page on one of
> its lrus, mem_cgroup_force_empty() cannot succeed and the mem_cgroup
> cannot be destroyed.  But when an uncharged page is taken off lru,
> or a page off lru is uncharged, it no longer protects its old memcg,
> and the one stable root_mem_cgroup must then be used for it.
>
> Signed-off-by: Hugh Dickins<hughd@...gle.com>
> ---
>   include/linux/memcontrol.h |    5 ++
>   mm/compaction.c            |   36 ++++++-----------
>   mm/memcontrol.c            |   45 +++++++++++++++++++--
>   mm/swap.c                  |    2
>   mm/vmscan.c                |   73 +++++++++++++++++++++++++----------
>   5 files changed, 114 insertions(+), 47 deletions(-)
>
> --- mmotm.orig/include/linux/memcontrol.h       2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/include/linux/memcontrol.h    2012-02-18 11:57:49.103524745 -0800
> @@ -65,6 +65,7 @@ extern int mem_cgroup_cache_charge(struc
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
>   extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
> +extern void mem_cgroup_reset_uncharged_to_root(struct page *);
>
>   /* For coalescing uncharge for reducing memcg' overhead*/
>   extern void mem_cgroup_uncharge_start(void);
> @@ -251,6 +252,10 @@ static inline void mem_cgroup_update_lru
>   {
>   }
>
> +static inline void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +}
> +
>   static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>   {
>          return NULL;
> --- mmotm.orig/mm/compaction.c  2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/mm/compaction.c       2012-02-18 11:57:49.103524745 -0800
> @@ -356,28 +356,6 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>                  }
>
> -               if (!lruvec) {
> -                       /*
> -                        * We do need to take the lock before advancing to
> -                        * check PageLRU etc., but there's no guarantee that
> -                        * the page we're peeking at has a stable memcg here.
> -                        */
> -                       lruvec =&zone->lruvec;
> -                       lock_lruvec(lruvec);
> -               }
> -               if (!PageLRU(page))
> -                       continue;
> -
> -               /*
> -                * PageLRU is set, and lru_lock excludes isolation,
> -                * splitting and collapsing (collapsing has already
> -                * happened if PageLRU is set).
> -                */
> -               if (PageTransHuge(page)) {
> -                       low_pfn += (1<<  compound_order(page)) - 1;
> -                       continue;
> -               }
> -
>                  if (!cc->sync)
>                          mode |= ISOLATE_ASYNC_MIGRATE;
>
> @@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>
>                  page_relock_lruvec(page,&lruvec);

Here race with mem_cgroup_move_account() we hold lock for old lruvec,
while move_account() recharge page and put page back into other lruvec.
Thus we see PageLRU(), but below we isolate page from wrong lruvec.

In my patch-set this is fixed with __wait_lru_unlock() [ spin_unlock_wait() ]
in mem_cgroup_move_account()

> +               if (unlikely(!PageLRU(page) || PageUnevictable(page) ||
> +                                               PageTransHuge(page))) {
> +                       /*
> +                        * lru_lock excludes splitting a huge page,
> +                        * but we cannot hold lru_lock while freeing page.
> +                        */
> +                       low_pfn += (1<<  compound_order(page)) - 1;
> +                       unlock_lruvec(lruvec);
> +                       lruvec = NULL;
> +                       put_page(page);
> +                       continue;
> +               }
>
>                  VM_BUG_ON(PageTransCompound(page));
>
>                  /* Successfully isolated */
> +               ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_lru(page));
>                  list_add(&page->lru, migratelist);
>                  cc->nr_migratepages++;
> --- mmotm.orig/mm/memcontrol.c  2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/memcontrol.c       2012-02-18 11:57:49.107524745 -0800
> @@ -1069,6 +1069,33 @@ void page_relock_lruvec(struct page *pag
>          *lruvp = lruvec;
>   }
>
> +void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +       struct page_cgroup *pc;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       VM_BUG_ON(PageLRU(page));
> +
> +       /*
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
> +        *
> +        * __page_cache_release() and release_pages() may be called at
> +        * interrupt time: we cannot lock_page_cgroup() then (we might
> +        * have interrupted a section with page_cgroup already locked),
> +        * nor do we need to since the page is frozen and about to be freed.
> +        */
> +       pc = lookup_page_cgroup(page);
> +       if (page_count(page))
> +               lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc)&&  pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
> +       if (page_count(page))
> +               unlock_page_cgroup(pc);
> +}
> +
>   /**
>    * mem_cgroup_update_lru_size - account for adding or removing an lru page
>    * @lruvec: mem_cgroup per zone lru vector
> @@ -2865,6 +2892,7 @@ __mem_cgroup_uncharge_common(struct page
>          struct mem_cgroup *memcg = NULL;
>          unsigned int nr_pages = 1;
>          struct page_cgroup *pc;
> +       struct lruvec *lruvec;
>          bool anon;
>
>          if (mem_cgroup_disabled())
> @@ -2884,6 +2912,7 @@ __mem_cgroup_uncharge_common(struct page
>          if (unlikely(!PageCgroupUsed(pc)))
>                  return NULL;
>
> +       lruvec = page_lock_lruvec(page);
>          lock_page_cgroup(pc);
>
>          memcg = pc->mem_cgroup;
> @@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
>          mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
>
>          ClearPageCgroupUsed(pc);
> +
>          /*
> -        * pc->mem_cgroup is not cleared here. It will be accessed when it's
> -        * freed from LRU. This is safe because uncharged page is expected not
> -        * to be reused (freed soon). Exception is SwapCache, it's handled by
> -        * special functions.
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
>           */
> +       if (!PageLRU(page)&&  pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
>
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
> +
>          /*
>           * even after unlock, we have memcg->res.usage here and this memcg
>           * will never be freed.
> @@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page
>
>   unlock_out:
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
>          return NULL;
>   }
>
> @@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
>           * the first time, i.e. during boot or memory hotplug;
>           * or when mem_cgroup_disabled().
>           */
> -       if (likely(pc)&&  PageCgroupUsed(pc))
> +       if (!pc || PageCgroupUsed(pc))
> +               return pc;
> +       if (pc->mem_cgroup&&  pc->mem_cgroup != root_mem_cgroup)
>                  return pc;
>          return NULL;
>   }
> --- mmotm.orig/mm/swap.c        2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/swap.c     2012-02-18 11:57:49.107524745 -0800
> @@ -52,6 +52,7 @@ static void __page_cache_release(struct
>                  lruvec = page_lock_lruvec(page);
>                  VM_BUG_ON(!PageLRU(page));
>                  __ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  unlock_lruvec(lruvec);
>          }
> @@ -583,6 +584,7 @@ void release_pages(struct page **pages,
>                          page_relock_lruvec(page,&lruvec);
>                          VM_BUG_ON(!PageLRU(page));
>                          __ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  }
>
> --- mmotm.orig/mm/vmscan.c      2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/vmscan.c   2012-02-18 11:57:49.107524745 -0800
> @@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page
>
>          if (likely(get_page_unless_zero(page))) {
>                  /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> +                * Beware of interface change: now leave ClearPageLRU(page)
> +                * to the caller, because memcg's lumpy and compaction
> +                * cases (approaching the page by its physical location)
> +                * may not have the right lru_lock yet.
>                   */
> -               ClearPageLRU(page);
>                  ret = 0;
>          }
>
> @@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u
>
>                  switch (__isolate_lru_page(page, mode, file)) {
>                  case 0:
> +#ifdef CONFIG_DEBUG_VM
> +                       /* check lock on page is lock we already got */
> +                       page_relock_lruvec(page,&lruvec);
> +                       BUG_ON(lruvec != home_lruvec);
> +                       BUG_ON(page != lru_to_page(src));
> +                       BUG_ON(page_lru(page) != lru);
> +#endif
> +                       ClearPageLRU(page);
>                          isolated_pages = hpage_nr_pages(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
>                          list_move(&page->lru, dst);
>                          nr_taken += isolated_pages;
> @@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
>                              !PageSwapCache(cursor_page))
>                                  break;
>
> -                       if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> -                               mem_cgroup_page_relock_lruvec(cursor_page,
> -&lruvec);
> -                               isolated_pages = hpage_nr_pages(cursor_page);
> -                               mem_cgroup_update_lru_size(lruvec,
> -                                       page_lru(cursor_page), -isolated_pages);
> -                               list_move(&cursor_page->lru, dst);
> -
> -                               nr_taken += isolated_pages;
> -                               nr_lumpy_taken += isolated_pages;
> -                               if (PageDirty(cursor_page))
> -                                       nr_lumpy_dirty += isolated_pages;
> -                               scan++;
> -                               pfn += isolated_pages - 1;
> -                       } else {
> +                       if (__isolate_lru_page(cursor_page, mode, file) != 0) {
>                                  /*
>                                   * Check if the page is freed already.
>                                   *
> @@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
>                                          continue;
>                                  break;
>                          }
> +
> +                       /*
> +                        * This locking call is a no-op in the non-memcg
> +                        * case, since we already hold the right lru_lock;
> +                        * but it may change the lock in the memcg case.
> +                        * It is then vital to recheck PageLRU (but not
> +                        * necessary to recheck isolation mode).
> +                        */
> +                       mem_cgroup_page_relock_lruvec(cursor_page,&lruvec);
> +
> +                       if (PageLRU(cursor_page)&&
> +                           !PageUnevictable(cursor_page)) {
> +                               ClearPageLRU(cursor_page);
> +                               isolated_pages = hpage_nr_pages(cursor_page);
> +                               mem_cgroup_reset_uncharged_to_root(cursor_page);
> +                               mem_cgroup_update_lru_size(lruvec,
> +                                       page_lru(cursor_page), -isolated_pages);
> +                               list_move(&cursor_page->lru, dst);
> +
> +                               nr_taken += isolated_pages;
> +                               nr_lumpy_taken += isolated_pages;
> +                               if (PageDirty(cursor_page))
> +                                       nr_lumpy_dirty += isolated_pages;
> +                               scan++;
> +                               pfn += isolated_pages - 1;
> +                       } else {
> +                               /* Cannot hold lru_lock while freeing page */
> +                               unlock_lruvec(lruvec);
> +                               lruvec = NULL;
> +                               put_page(cursor_page);
> +                               break;
> +                       }
>                  }
>
>                  /* If we break out of the loop above, lumpy reclaim failed */
>                  if (pfn<  end_pfn)
>                          nr_lumpy_failed++;
>
> -               lruvec = home_lruvec;
> +               if (lruvec != home_lruvec) {
> +                       if (lruvec)
> +                               unlock_lruvec(lruvec);
> +                       lruvec = home_lruvec;
> +                       lock_lruvec(lruvec);
> +               }
>          }
>
>          *nr_scanned = scan;
> @@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
>                          int lru = page_lru(page);
>                          get_page(page);
>                          ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, lru);
>                          ret = 0;
>                  }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/