Message-ID: <Zk8bgS8IboS-7jQw@localhost.localdomain>
Date: Thu, 23 May 2024 12:33:37 +0200
From: Oscar Salvador <osalvador@...e.de>
To: Michal Hocko <mhocko@...e.com>
Cc: Oscar Salvador <OSalvador@...e.com>, Peter Xu <peterx@...hat.com>,
cve@...nel.org, linux-kernel@...r.kernel.org,
linux-cve-announce@...r.kernel.org,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: CVE-2024-36000: mm/hugetlb: fix missing hugetlb_lock for resv
uncharge
On Thu, May 23, 2024 at 09:30:59AM +0200, Michal Hocko wrote:
> Let me add Oscar,
Thanks
>
> On Tue 21-05-24 15:38:45, Peter Xu wrote:
> > That commit mentioned that we rely on the lock to make sure all hugetlb
> > folios on the active list will have a valid memcg. However I'm not sure
> > whether it's still required now (after all that's 2012..), e.g., I'm
> > looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
> > looks all safe to even take empty memcg folios with the latest code at
> > least:
> >
> > /*
> > * We can have pages in active list without any cgroup
> > * ie, hugepage with less than 3 pages. We can safely
> > * ignore those pages.
> > */
> > if (!page_hcg || page_hcg != h_cg)
> > goto out;
Ok, I had a look at hugetlb_cgroup implementation.
First time looking at that code, so bear with me.
I looked back at

  commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8 (HEAD)
  Author: Aneesh Kumar K.V <aneesh.kumar@...ux.vnet.ibm.com>
  Date:   Tue Jul 31 16:42:35 2012 -0700

      hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.

to understand why the lock was needed.
On the theoretical side, we could have had:
  CPU0                                  CPU1
  dequeue_huge_page_vma
   dequeue_huge_page_node
    move_page_to_active_list
  release_lock
                                        hugetlb_cgroup_pre_destroy
                                         for_each_page_in_active_list
                                          <-- pages with empty cgroups are skipped -->
                                          hugetlb_cgroup_move_parent
                                           move_page_to_parent
  hugetlb_cgroup_commit_charge  <-- too late
   page[2].lru.next = (void *)h_cg;
So, the above page should have been moved to the parent, but since by the time
we were checking the active list this page did not have any cgroup attached to
it yet, it was skipped.
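The interleaving above can be modelled in plain userspace C. This is only a
rough sketch of the ordering problem; all struct and function names here are
invented for illustration and are not the real kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct cgroup { int dead; };
struct folio  { struct cgroup *h_cg; int on_active_list; };

/* CPU0, step 1: in the old ordering the folio landed on the active
 * list while its cgroup pointer was still NULL, and the lock was
 * released before the charge was committed. */
static void dequeue_folio(struct folio *f)
{
	f->on_active_list = 1;
}

/* CPU1: the pre-destroy walk skips folios with no cgroup attached,
 * mirroring the "if (!page_hcg || page_hcg != h_cg)" check quoted
 * above. */
static void move_parent_walk(struct folio *f, struct cgroup *h_cg)
{
	if (!f->on_active_list)
		return;
	if (!f->h_cg || f->h_cg != h_cg)
		return;			/* skipped: looks uncharged */
	f->h_cg = NULL;			/* would be reparented here */
}

/* CPU0, step 2: the charge lands only after the walk -- too late. */
static void commit_charge(struct folio *f, struct cgroup *h_cg)
{
	f->h_cg = h_cg;
}
```

Running the three steps in the racy order leaves the folio charged to a cgroup
the walk has already finished with: exactly the page that should have been
moved to the parent but was skipped.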
Notice I said theoretical, because I noticed that
cgroup_call_pre_destroy()->hugetlb_cgroup_pre_destroy() is called from
cgroup_rmdir(), and I am not sure under which circumstances cgroup_rmdir()
can succeed (has the cgroup refcount dropped to 0?).
Now on to the current code.
After Peter's fix, we have:
  CPU0                                  CPU1
  dequeue_huge_page_vma
   dequeue_hugetlb_folio_node_exact
    move_page_to_active_list
    hugetlb_cgroup_commit_charge
     : folio->_hugetlb_cgroup = cgroup
    hugetlb_cgroup_commit_charge_rsvd
     : folio->_hugetlb_cgroup_rsvd = cgroup
  release lock
                                        hugetlb_cgroup_css_offline
                                         take lock
                                         for_each_page_in_active_list
                                          hugetlb_cgroup_move_parent
                                           : set folio->_hugetlb_cgroup
                                           : _hugetlb_cgroup_rsvd is not set,
                                           :  still contains the old value
                                         release lock
  __hugetlb_cgroup_uncharge_folio_rsvd
   hugetlb_cgroup_from_folio_rsvd
   ...
   folio->_hugetlb_cgroup_rsvd = NULL
So, as you can see, we charged a folio to the parent, but this folio still
contains the previous _hugetlb_cgroup_rsvd, when it should have been NULLed.
All this means that since no page_counter_uncharge() was called for
_hugetlb_cgroup_rsvd, the parent still holds the old page_counter charges
for cgroup_rsvd.
Then, if someone allocates a new cgroup under this parent before
hugetlb_cgroup_from_folio_rsvd() gets called, it will get the page_counters
from both _hugetlb_cgroup and _hugetlb_cgroup_rsvd.
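If it helps, the stale-pointer part condenses to a few lines of C. This is a
hypothetical model of what the move-to-parent step does to the two fields
(names simplified; the real fields are folio->_hugetlb_cgroup and
folio->_hugetlb_cgroup_rsvd):

```c
#include <assert.h>
#include <stddef.h>

struct hugetlb_cgroup { struct hugetlb_cgroup *parent; };

struct folio {
	struct hugetlb_cgroup *hcg;	 /* _hugetlb_cgroup      */
	struct hugetlb_cgroup *hcg_rsvd; /* _hugetlb_cgroup_rsvd */
};

/* Model of the move-to-parent step: only the non-reservation pointer
 * is reparented; the _rsvd pointer is left holding the old value. */
static void move_parent(struct folio *f, struct hugetlb_cgroup *h_cg)
{
	if (f->hcg == h_cg)
		f->hcg = h_cg->parent;
	/* f->hcg_rsvd deliberately untouched: it stays stale until
	 * the later rsvd uncharge finally NULLs it. */
}
```

After move_parent() the folio points at the parent for the normal charge but
still at the offlined child for the reservation charge.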
And then page_counter_set_max() gets called for both counters.
What does it imply for that new cgroup to have those page_counters?
I do not really know.
Maybe it is not allowed to allocate a new page when it should be? Or it looks
like it has more memory reserved than it actually has?
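For the accounting side, a toy model of a hierarchical counter (in the spirit
of the kernel's struct page_counter, but not the real implementation) shows
why a skipped uncharge matters: charges propagate to every ancestor, so an
uncharge that never runs leaves the parent's usage inflated for good.

```c
#include <assert.h>
#include <stddef.h>

/* Toy hierarchical counter: every charge/uncharge walks up to the
 * root, as hierarchical cgroup accounting does. */
struct counter { long usage; struct counter *parent; };

static void charge(struct counter *c, long n)
{
	for (; c; c = c->parent)
		c->usage += n;
}

static void uncharge(struct counter *c, long n)
{
	for (; c; c = c->parent)
		c->usage -= n;
}
```

If the rsvd uncharge against the offlined child never happens, the charge made
at allocation time stays in the parent's usage forever, which would look like
permanently over-reserved memory from the parent's point of view.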
--
Oscar Salvador
SUSE Labs