linux-kernel - Re: [PATCH] mm/hugetlb: Defer freeing of huge pages if in non-task context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20191216135110.b6fb283ba5551c8cfb22494e@linux-foundation.org>
Date:   Mon, 16 Dec 2019 13:51:10 -0800
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     Waiman Long <longman@...hat.com>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Matthew Wilcox <willy@...radead.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Andi Kleen <ak@...ux.intel.com>,
        Michal Hocko <mhocko@...nel.org>
Subject: Re: [PATCH] mm/hugetlb: Defer freeing of huge pages if in non-task
 context

On Mon, 16 Dec 2019 13:27:39 -0500 Waiman Long <longman@...hat.com> wrote:

> The following lockdep splat was observed when a certain hugetlbfs test
> was run:
> 
> ...
> 
> Both the hugetbl_lock and the subpool lock can be acquired in
> free_huge_page(). One way to solve the problem is to make both locks
> irq-safe. Another alternative is to defer the freeing to a workqueue job.
> 
> This patch implements the deferred freeing by adding a
> free_hpage_workfn() work function to do the actual freeing. The
> free_huge_page() call in a non-task context saves the page to be freed
> in the hpage_freelist linked list in a lockless manner.
> 
> The generic workqueue is used to process the work, but a dedicated
> workqueue can be used instead if it is desirable to have the huge page
> freed ASAP.
>
> ...
>
> @@ -1199,6 +1199,73 @@ void free_huge_page(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  }
>  
> +/*
> + * As free_huge_page() can be called from a non-task context, we have
> + * to defer the actual freeing in a workqueue to prevent potential
> + * hugetlb_lock deadlock.
> + *
> + * free_hpage_workfn() locklessly retrieves the linked list of pages to
> + * be freed and frees them one-by-one. As the page->mapping pointer is
> + * going to be cleared in __free_huge_page() anyway, it is reused as the
> + * next pointer of a singly linked list of huge pages to be freed.
> + */
> +#define NEXT_PENDING	((struct page *)-1)
> +static struct page *hpage_freelist;
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> +	struct page *curr, *next;
> +	int cnt = 0;
> +
> +	do {
> +		curr = xchg(&hpage_freelist, NULL);
> +		if (!curr)
> +			break;
> +
> +		while (curr) {
> +			next = (struct page *)READ_ONCE(curr->mapping);
> +			if (next == NEXT_PENDING) {
> +				cpu_relax();
> +				continue;
> +			}
> +			__free_huge_page(curr);
> +			curr = next;
> +			cnt++;
> +		}
> +	} while (!READ_ONCE(hpage_freelist));
> +
> +	if (!cnt)
> +		return;
> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> +}
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +void free_huge_page(struct page *page)
> +{
> +	/*
> +	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> +	 */
> +	if (!in_task()) {
> +		struct page *next;
> +
> +		page->mapping = (struct address_space *)NEXT_PENDING;
> +		next = xchg(&hpage_freelist, page);
> +		WRITE_ONCE(page->mapping, (struct address_space *)next);

The NEXT_PENDING stuff could do with come commenting, I think.  It's
reasonably obvious, but not obvious enough.  For example, why does the
second write to page->mapping use WRITE_ONCE() but the first does not. 
Please spell out the design, fully.

> +		schedule_work(&free_hpage_work);
> +		return;
> +	}
> +
> +	/*
> +	 * Racing may prevent some deferred huge pages in hpage_freelist
> +	 * from being freed. Check here and call schedule_work() if that
> +	 * is the case.
> +	 */
> +	if (unlikely(hpage_freelist && !work_pending(&free_hpage_work)))
> +		schedule_work(&free_hpage_work);
> +
> +	__free_huge_page(page);
> +}
> +
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  {
>  	INIT_LIST_HEAD(&page->lru);

Otherwise it looks OK to me.  Deferring freeing in this way is
generally lame and gives rise to concerns about memory exhaustion in
strange situations, and to concerns about various memory accounting
stats being logically wrong for short periods.  But we already do this
in (too) many places, so fingers crossed :(