lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a4789a59-d54e-2052-bd4d-3ab44a9726b6@redhat.com>
Date:   Mon, 16 Dec 2019 17:52:17 -0500
From:   Waiman Long <longman@...hat.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Matthew Wilcox <willy@...radead.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Andi Kleen <ak@...ux.intel.com>,
        Michal Hocko <mhocko@...nel.org>
Subject: Re: [PATCH] mm/hugetlb: Defer freeing of huge pages if in non-task
 context

On 12/16/19 4:51 PM, Andrew Morton wrote:
> On Mon, 16 Dec 2019 13:27:39 -0500 Waiman Long <longman@...hat.com> wrote:
>
>> The following lockdep splat was observed when a certain hugetlbfs test
>> was run:
>>
>> ...
>>
>> Both the hugetbl_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe. Another alternative is to defer the freeing to a workqueue job.
>>
>> This patch implements the deferred freeing by adding a
>> free_hpage_workfn() work function to do the actual freeing. The
>> free_huge_page() call in a non-task context saves the page to be freed
>> in the hpage_freelist linked list in a lockless manner.
>>
>> The generic workqueue is used to process the work, but a dedicated
>> workqueue can be used instead if it is desirable to have the huge page
>> freed ASAP.
>>
>> ...
>>
>> @@ -1199,6 +1199,73 @@ void free_huge_page(struct page *page)
>>  	spin_unlock(&hugetlb_lock);
>>  }
>>  
>> +/*
>> + * As free_huge_page() can be called from a non-task context, we have
>> + * to defer the actual freeing in a workqueue to prevent potential
>> + * hugetlb_lock deadlock.
>> + *
>> + * free_hpage_workfn() locklessly retrieves the linked list of pages to
>> + * be freed and frees them one-by-one. As the page->mapping pointer is
>> + * going to be cleared in __free_huge_page() anyway, it is reused as the
>> + * next pointer of a singly linked list of huge pages to be freed.
>> + */
>> +#define NEXT_PENDING	((struct page *)-1)
>> +static struct page *hpage_freelist;
>> +
>> +static void free_hpage_workfn(struct work_struct *work)
>> +{
>> +	struct page *curr, *next;
>> +	int cnt = 0;
>> +
>> +	do {
>> +		curr = xchg(&hpage_freelist, NULL);
>> +		if (!curr)
>> +			break;
>> +
>> +		while (curr) {
>> +			next = (struct page *)READ_ONCE(curr->mapping);
>> +			if (next == NEXT_PENDING) {
>> +				cpu_relax();
>> +				continue;
>> +			}
>> +			__free_huge_page(curr);
>> +			curr = next;
>> +			cnt++;
>> +		}
>> +	} while (!READ_ONCE(hpage_freelist));
>> +
>> +	if (!cnt)
>> +		return;
>> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
>> +}
>> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
>> +
>> +void free_huge_page(struct page *page)
>> +{
>> +	/*
>> +	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
>> +	 */
>> +	if (!in_task()) {
>> +		struct page *next;
>> +
>> +		page->mapping = (struct address_space *)NEXT_PENDING;
>> +		next = xchg(&hpage_freelist, page);
>> +		WRITE_ONCE(page->mapping, (struct address_space *)next);
> The NEXT_PENDING stuff could do with come commenting, I think.  It's
> reasonably obvious, but not obvious enough.  For example, why does the
> second write to page->mapping use WRITE_ONCE() but the first does not. 
> Please spell out the design, fully.

Sure. The idea is that the setting of the next pointer and the writing
to hpage_freelist cannot be done atomically without using a lock. Before
xchg(), the page isn't visible to a concurrent work function. So no
special write is needed, the mb() in xchg will ensure that the
page->mapping will be visible to all. After the xchg, page->mapping is
subjected to concurrent access. So WRITE_ONCE() is used to make sure
that is no write tearing.

I will update the patch with more comment once I gather other feedbacks
from other reviewers.

>
>> +		schedule_work(&free_hpage_work);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Racing may prevent some deferred huge pages in hpage_freelist
>> +	 * from being freed. Check here and call schedule_work() if that
>> +	 * is the case.
>> +	 */
>> +	if (unlikely(hpage_freelist && !work_pending(&free_hpage_work)))
>> +		schedule_work(&free_hpage_work);
>> +
>> +	__free_huge_page(page);
>> +}
>> +
>>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>>  {
>>  	INIT_LIST_HEAD(&page->lru);
> Otherwise it looks OK to me.  Deferring freeing in this way is
> generally lame and gives rise to concerns about memory exhaustion in
> strange situations, and to concerns about various memory accounting
> stats being logically wrong for short periods.  But we already do this
> in (too) many places, so fingers crossed :(
>
It is actually quite rare to hit the condition that a huge page will
have to be freed in an irq context. Otherwise, this problem will be
found earlier. Hopefully the workfn won't be invoked in that many occasions.

Cheers,
Longman

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ