linux-kernel - Re: [PATCH] hugetlbfs: Disable IRQ when taking hugetlb_lock in set_max_huge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1209d9ba-9d82-4dfc-5cdf-a2641814af75@redhat.com>
Date:   Mon, 9 Dec 2019 21:37:00 -0500
From:   Waiman Long <longman@...hat.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH] hugetlbfs: Disable IRQ when taking hugetlb_lock in
 set_max_huge_pages()

On 12/9/19 7:46 PM, Waiman Long wrote:
> On 12/9/19 11:49 AM, Matthew Wilcox wrote:
>> On Mon, Dec 09, 2019 at 11:01:50AM -0500, Waiman Long wrote:
>>> [  613.245273] Call Trace:
>>> [  613.256273]  <IRQ>
>>> [  613.265273]  dump_stack+0x9a/0xf0
>>> [  613.281273]  mark_lock+0xd0c/0x12f0
>>> [  613.341273]  __lock_acquire+0x146b/0x48c0
>>> [  613.401273]  lock_acquire+0x14f/0x3b0
>>> [  613.440273]  _raw_spin_lock+0x30/0x70
>>> [  613.477273]  free_huge_page+0x36f/0xaa0
>>> [  613.495273]  bio_check_pages_dirty+0x2fc/0x5c0
>> Oh, this is fun.  So we kicked off IO to a hugepage, then truncated or
>> otherwise caused the page to come free.  Then the IO finished and did the
>> final put on the page ... from interrupt context.  Splat.  Not something
>> that's going to happen often, but can happen if a process dies during
>> IO or due to a userspace bug.
>>
>> Maybe we should schedule work to do the actual freeing of the page
>> instead of this rather large patch.  It doesn't seem like a case we need
>> to optimise for.
> I think that may be a good idea to try it out.

It turns out using workqueue is more complex that I originally thought.
I currently come up with the following untested changes to do that:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..629ac000318b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct
page *pag
        page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
        /*
         * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,82 @@ void free_huge_page(struct page *page)
        spin_unlock(&hugetlb_lock);
 }
 
+/*
+ * As free_huge_page() can be called from softIRQ context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ */
+struct hpage_to_free {
+       struct page             *page;
+       struct hpage_to_free    *next;
+};
+static struct hpage_to_free *hpage_freelist;
+#define NEXT_PENDING   ((struct hpage_to_free *)-1)
+
+/*
+ * This work function locklessly retrieves the pages to be freed and
+ * frees them one-by-one.
+ */
+static void free_huge_page_workfn(struct work_struct *work)
+{
+       struct hpage_to_free *curr, *next;
+
+recheck:
+       curr = xchg(&hpage_freelist, NULL);
+       if (!curr)
+               return;
+
+       while (curr) {
+               __free_huge_page(curr->page);
+               next = READ_ONCE(curr->next);
+               while (next == NEXT_PENDING) {
+                       cpu_relax();
+                       next = READ_ONCE(curr->next);
+               }
+               kfree(curr);
+               curr = next;
+       }
+       if (work && READ_ONCE(hpage_freelist))
+               goto recheck;
+}
+static DECLARE_WORK(free_huge_page_work, free_huge_page_workfn);
+
+void free_huge_page(struct page *page)
+{
+       /*
+        * Defer freeing in softIRQ context to avoid hugetlb_lock deadlock.
+        */
+       if (in_serving_softirq()) {
+               struct hpage_to_free *item, *next;
+
+               /*
+                * We are in serious trouble if kmalloc fails. In this
+                * case, we hope for the best and do the freeing now.
+                */
+               item = kmalloc(sizeof(*item), GFP_KERNEL);
+               if (WARN_ON_ONCE(!item))
+                       goto free_page_now;
+
+               item->page = page;
+               item->next = NEXT_PENDING;
+               next = xchg(&hpage_freelist, item);
+               WRITE_ONCE(item->next, next);
+               schedule_work(&free_huge_page_work);
+               return;
+       }
+
+       /*
+        * Racing may prevent some of deferred huge pages in hpage_freelist
+        * from being freed. Check here and call schedule_work() if that
+        * is the case.
+        */
+       if (hpage_freelist && !work_pending(&free_huge_page_work))
+               schedule_work(&free_huge_page_work);
+
+free_page_now:
+       __free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int
nid)
 {
        INIT_LIST_HEAD(&page->lru);

-----------------------------------------------------------------------------------------------------

I think it may be simpler and less risky to use spin_lock_bh() as you
have suggested. Also, the above code will not be good enough if more
lock taking functions are being called from softIRQ context in the future.

So what do you think?

Cheers,
Longman

Cheers,
Longman