linux-kernel - Re: [lkp] [mm, page_alloc] d0164adc89: -100.0% fsmark.app

Message-ID: <20151202110009.GA2015@techsingularity.net>
Date:	Wed, 2 Dec 2015 11:00:09 +0000
From:	Mel Gorman <mgorman@...hsingularity.net>
To:	"Huang, Ying" <ying.huang@...ux.intel.com>
Cc:	lkp@...org, LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Vitaly Wool <vitalywool@...il.com>,
	David Rientjes <rientjes@...gle.com>,
	Christoph Lameter <cl@...ux.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Will Deacon <will.deacon@....com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [lkp] [mm, page_alloc] d0164adc89: -100.0% fsmark.app_overhead

On Mon, Nov 30, 2015 at 10:14:24AM +0800, Huang, Ying wrote:
> > There is no reference to OOM possibility in the email that I can see. Can
> > you give examples of the OOM messages that shows the problem sites? It was
> > suspected that there may be some callers that were accidentally depending
> > on access to emergency reserves. If so, either they need to be fixed (if
> > the case is extremely rare) or a small reserve will have to be created
> > for callers that are not high priority but still cannot reclaim.
> >
> > Note that I'm travelling a lot over the next two weeks so I'll be slow to
> > respond but I will get to it.
> 
> Here is the kernel log,  the full dmesg is attached too.  The OOM
> occurs during fsmark testing.
> 
> Best Regards,
> Huang, Ying
> 
> [   31.453514] kworker/u4:0: page allocation failure: order:0, mode:0x2200000
> [   31.463570] CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
> [   31.466115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [   31.477146] Workqueue: writeback wb_workfn (flush-253:0)
> [   31.481450]  0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
> [   31.492582]  ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
> [   31.507631]  ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000

This is a page allocation failure, not an invocation of the OOM killer, so
it is less severe, but it still looks like a bug in the driver. Looking
at the history and the discussion, it appears to me that __GFP_HIGH was
cleared from the allocation site by accident. I strongly suspect that Will
Deacon thought __GFP_HIGH was related to highmem instead of being related
to high priority.  Will, can you review the following patch please? Ying,
can you test please?
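
For anyone following along, the effect of the mask change can be sketched
in plain userspace C. This is only an illustration: the DEMO_* names and
bit values are placeholders invented for the example and are not the
kernel's definitions, and the helpers are hypothetical.

```c
/* Userspace sketch only; DEMO_* values are placeholders, not kernel flags. */
#define DEMO_GFP_HIGHMEM 0x02u    /* stands in for __GFP_HIGHMEM */
#define DEMO_GFP_HIGH    0x20u    /* stands in for __GFP_HIGH (high priority) */
#define DEMO_GFP_OTHER   0x80000u /* stands in for the remaining request bits */

/* Mask as applied before the patch: clears highmem AND high priority. */
static unsigned int mask_before(unsigned int gfp)
{
	return gfp & ~(DEMO_GFP_HIGHMEM | DEMO_GFP_HIGH);
}

/* Mask as applied after the patch: clears only the highmem bit. */
static unsigned int mask_after(unsigned int gfp)
{
	return gfp & ~DEMO_GFP_HIGHMEM;
}

/* Hypothetical helper: does the request still carry the priority bit? */
static int has_reserve_access(unsigned int gfp)
{
	return (gfp & DEMO_GFP_HIGH) != 0;
}
```

With a request of DEMO_GFP_HIGH | DEMO_GFP_OTHER, mask_before() drops the
priority bit while mask_after() leaves the request unchanged, which is the
behaviour the patch below is meant to restore.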

---8<---
virtio: allow vring descriptor allocations to use high-priority reserves

Commit b92b1b89a33c ("virtio: force vring descriptors to be allocated
from lowmem") prevented the inappropriate use of highmem pages but it
also masked out __GFP_HIGH. __GFP_HIGH is used for GFP_ATOMIC allocation
requests to grant access to a small emergency reserve. It's intended for
use by callers that have no alternative.

Ying Huang reported the following page allocation failure warning after
commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
sleep, unwilling to sleep and avoiding waking kswapd"):

    kworker/u4:0: page allocation failure: order:0, mode:0x2200000
    CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Workqueue: writeback wb_workfn (flush-253:0)
     0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
     ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
     ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
    Call Trace:
     [<ffffffff8140a142>] dump_stack+0x4b/0x69
     [<ffffffff8117117b>] warn_alloc_failed+0xdb/0x140
     [<ffffffff81174ec4>] __alloc_pages_nodemask+0x874/0xa60
     [<ffffffff811bcb62>] alloc_pages_current+0x92/0x120
     [<ffffffff811c73e4>] new_slab+0x3d4/0x480
     [<ffffffff811c7c36>] __slab_alloc+0x376/0x470
     [<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
     [<ffffffff81338221>] ? xfs_submit_ioend_bio+0x31/0x40
     [<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
     [<ffffffff811c8e8d>] __kmalloc+0x20d/0x260
     [<ffffffff814e0ced>] alloc_indirect+0x1d/0x50
     [<ffffffff814e0fec>] virtqueue_add_sgs+0x2cc/0x3a0
     [<ffffffff81573a30>] __virtblk_add_req+0xb0/0x1f0
     [<ffffffff8117a121>] ? pagevec_lookup_tag+0x21/0x30
     [<ffffffff813e5d72>] ? blk_rq_map_sg+0x1e2/0x4f0
     [<ffffffff81573c82>] virtio_queue_rq+0x112/0x280
     [<ffffffff813e9de7>] __blk_mq_run_hw_queue+0x1d7/0x370
     [<ffffffff813e9bef>] blk_mq_run_hw_queue+0x9f/0xc0
     [<ffffffff813eb10a>] blk_mq_insert_requests+0xfa/0x1a0
     [<ffffffff813ebdb3>] blk_mq_flush_plug_list+0x123/0x140
     [<ffffffff813e1777>] blk_flush_plug_list+0xa7/0x200
     [<ffffffff813e1c49>] blk_finish_plug+0x29/0x40
     [<ffffffff81215f85>] wb_writeback+0x185/0x2c0
     [<ffffffff812166a5>] wb_workfn+0xf5/0x390
     [<ffffffff81091297>] process_one_work+0x157/0x420
     [<ffffffff81091ef9>] worker_thread+0x69/0x4a0
     [<ffffffff81091e90>] ? rescuer_thread+0x380/0x380
     [<ffffffff8109746f>] kthread+0xef/0x110
     [<ffffffff81097380>] ? kthread_park+0x60/0x60
     [<ffffffff818bce8f>] ret_from_fork+0x3f/0x70
     [<ffffffff81097380>] ? kthread_park+0x60/0x60

Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
sleep, unwilling to sleep and avoiding waking kswapd") is stricter about
reserves. It distinguishes between callers that are high-priority with
access to emergency reserves and callers that simply do not want to sleep
and have recovery options. The reported allocation failure comes from a
call site that is truly atomic with no recovery options and that appears to
have cleared __GFP_HIGH by mistake, for reasons that are unrelated to
highmem. This patch restores the flag.

Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..f9e119e6df18 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -107,9 +107,10 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	/*
 	 * We require lowmem mappings for the descriptors because
 	 * otherwise virt_to_phys will give us bogus addresses in the
-	 * virtqueue.
+	 * virtqueue. Access to high-priority reserves is preserved
+	 * if originally requested by GFP_ATOMIC.
 	 */
-	gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH);
+	gfp &= ~__GFP_HIGHMEM;
 
 	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
 	if (!desc)