linux-kernel - Re: [RFC] can we use vmalloc to alloc thread stack if compaction failed

Message-ID: <CALCETrV++O=ynMKYwdhG-AksnVXX6hBpBxtXfNaa_dhVLMu2Tg@mail.gmail.com>
Date:	Fri, 29 Jul 2016 12:47:38 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Joonsoo Kim <iamjoonsoo.kim@....com>
Cc:	Andy Lutomirski <luto@...nel.org>, Xishi Qiu <qiuxishi@...wei.com>,
	Michal Hocko <mhocko@...nel.org>, Tejun Heo <tj@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Linux MM <linux-mm@...ck.org>,
	Yisheng Xie <xieyisheng1@...wei.com>
Subject: Re: [RFC] can we use vmalloc to alloc thread stack if compaction failed

---------- Forwarded message ----------
From: "Joonsoo Kim" <iamjoonsoo.kim@....com>
Date: Jul 28, 2016 7:57 PM
Subject: Re: [RFC] can we use vmalloc to alloc thread stack if compaction failed
To: "Andy Lutomirski" <luto@...nel.org>
Cc: "Xishi Qiu" <qiuxishi@...wei.com>, "Michal Hocko"
<mhocko@...nel.org>, "Tejun Heo" <tj@...nel.org>, "Ingo Molnar"
<mingo@...nel.org>, "Peter Zijlstra" <peterz@...radead.org>, "LKML"
<linux-kernel@...r.kernel.org>, "Linux MM" <linux-mm@...ck.org>,
"Yisheng Xie" <xieyisheng1@...wei.com>

> On Thu, Jul 28, 2016 at 08:07:51AM -0700, Andy Lutomirski wrote:
> > On Thu, Jul 28, 2016 at 3:51 AM, Xishi Qiu <qiuxishi@...wei.com> wrote:
> > > On 2016/7/28 17:43, Michal Hocko wrote:
> > >
> > >> On Thu 28-07-16 16:45:06, Xishi Qiu wrote:
> > >>> On 2016/7/28 15:58, Michal Hocko wrote:
> > >>>
> > >>>> On Thu 28-07-16 15:41:53, Xishi Qiu wrote:
> > >>>>> On 2016/7/28 15:20, Michal Hocko wrote:
> > >>>>>
> > >>>>>> On Thu 28-07-16 15:08:26, Xishi Qiu wrote:
> > >>>>>>> Usually THREAD_SIZE_ORDER is 2, which means we need to allocate 16 kB of
> > >>>>>>> physically contiguous memory when forking a new process.
> > >>>>>>>
> > >>>>>>> If the system's memory is very small, as on a smartphone with perhaps only
> > >>>>>>> 1 GB of memory, free memory is scarce and compaction does not always
> > >>>>>>> succeed in the slow path (__alloc_pages_slowpath), so the thread stack
> > >>>>>>> allocation may fail because of memory fragmentation.
> > >>>>>>
> > >>>>>> Well, with the current implementation of the page allocator those
> > >>>>>> requests will not fail in most cases. The oom killer would be invoked in
> > >>>>>> order to free up some memory.
> > >>>>>>
> > >>>>>
> > >>>>> Hi Michal,
> > >>>>>
> > >>>>> Yes, it succeeds in most cases, but I have seen this problem in some
> > >>>>> stress tests.
> > >>>>>
> > >>>>> DMA free:470628kB, but an order-2 allocation failed while forking a new
> > >>>>> process. There are so many memory fragments, and under the stress test a
> > >>>>> large block may be taken by others soon after compaction.
> > >>>>>
> > >>>>> --- dmesg messages ---
> > >>>>> 07-13 08:41:51.341 <4>[309805.658142s][pid:1361,cpu5,sManagerService]sManagerService: page allocation failure: order:2, mode:0x2000d1
> > >>>>
> > >>>> Yes but this is __GFP_DMA allocation. I guess you have already reported
> > >>>> this failure and you've been told that this is quite unexpected for the
> > >>>> kernel stack allocation. It is your out-of-tree patch which just makes
> > >>>> things worse because DMA restricted allocations are considered "lowmem"
> > >>>> and so they do not invoke OOM killer and do not retry like regular
> > >>>> GFP_KERNEL allocations.
> > >>>
> > >>> Hi Michal,
> > >>>
> > >>> Yes, we add GFP_DMA, but I don't think this is the key to the problem.
> > >>
> > >> You are restricting the allocation request to a single zone which is
> > >> definitely not good. Look at how many larger order pages are available
> > >> in the Normal zone.
> > >>
> > >>> If we invoke the OOM killer, maybe we will get a large block later, but
> > >>> there is enough free memory before OOM (although most of it is fragmented).
> > >>
> > >> Killing a task is of course the last resort action. It would give you
> > >> the larger order blocks used for the victim's threads.
> > >>
> > >>> I wonder if the allocation can succeed without killing any process in this situation.
> > >>
> > >> Sure it would be preferable to compact that memory but that might be
> > >> hard with your restriction in place. Consider that DMA zone would tend
> > >> to be less movable than normal zones as users would have to pin it for
> > >> DMA. Your DMA is really large so this might turn out to just happen to
> > >> work but note that the primary problem here is that you put a zone
> > >> restriction for your allocations.
> > >>
> > >>> Maybe using vmalloc is a good way, but I don't know the impact.
> > >>
> > >> You can have a look at vmalloc patches posted by Andy. They are not that
> > >> trivial.
> > >>
> > >
> > > Hi Michal,
> > >
> > > Thank you for your comment, could you give me the link?
> > >
> >
> > I've been keeping it mostly up to date in this branch:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/vmap_stack
> >
> > It's currently out of sync due to a bunch of the patches being queued
> > elsewhere for the merge window.
>
> Hello, Andy.
>
> I have some questions about it.
>
> IIUC, to turn on HAVE_ARCH_VMAP_STACK on a different architecture, there
> is nothing to be done on the architecture side if the architecture doesn't
> lazily fault in top-level paging entries for the vmalloc area. Is my
> understanding correct?
>

There should be nothing fundamental that needs to be done.  On the
other hand, it might be good to make sure the arch code can print a
clean stack trace on stack overflow.

If it's helpful, I just pushed out anew

> And, I'd like to know how you search problematic places using kernel
> stack for DMA.
>

I did some searching for problematic sg_init_buf calls using
Coccinelle.  I'm not very good at Coccinelle, so I may have missed
something.

For the most part, DMA API debugging should have found the problems
already.  The ones I found were in drivers that didn't do real DMA:
crypto users and virtio.

> One note: if the stack grows down, a stack overflow hits the page just
> below the stack's end (its lowest address), but the guard page is placed
> just above the stack's beginning (its highest address). So this stack
> overflow detection depends on the previous vmalloc-ed area having been
> allocated without VM_NO_GUARD. There aren't many users of this flag, so
> it should not be a problem, but it is worth noting.

Yes, and that's a known weakness.  It would be nice to improve it.

--Andy