linux-kernel - Re: [PATCH 00/10] fix freepage count problems due to memory isolation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53C3A7A5.9060005@suse.cz>
Date:	Mon, 14 Jul 2014 11:49:25 +0200
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Joonsoo Kim <iamjoonsoo.kim@....com>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Rik van Riel <riel@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Mel Gorman <mgorman@...e.de>,
	Johannes Weiner <hannes@...xchg.org>,
	Minchan Kim <minchan@...nel.org>,
	Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>,
	Zhang Yanfei <zhangyanfei@...fujitsu.com>,
	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>,
	Tang Chen <tangchen@...fujitsu.com>,
	Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
	Bartlomiej Zolnierkiewicz <b.zolnierkie@...sung.com>,
	Wen Congyang <wency@...fujitsu.com>,
	Marek Szyprowski <m.szyprowski@...sung.com>,
	Michal Nazarewicz <mina86@...a86.com>,
	Laura Abbott <lauraa@...eaurora.org>,
	Heesub Shin <heesub.shin@...sung.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
	Ritesh Harjani <ritesh.list@...il.com>,
	t.stanislaws@...sung.com, Gioh Kim <gioh.kim@....com>,
	linux-mm@...ck.org, Lisa Du <cldu@...vell.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
> On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
>> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
>>> Ccing Lisa, because there was bug report it may be related this
>>> topic last Saturday.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>>>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>>>> Hello,
>>>>
>>>> Hi Joonsoo,
>>>>
>>>> please CC me on further updates, this is relevant to me.
>>>
>>> Hello, Vlastimil.
>>>
>>> Sorry for missing you. :)
>>>
>>>>
>>>>> This patchset aims at fixing problems due to memory isolation found by
>>>>> testing my patchset [1].
>>>>>
>>>>> These are really subtle problems so I can be wrong. If you find what I am
>>>>> missing, please let me know.
>>>>>
>>>>> Before describing bugs itself, I first explain definition of freepage.
>>>>>
>>>>> 1. pages on buddy list are counted as freepage.
>>>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>>>
>>>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>>>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>>>> would add more, mostly just to keep this accounting correct.
>>>
>>> It's not just for keeping accouting correct. There is another
>>> purpose for is_migrate_isolate(). It forces freed pages to go into
>>> isolate buddy list. Without it, freed pages would go into other
>>> buddy list and will be used soon. So memory isolation can't work well
>>> without is_migrate_isolate() checks and success rate could decrease.
>>
>> Well I think that could be solved also by doing a lru/pcplists drain
>> right after marking pageblock as MIGRATETYPE_ISOLATE. After that
>> moment, anything newly put on pcplists should observe the new
>> migratetype and go to the correct pcplists. As putting stuff on
>> pcplists is done with disabled interrupts, and draining is done by
>> IPI, I think it should work correctly if we put the migratetype
>> determination under the disabled irq part in free_hot_cold_page().
>
> Hello, sorry for late.
>
> Yes, this can close the race on migratetype buddy list, but, there is
> a problem. See the below.
>
>>
>>> And, I just added three more is_migrate_isolate() in the fast
>>> path, but, two checks are in same *unlikely* branch and I can remove
>>> another one easily. Therefore it's not quite problem I guess. (It even
>>> does no-op if MEMORY_ISOLATION is disabled.)
>>> Moreover, I removed one unconditional get_pageblock_migratetype() in
>>> free_pcppages_bulk() so, in performance point or view, freepath would
>>> be improved.
>>
>> I haven't checked the individual patches yet, but I'll try.
>>
>>>>
>>>> So the question is, does it have to be correct? And (admiteddly not after a completely
>>>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>>>
>>>> Well I of course don't mean that the freepage accounts could go random completely, but
>>>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>>>> where errors are possible? After all, watermarks checking is already racy so I don't think
>>>> it would be hurt that much.
>>>
>>> I understand your suggestion. I once thought like as you, but give up
>>> that idea. Watermark checking is already racy, but, it's *only*
>>> protection to prevent memory allocation. After passing that check,
>>> there is no mean to prevent us from allocating memory. So it should
>>> be accurate as much as possible. If we allow someone to get the
>>> memory without considering memory isolation, free memory could be in
>>> really low state and system could be broken occasionally.
>>
>> It would definitely require more thorough review, but I think with
>> the pcplists draining changes outlined above, there shouldn't be any
>> more inaccuracy happening.
>>
>>>>
>>>> Now if we look at what both CMA and memory hot-remove does is:
>>>>
>>>> 1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE through
>>>> start_isolate_page_range(). As part of this, all free pages in that area are
>>>> moved on the isolate freelist through move_freepages_block().
>>>>
>>>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>>>> caches.
>>>>
>>>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>>>> undo pageblock isolation if not.
>>>>
>>>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>>>> allocator, which would therefore account free pages on the isolate freelist as normal free
>>>> pages.
>>>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>>>> remove in the three steps outlined above, which would be modified as follows:
>>>>
>>>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>>>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>>>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>>>> value of X*
>>>>
>>>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>>>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>>>> This is where some inaccuracy in accounted freepages would build up.
>>>>
>>>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>>>> that we have a isolated range of N pages that nobody can steal now as everything is on
>>>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>>>> X, then N-X). So the accounting matches reality.
>>>>
>>>> If we have to undo, we undo the isolation and as part of this, we use
>>>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>>>> don't care how many pages were moved. We simply add the remembered value of X to the
>>>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>>>
>>>>
>>>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>>>> be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
>>>> handled.
>>>
>>> The 4MB error in accounting looks serious for me. min_free_kbytes is
>>> 4MB in 1GB system. So this 4MB error would makes all things broken in
>>> such systems. Moreover, there are some ARCH having larger
>>> pageblock_order than MAX_ORDER. In this case, the error will be larger
>>> than 4MB.
>>>
>>> In addition, I have a plan to extend CMA to work in parallel. It means
>>> that there could be parallel memory isolation users rather than just
>>> one user at the same time, so, we cannot easily bound the error under
>>> some degree.
>>
>> OK. I had thought that the error would be much smaller in practice,
>> but probably due to how buddy merging works, it would be enough if
>> just the very last freed page in the MAX_ORDER block was misplaced,
>> and that would trigger a cascading merge that will end with single
>> page at MAX_ORDER size becoming misplaced. So let's probably forget
>> this approach.
>>
>>>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>>>> will end up correctly on isolate free list. So we create M free pages by migration, and
>>>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>>>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>>>> pages on pcplists and the possible races it will be less). I think with this improvement,
>>>> any error would be negligible.
>>>>
>>>> Thoughts?
>>>
>>> Thanks for suggestion. :)
>>> It is really good topic to think deeply.
>>>
>>> For now, I'd like to fix these problems without side-effect as you
>>> suggested. Your suggestion changes the meaning of freepage that
>>> isolated pages are included in nr_freepage and there could be possible
>>> regression in success rate of memory hotplug and CMA. Possibly, it
>>> is the way we have to go, but, IMHO, it isn't the time to go. Before
>>> going that way, we should fix current implementation first so that
>>> fixes can be backported to old kernel if someone needs.
>>
>> Adding the extra pcplist drains and reordering
>> get_pageblock_migratetype() under disabled IRQ's should be small
>> enough to be backportable and not bring any regressions to success
>> rates. There will be some cost for the extra drains, but paid only
>> when the memory isolation actually happens. Extending the scope of
>> disabled irq's would affect fast path somewhat, but I guess less
>> than extra checks?
>
> Your approach would close the race on migratetype buddy list. But,
> I'm not sure how we can count number of freepages correctly. IIUC,
> unlike my approach, this approach allows merging between isolate buddy
> list and normal buddy list and it would results in miscounting number of
> freepages. You probably think that move_freepages_block() could fixup
> all the situation, but, I don't think so.

No, I didn't think that, I simply overlooked this scenario. Good catch.

> After applying your approach,
>
> CPU1                                    CPU2
> set_migratetype_isolate()               - free_one_page()
>    - lock
>    - set_pageblock_migratetype()
>    - unlock
>    - drain...
>                                          - lock
>                                          - __free_one_page() with MIGRATE_ISOLATE
>                                          - merged with normal page and linked with
>                                          isolate buddy list
>                                          - skip to count freepage, because of
>                                          MIGRATE_ISOLATE
>                                          - unlock
>    - lock
>    - move_freepages_block()
>      try to fixup freepages count
>    - unlock
>
> We want to fixup counting in move_freepages_block(), but, we can't
> because we don't have enough information whether this page is already
> accounted on number of freepage or not. Could you help me to find out
> the whole mechanism of your approach?

I can't think of an easy fix to this situation.

A non-trivial fix that comes to mind (and I might have overlooked 
something) is something like:

- distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
- CPU1 first sets MIGRATETYPE_ISOLATING before the drain
- when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on special 
unbounded pcplist and that's it
- CPU1 does the drain as usual, potentially misplacing some pages that 
move_freepages_block() will then fix. But no wrong merging can occur.
- after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING to 
MIGRATETYPE_ISOLATED
- CPU2 can then start freeing directly on isolate buddy list. There 
might be some pages still on the special pcplist of CPU2/CPUx but that 
means they won't merge yet.
- CPU1 executes on all CPU's a new operation that flushes the special 
pcplist on isolate buddy list and merge as needed.


> And, IIUC, your approach can remove only one get_pageblock_migratetype()
> in free_pcppages_bulk(). free_hot_cold_page() and free_one_page()
> should have is_migrate_isolate() branches to free the page to correct
> list and count number of freepage correctly. Additionally, it extends
> scope of disabled irq. So it doesn't look much better than mine,
> because my approach also removes get_pageblock_migratetype() in
> free_pcppages_bulk().
>
> Please let me know what I am missing. :)
>
> Thanks.
>
>>
>>> Interestingly,
>>> on last Saturday, Lisa Du reported CMA accounting bugs.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> I don't look at it in detail, but, maybe it is related to these
>>> problems and we should fix it without side-effect.
>>>
>>> So, in conclusion, I think that your suggestion is beyond the scope of
>>> this patchset because of following two reasons.
>>>
>>> 1. I'd like to fix these problems without side-effect(possible
>>> regression in success rate of memory hotplug and CMA, and nr_freepage
>>> meanging change) due to backport possibility.
>>> 2. nr_freepage without considering memory isolation is somewhat dangerous
>>> and not suitable for some systems.
>>>
>>> If you have any objection, please let me know. But, I will go on
>>> a vacation for a week so I can't answer your further comments
>>> for a week. I will reply them next week. :)
>>>
>>> Thanks.
>>>
>>>>> 3. pages on cma buddy list are counted as CMA freepage, too.
>>>>> 4. pages for guard are *not* counted as freepage.
>>>>>
>>>>> Now, I describe problems and related patch.
>>>>>
>>>>> 1. Patch 2: If guard page are cleared and merged into isolate buddy list,
>>>>> we should not add freepage count.
>>>>>
>>>>> 2. Patch 3: When the page return back from pcp to buddy, we should
>>>>> account it to freepage counter. In this case, we should check the
>>>>> pageblock migratetype of the page and should insert the page into
>>>>> appropriate buddy list. Although we checked it in current code, we
>>>>> didn't insert the page into appropriate buddy list so that freepage
>>>>> counting can be wrong.
>>>>>
>>>>> 3. Patch 4: There is race condition so that some freepages could be
>>>>> on isolate buddy list. If so, we can't use this page until next isolation
>>>>> attempt on this pageblock.
>>>>>
>>>>> 4. Patch 5: There is race condition that page on isolate pageblock
>>>>> can go into non-isolate buddy list. If so, buddy allocator would
>>>>> merge pages on non-isolate buddy list and isolate buddy list, respectively,
>>>>> and freepage count will be wrong.
>>>>>
>>>>> 5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
>>>>> Instead, it returns number of pages linked in that migratetype buddy list.
>>>>> So accouting with this return value makes freepage count wrong.
>>>>>
>>>>> 6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
>>>>> and isolate buddy list, respectively. This leads to freepage counting
>>>>> problem so fix it by stopping merging in this case.
>>>>>
>>>>> Without patchset [1], above problem doesn't happens on my CMA allocation
>>>>> test, because CMA reserved pages aren't used at all. So there is no
>>>>> chance for above race.
>>>>>
>>>>> With patchset [1], I did simple CMA allocation test and get below result.
>>>>>
>>>>> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
>>>>> - run kernel build (make -j16) on background
>>>>> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
>>>>> - Result: more than 5000 freepage count are missed
>>>>>
>>>>> With patchset [1] and this patchset, I found that no freepage count are
>>>>> missed so that I conclude that problems are solved.
>>>>>
>>>>> These problems can be possible on memory hot remove users, although
>>>>> I didn't check it further.
>>>>>
>>>>> Other patches are either for the base to fix these problems or for
>>>>> simple clean-up. Please see individual patches for more information.
>>>>>
>>>>> This patchset is based on linux-next-20140703.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> [1]: Aggressively allocate the pages on cma reserved memory
>>>>>       https://lkml.org/lkml/2014/5/30/291
>>>>>
>>>>>
>>>>> Joonsoo Kim (10):
>>>>>    mm/page_alloc: remove unlikely macro on free_one_page()
>>>>>    mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>>>>>    mm/page_alloc: handle page on pcp correctly if it's pageblock is
>>>>>      isolated
>>>>>    mm/page_alloc: carefully free the page on isolate pageblock
>>>>>    mm/page_alloc: optimize and unify pageblock migratetype check in free
>>>>>      path
>>>>>    mm/page_alloc: separate freepage migratetype interface
>>>>>    mm/page_alloc: store migratetype of the buddy list into freepage
>>>>>      correctly
>>>>>    mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
>>>>>    mm/page_alloc: fix possible wrongly calculated freepage counter
>>>>>    mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
>>>>>      list
>>>>>
>>>>>   include/linux/mm.h             |   30 +++++++--
>>>>>   include/linux/mmzone.h         |    5 ++
>>>>>   include/linux/page-isolation.h |    8 +++
>>>>>   mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
>>>>>   mm/page_isolation.c            |   18 ++----
>>>>>   5 files changed, 147 insertions(+), 52 deletions(-)
>>>>>
>>>>
>>>> --
>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>> the body to majordomo@...ck.org.  For more info on Linux MM,
>>>> see: http://www.linux-mm.org/ .
>>>> Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@...ck.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>
>>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@...ck.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/