Message-ID: <4FE0FA7B.7020407@gmail.com>
Date:	Tue, 19 Jun 2012 18:17:31 -0400
From:	KOSAKI Motohiro <kosaki.motohiro@...il.com>
To:	Minchan Kim <minchan@...nel.org>
CC:	Aaditya Kumar <aaditya.kumar.30@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	akpm@...ux-foundation.org, Nick Piggin <npiggin@...il.com>,
	Michal Hocko <mhocko@...e.cz>,
	Johannes Weiner <hannes@...xchg.org>,
	Mel Gorman <mel@....ul.ie>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Minchan Kim <minchan.kim@...il.com>, frank.rowand@...sony.com,
	tim.bird@...sony.com, takuzo.ohara@...sony.com,
	kan.iibuchi@...sony.com
Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

(6/17/12 8:43 PM), Minchan Kim wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
> 
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@...nel.org> wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't distinguish between zones. Therefore kswapd may
>>>> sleep if the node has multiple zones. Hm, OK, I realize my description was
>>>> slightly misleading. Priority 0 is not needed: balance_pgdat() calls
>>>> pgdat_balanced() at every priority. The easiest case is when the movable
>>>> zone has a lot of free pages and the normal zone has no reclaimable pages.
>>>>
>>>> btw, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>> sleep only if every zone has more free pages than its high watermark
>>>> _and_ 25% of the present pages in the node are free.
>>>>
>>>
>>>
>>> Sorry. I can't understand your point.
>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>> It seems I am missing your point.
>>> Please anybody correct me.
>>
>> Currently direct reclaim gives up based on the zone->all_unreclaimable
>> flag, so consider, for example, the following scenario:
>>
>> Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplug, the allocation function
>> (for the new page to which a page in the MOVABLE zone would be moved) can
>> end up looping in the direct reclaim path forever.
>>
>> This is because, once most of the pages in the MOVABLE zone have been
>> migrated, the zone contains lots of free memory (basically above the low
>> watermark), BUT all of it is on the MIGRATE_ISOLATE list of the buddy lists.
>>
>> So kswapd() would not balance this zone, as its free pages are above the
>> low watermark (but all are on the isolate list). The
>> zone->all_unreclaimable flag would therefore never be set for this zone,
>> and the allocation function would end up looping forever (assuming the
>> NORMAL zone is left with no reclaimable memory).
>>
> 
> 
> Thanks a lot, Aaditya! The scenario you mentioned makes perfect sense.
> But I don't see it as a kswapd problem.
> 
> a5d76b54 introduced a new migrate type, 'MIGRATE_ISOLATE', which is a very
> ironic type because there can be many free pages in the free list
> but we can't allocate them. :(
> NR_FREE_PAGES doesn't reflect this correctly, while many places in the
> kernel use NR_FREE_PAGES to trigger some operation.
> Kswapd is just one of those that get confused.
> As the right fix for this problem, IMHO we should fix the hotplug code,
> which can fix CMA, too.
> 
> This patch could create an inconsistency between NR_FREE_PAGES and
> SumOf[free_area[order].nr_free], and that could confuse
> __zone_watermark_ok, so we might need to fix move_freepages_block itself
> so that free_area[order].nr_free is reflected exactly.
> 
> Any thoughts?
> 
> Side note: I still need KOSAKI's patch, with a fixed description, regardless
> of this problem, because having only kswapd set zone->all_unreclaimable is
> very fragile.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>  
>  out:
>         if (!ret) {
> +               int pages_moved;
>                 set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> -               move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +               __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>         }   
>  
>         spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  {
>         struct zone *zone;
>         unsigned long flags;
> +       int pages_moved;
>         zone = page_zone(page);
>         spin_lock_irqsave(&zone->lock, flags);
>         if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>                 goto out;
>         set_pageblock_migratetype(page, migratetype);
> -       move_freepages_block(zone, page, migratetype);
> +       pages_moved = move_freepages_block(zone, page, migratetype);
> +       __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
>  out:
>         spin_unlock_irqrestore(&zone->lock, flags);
>  }

Unfortunately, this doesn't work. There are two reasons.

1) When memory hotplug occurs, we have two scenarios: a) a page is freed ->
the page block is changed to isolate, and b) the page block is changed to
isolate -> a page is freed. The above patch only handles scenario (a), so it
leads to a confusing NR_FREE_PAGES value. _If_ we put a new branch in the
page-free hotpath, we can solve scenario (b), but I don't like that, because
zero hotpath overhead is one of the memory hotplug design principles.

2) Even if we can solve the above issue, the all_unreclaimable logic is still
broken, because __alloc_pages_slowpath() wakes up kswapd only once and doesn't
wake it up again on the "goto rebalance" path. But wake_all_kswapd() is racy
and does not guarantee that kswapd wakes up. This means direct reclaim should
work fine w/o background reclaim.
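
To make the ordering problem in reason 1) concrete, here is a toy userspace
model. It is only a sketch and not kernel code: struct toy_zone and all the
toy_* helpers are hypothetical names invented for this illustration. It
assumes one zone with a single page block, an isolate step that subtracts the
already-free pages from the counter (as the patch above does), and a free
path that has no isolate check.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): one zone, one page block. */
struct toy_zone {
	long nr_free_pages;   /* stands in for the NR_FREE_PAGES counter */
	long free_in_block;   /* free pages currently sitting in the block */
	bool block_isolated;  /* stands in for MIGRATE_ISOLATE on the block */
};

/*
 * Models the isolate step with the accounting from the patch above:
 * pages that are already free in the block leave the counter.
 */
static void toy_isolate(struct toy_zone *z)
{
	assert(!z->block_isolated);
	z->block_isolated = true;
	z->nr_free_pages -= z->free_in_block;
}

/* Models the page-free hot path: no branch that checks for isolation. */
static void toy_free_page(struct toy_zone *z)
{
	z->free_in_block++;
	z->nr_free_pages++;
}

/* Pages that could really be allocated: none while the block is isolated. */
static long toy_allocatable(const struct toy_zone *z)
{
	return z->block_isolated ? 0 : z->free_in_block;
}

/* Scenario (a): free page -> isolate. Returns the counter error. */
static long toy_scenario_a(void)
{
	struct toy_zone z = { 0, 0, false };

	toy_free_page(&z);
	toy_isolate(&z);
	return z.nr_free_pages - toy_allocatable(&z);
}

/* Scenario (b): isolate -> free page. Returns the counter error. */
static long toy_scenario_b(void)
{
	struct toy_zone z = { 0, 0, false };

	toy_isolate(&z);
	toy_free_page(&z);
	return z.nr_free_pages - toy_allocatable(&z);
}
```

In this model toy_scenario_a() leaves the counter consistent, while
toy_scenario_b() leaves it one page too high, because the free path never
learns that the block is isolated. Closing that gap would need the extra
branch in the free hotpath mentioned above.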

