Message-ID: <b3970971-1f2c-3c9b-d0e3-008f57c45b74@linux.alibaba.com>
Date: Wed, 22 May 2019 11:25:07 +0800
From: Yang Shi <yang.shi@...ux.alibaba.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: ying.huang@...el.com, mhocko@...e.com, mgorman@...hsingularity.net,
kirill.shutemov@...ux.intel.com, josef@...icpanda.com,
hughd@...gle.com, shakeelb@...gle.com, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [v3 PATCH 2/2] mm: vmscan: correct some vmscan counters for THP
swapout
On 5/22/19 12:00 AM, Johannes Weiner wrote:
> On Tue, May 21, 2019 at 05:40:42PM +0800, Yang Shi wrote:
>> Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after
>> swapped out"), a THP can be swapped out as a whole. But nr_reclaimed
>> and some other vm counters still get incremented by just one even
>> though a whole THP (512 pages) gets swapped out.
>>
>> This doesn't make much sense for memory reclaim. For example, direct
>> reclaim may only need to reclaim SWAP_CLUSTER_MAX pages, which
>> reclaiming a single THP could fulfill. But if nr_reclaimed is not
>> increased correctly, direct reclaim may keep reclaiming more pages
>> than necessary, up to SWAP_CLUSTER_MAX * 512 pages in the worst case.
>>
>> It may also cause pgsteal_{kswapd|direct} to be greater than
>> pgscan_{kswapd|direct}, like below:
>>
>> pgsteal_kswapd 122933
>> pgsteal_direct 26600225
>> pgscan_kswapd 174153
>> pgscan_direct 14678312
>>
>> nr_reclaimed and nr_scanned must be fixed in parallel, otherwise it
>> would break some page reclaim logic, e.g.:
>>
>> vmpressure: this looks at the scanned/reclaimed ratio so it won't
>> change semantics as long as scanned & reclaimed are fixed in parallel.
>>
>> compaction/reclaim: compaction wants a certain number of physical pages
>> freed up before going back to compacting.
>>
>> kswapd priority raising: kswapd raises priority if we scan fewer pages
>> than the reclaim target (which itself is obviously expressed in order-0
>> pages). As a result, kswapd can falsely raise its aggressiveness even
>> when it's making great progress.
>>
>> Besides nr_scanned and nr_reclaimed, some other counters, e.g.
>> pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail, need to be
>> fixed too since they are user visible via cgroup, /proc/vmstat or
>> trace points; otherwise they would be underreported.
>>
>> When isolating pages from LRUs, nr_taken is already accounted in base
>> pages, but nr_scanned and nr_skipped are still accounted in THPs. This
>> is inconsistent and may cause the trace points to underreport the
>> numbers as well.
>>
>> So account those counters in base pages instead of counting a THP as
>> one page.
>>
>> This change may result in a lower steal/scan ratio in some cases,
>> since a THP may get split during page reclaim and then only some of
>> the tail pages get reclaimed instead of the whole 512 pages, while
>> nr_scanned is still accounted as 512, particularly for direct reclaim.
>> But this should not be a significant issue.
>>
>> Cc: "Huang, Ying" <ying.huang@...el.com>
>> Cc: Johannes Weiner <hannes@...xchg.org>
>> Cc: Michal Hocko <mhocko@...e.com>
>> Cc: Mel Gorman <mgorman@...hsingularity.net>
>> Cc: "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>
>> Cc: Hugh Dickins <hughd@...gle.com>
>> Cc: Shakeel Butt <shakeelb@...gle.com>
>> Signed-off-by: Yang Shi <yang.shi@...ux.alibaba.com>
>> ---
>> v3: Removed Shakeel's Reviewed-by since the patch has been changed significantly
>> Switched back to use compound_order per Matthew
>> Fixed more counters per Johannes
>> v2: Added Shakeel's Reviewed-by
>> Use hpage_nr_pages instead of compound_order per Huang Ying and William Kucharski
>>
>> mm/vmscan.c | 40 ++++++++++++++++++++++++++++------------
>> 1 file changed, 28 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b65bc50..1044834 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1250,7 +1250,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> case PAGEREF_ACTIVATE:
>> goto activate_locked;
>> case PAGEREF_KEEP:
>> - stat->nr_ref_keep++;
>> + stat->nr_ref_keep += (1 << compound_order(page));
>> goto keep_locked;
>> case PAGEREF_RECLAIM:
>> case PAGEREF_RECLAIM_CLEAN:
>> @@ -1294,6 +1294,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> goto activate_locked;
>> }
>>
>> + /*
>> + * Account all tail pages when THP is added
>> + * into swap cache successfully.
>> + * The head page has been accounted at the
>> + * first place.
>> + */
>> + if (PageTransHuge(page))
>> + sc->nr_scanned +=
>> + ((1 << compound_order(page)) -
>> + 1);
>> +
>> may_enter_fs = 1;
> Even if we don't split and reclaim the page, we should always account
> the number of base pages in nr_scanned. Otherwise it's not clear what
> nr_scanned means.
Sure.
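
Something like the below should do it, I suppose (rough, untested
sketch; nr_pages would be a local int as you suggest further down):

	page = lru_to_page(page_list);
	list_del(&page->lru);

	if (!trylock_page(page))
		goto keep;

	VM_BUG_ON_PAGE(PageActive(page), page);

	/*
	 * Account base pages regardless of whether the THP later gets
	 * added to swap cache, split or reclaimed, so nr_scanned always
	 * means base pages.
	 */
	nr_pages = 1 << compound_order(page);
	sc->nr_scanned += nr_pages;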
>
>> /* Adding to swap updated mapping */
>> @@ -1315,7 +1326,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> if (unlikely(PageTransHuge(page)))
>> flags |= TTU_SPLIT_HUGE_PMD;
>> if (!try_to_unmap(page, flags)) {
>> - stat->nr_unmap_fail++;
>> + stat->nr_unmap_fail +=
>> + (1 << compound_order(page));
>> goto activate_locked;
>> }
>> }
>> @@ -1442,7 +1454,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>
>> unlock_page(page);
>> free_it:
>> - nr_reclaimed++;
>> + /*
>> + * THP may get swapped out in a whole, need account
>> + * all base pages.
>> + */
>> + nr_reclaimed += (1 << compound_order(page));
> This expression is quite repetitive. Why not do
>
> int nr_pages;
>
> page = lru_to_page(page_list);
> nr_pages = 1 << compound_order(page);
> list_del(&page->lru);
>
> if (!trylock_page(page))
> ...
>
> at the head of the loop and add nr_pages to all these counters
> instead?
Because at this point it is unknown whether the THP will be swapped out
as a whole or split first. nr_scanned is fine, but nr_reclaimed is not.
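
E.g. (rough, untested sketch of why a value cached at the head of the
loop can go stale for nr_reclaimed):

	/* head of the loop */
	nr_pages = 1 << compound_order(page);	/* 512 for a THP */
	sc->nr_scanned += nr_pages;
	...
	/*
	 * add_to_swap() may fail and split the THP; the tail pages go
	 * back onto page_list and this page is order 0 from here on.
	 */
	...
free_it:
	/*
	 * Re-read the size: after a split only the head page is freed
	 * here, so reusing the cached nr_pages (512) would over-account
	 * nr_reclaimed.
	 */
	nr_reclaimed += 1 << compound_order(page);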
>
>> @@ -1642,14 +1659,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
>> unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
>> unsigned long skipped = 0;
>> - unsigned long scan, total_scan, nr_pages;
>> + unsigned long scan, nr_pages;
>> LIST_HEAD(pages_skipped);
>> isolate_mode_t mode = (sc->may_unmap ? 0 : ISOLATE_UNMAPPED);
>>
>> scan = 0;
>> - for (total_scan = 0;
>> - scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src);
>> - total_scan++) {
>> + while (scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src)) {
>> struct page *page;
> Once you fixed the units, scan < nr_to_scan && nr_taken >= nr_to_scan
> is an impossible condition. You should be able to write:
>
> while (scan < nr_to_scan && !list_empty(src))
Yes.
>
> Also, you need to keep total_scan. The trace point wants to know how
> many pages were actually looked at, including the ones from ineligible
> zones that were skipped over.
Aha, yes. total_scan includes both scanned and skipped pages. Will fix in v4.
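
Something like the below, perhaps (rough, untested sketch; scan and
nr_to_scan are both in base pages here, and nr_pages is computed once
per page as you suggest further down):

	total_scan = 0;
	scan = 0;
	while (scan < nr_to_scan && !list_empty(src)) {
		struct page *page;

		page = lru_to_page(src);
		prefetchw_prev_lru_page(page, src, flags);

		VM_BUG_ON_PAGE(!PageLRU(page), page);

		/* Count everything we look at, including skipped pages */
		nr_pages = 1 << compound_order(page);
		total_scan += nr_pages;

		if (page_zonenum(page) > sc->reclaim_idx) {
			list_move(&page->lru, &pages_skipped);
			nr_skipped[page_zonenum(page)] += nr_pages;
			continue;
		}

		/* Only eligible pages count towards the scan target */
		scan += nr_pages;
		...
	}
	...
	*nr_scanned = total_scan;
	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
				    total_scan, skipped, nr_taken, mode, lru);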
>
>>
>> page = lru_to_page(src);
>> @@ -1659,7 +1674,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>>
>> if (page_zonenum(page) > sc->reclaim_idx) {
>> list_move(&page->lru, &pages_skipped);
>> - nr_skipped[page_zonenum(page)]++;
>> + nr_skipped[page_zonenum(page)] +=
>> + (1 << compound_order(page));
>> continue;
>> }
>>
>> @@ -1669,7 +1685,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> * ineligible pages. This causes the VM to not reclaim any
>> * pages, triggering a premature OOM.
>> */
>> - scan++;
>> + scan += (1 << compound_order(page));
>> switch (__isolate_lru_page(page, mode)) {
>> case 0:
>> nr_pages = hpage_nr_pages(page);
> Same here, you can calculate nr_pages at the top of the loop and use
> it throughout.
Yes. Will fix in v4.
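
With nr_pages computed right after lru_to_page() (as in the sketch
above), the isolation path would then just reuse it, roughly:

	switch (__isolate_lru_page(page, mode)) {
	case 0:
		/* Reuse nr_pages computed at the top of the loop */
		nr_taken += nr_pages;
		nr_zone_taken[page_zonenum(page)] += nr_pages;
		list_move(&page->lru, dst);
		break;

	case -EBUSY:
		/* else it is being freed elsewhere */
		list_move(&page->lru, src);
		continue;

	default:
		BUG();
	}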
>
>> @@ -1707,9 +1723,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> skipped += nr_skipped[zid];
>> }
>> }
>> - *nr_scanned = total_scan;
>> + *nr_scanned = scan;
>> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>> - total_scan, skipped, nr_taken, mode, lru);
>> + scan, skipped, nr_taken, mode, lru);
>> update_lru_sizes(lruvec, lru, nr_zone_taken);
>> return nr_taken;
>> }
>> --
>> 1.8.3.1
>>