[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0f9f7a2e-23c3-43fe-b5c1-dab3a7b31c2d@126.com>
Date: Mon, 29 Jul 2024 11:49:19 +0800
From: Ge Yang <yangge1116@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org,
Barry Song <21cnbao@...il.com>, David Hildenbrand <david@...hat.com>,
baolin.wang@...ux.alibaba.com, liuzixing@...on.cn,
Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH V2] mm/gup: Clear the LRU flag of a page before adding to
LRU batch
在 2024/7/28 6:33, Chris Li 写道:
> Hello Ge Yang,
>
> Sorry for joining this discussion late.
>
> I recently found a regression on mm-unstable during my swap stress
> test, using tmpfs to compile linux. The test hit OOM very soon after
> the make spawn many cc processes.
>
> This is preventing me from stress testing the swap allocator series on
> mm-unstable and mm-stable. I finally spent some time doing a kernel
> git bisect. It bisects down to this commit:
>
> 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9 is the first bad commit
> commit 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9
> Author: yangge <yangge1116@....com>
> Date: Wed Jul 3 20:02:33 2024 +0800
> mm/gup: clear the LRU flag of a page before adding to LRU batch
> mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
> 1 file changed, 31 insertions(+), 12 deletions(-)
> bisect found first bad commit
>
>
> Here the git bisect log:
> $ git bisect log
> # bad: [66ebbdfdeb093e097399b1883390079cd4c3022b] Merge tag
> 'irq-msi-2024-07-22' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> # good: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
> git bisect start 'remotes/akpm/mm-stable' 'v6.10'
> # good: [280e36f0d5b997173d014c07484c03a7f7750668] nsfs: use cleanup guard
> git bisect good 280e36f0d5b997173d014c07484c03a7f7750668
> # good: [07e773db19f16f4111795b658c4748da22c927bb] Merge tag
> 'tpmdd-next-6.11-rc1-roundtwo' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
> git bisect good 07e773db19f16f4111795b658c4748da22c927bb
> # good: [ef035628c326af9aa645af1b91fbb72fdfec874e] Merge tag
> 'i2c-for-6.11-rc1-try2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
> git bisect good ef035628c326af9aa645af1b91fbb72fdfec874e
> # good: [2c9b3512402ed192d1f43f4531fb5da947e72bd0] Merge tag
> 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
> git bisect good 2c9b3512402ed192d1f43f4531fb5da947e72bd0
> # bad: [30d77b7eef019fa4422980806e8b7cdc8674493e] mm/mglru: fix
> ineffective protection calculation
> git bisect bad 30d77b7eef019fa4422980806e8b7cdc8674493e
> # good: [c02525a33969000fa7b595b743deb4d79804916b] ftrace: unpoison
> ftrace_regs in ftrace_ops_list_func()
> git bisect good c02525a33969000fa7b595b743deb4d79804916b
> # good: [8ef6fd0e9ea83a792ba53882ddc6e0d38ce0d636] Merge branch
> 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix crashes from
> deferred split racing folio migration", needed by "mm: migrate: split
> folio_migrate_mapping()".
> git bisect good 8ef6fd0e9ea83a792ba53882ddc6e0d38ce0d636
> # good: [a898530eea3d0ba08c17a60865995a3bb468d1bc] powerpc/64e: split
> out nohash Book3E 64-bit code
> git bisect good a898530eea3d0ba08c17a60865995a3bb468d1bc
> # good: [00f58104202c472e487f0866fbd38832523fd4f9] mm: fix khugepaged
> activation policy
> git bisect good 00f58104202c472e487f0866fbd38832523fd4f9
> # good: [53dabce2652fb854eae84609ce9c37429d5d87ba] mm, page_alloc: put
> should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
> git bisect good 53dabce2652fb854eae84609ce9c37429d5d87ba
> # good: [6ab42fe21c84d72da752923b4bd7075344f4a362] alloc_tag: fix
> page_ext_get/page_ext_put sequence during page splitting
> git bisect good 6ab42fe21c84d72da752923b4bd7075344f4a362
> # bad: [33dfe9204f29b415bbc0abb1a50642d1ba94f5e9] mm/gup: clear the
> LRU flag of a page before adding to LRU batch
> git bisect bad 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9
> # good: [af649773fb25250cd22625af021fb6275c56a3ee] mm/numa_balancing:
> teach mpol_to_str about the balancing mode
> git bisect good af649773fb25250cd22625af021fb6275c56a3ee
> # first bad commit: [33dfe9204f29b415bbc0abb1a50642d1ba94f5e9] mm/gup:
> clear the LRU flag of a page before adding to LRU batch
>
> I double checked this commit 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch")
> fail the swap stress test very quickly.
>
> The previous commit af649773fb25250cd22625af021fb6275c56a3ee
> ("mm/numa_balancing: teach mpol_to_str about the balancing mode") can
> pass the swap stress test fine.
>
> Please feel free to send me patches to test out the issue. As it is, I
> believe it is a regression on the swapping behavior.
>
> Here is the dmesg of the OOM kill:
>
> [ 93.326752] cc1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL),
> order=0, oom_score_adj=0
> [ 93.327330] CPU: 3 PID: 5225 Comm: cc1 Tainted: G I
> 6.10.0-rc6+ #34
> [ 93.328277] Hardware name: HP ProLiant DL360 G7, BIOS P68 08/16/2015
> [ 93.328757] Call Trace:
> [ 93.328977] <TASK>
> [ 93.329515] dump_stack_lvl+0x5d/0x80
> [ 93.329842] dump_header+0x44/0x18d
> [ 93.330422] oom_kill_process.cold+0xa/0xaa
> [ 93.330723] out_of_memory+0x219/0x4b0
> [ 93.331037] mem_cgroup_out_of_memory+0x12d/0x160
> [ 93.331755] try_charge_memcg+0x488/0x630
> [ 93.332044] __mem_cgroup_charge+0x42/0xb0
> [ 93.332321] do_anonymous_page+0x32a/0x8b0
> [ 93.332553] ? __pte_offset_map+0x1b/0x180
> [ 93.332857] __handle_mm_fault+0xc05/0x1080
> [ 93.333141] ? sched_balance_trigger+0x14c/0x3f0
> [ 93.333840] ? sched_tick+0xee/0x320
> [ 93.334142] handle_mm_fault+0xcd/0x2a0
> [ 93.334419] do_user_addr_fault+0x217/0x620
> [ 93.334694] exc_page_fault+0x7e/0x180
> [ 93.334960] asm_exc_page_fault+0x26/0x30
> [ 93.335194] RIP: 0033:0x147a0b3
> [ 93.335852] Code: a0 00 48 89 fb 49 89 f4 41 89 d5 74 7a 31 d2 31
> f6 b9 01 00 00 00 bf 18 00 00 00 e8 97 b6 f8 ff 66 0f ef c0 66 41 83
> 3c 24 2b <4c> 89 60 10 48 89 c5 49 89 c6 0f 11 00 74 7e 48 8b 43 08 80
> 48 02
> [ 93.337577] RSP: 002b:00007ffe666e3e10 EFLAGS: 00010216
> [ 93.337966] RAX: 00007f4dd9d0e000 RBX: 00007ffe666e3e50 RCX: 00000000000000a9
> [ 93.338896] RDX: 0000000000000018 RSI: 0000000000000006 RDI: 00000000000000aa
> [ 93.339849] RBP: 00007f4dd9d0a0e0 R08: 0000000000000040 R09: 0000000000000001
> [ 93.340801] R10: 0000000000000000 R11: 0000000004c04560 R12: 00007f4dd9d04fd8
> [ 93.341675] R13: 0000000000000004 R14: 0000000000000000 R15: 000000007ffea943
> [ 93.342584] </TASK>
> [ 93.342762] memory: usage 481280kB, limit 481280kB, failcnt 9789
I can't reproduce this problem, using tmpfs to compile linux.
Seems you limit the memory size used to compile linux, which leads to
OOM. May I ask why the memory size is limited to 481280kB? Do I also
need to limit the memory size to 481280kB to test?
> [ 93.343556] swap: usage 123404kB, limit 9007199254740988kB, failcnt 0
> [ 93.343984] Memory cgroup stats for /build-kernel-tmpfs:
> [ 93.344051] anon 461377536
> [ 93.344586] file 10264576
> [ 93.344795] kernel 20480000
> [ 93.344987] kernel_stack 2146304
> [ 93.345615] pagetables 9916416
> [ 93.346283] sec_pagetables 0
> [ 93.346878] percpu 54496
> [ 93.347080] sock 0
> [ 93.347607] vmalloc 0
> [ 93.347837] shmem 24576
> [ 93.347984] zswap 0
> [ 93.348510] zswapped 0
> [ 93.348661] file_mapped 9805824
> [ 93.349286] file_dirty 0
> [ 93.349484] file_writeback 0
> [ 93.350085] swapcached 24576
> [ 93.350706] anon_thp 213909504
> [ 93.351335] file_thp 0
> [ 93.351544] shmem_thp 0
> [ 93.351681] inactive_anon 180965376
> [ 93.352348] active_anon 291487744
> [ 93.352993] inactive_file 1298432
> [ 93.353632] active_file 7987200
> [ 93.354281] unevictable 0
> [ 93.354483] slab_reclaimable 943096
> [ 93.355085] slab_unreclaimable 6340520
> [ 93.355369] slab 7283616
> [ 93.355597] workingset_refault_anon 1138
> [ 93.355857] workingset_refault_file 180
> [ 93.356135] workingset_activate_anon 627
> [ 93.356410] workingset_activate_file 123
> [ 93.356694] workingset_restore_anon 579
> [ 93.357001] workingset_restore_file 115
> [ 93.382485] workingset_nodereclaim 0
> [ 93.457426] pgscan 101315
> [ 93.457631] pgsteal 51494
> [ 93.457843] pgscan_kswapd 0
> [ 93.458033] pgscan_direct 101315
> [ 93.458725] pgscan_khugepaged 0
> [ 93.459494] pgsteal_kswapd 0
> [ 93.460338] pgsteal_direct 51494
> [ 93.461046] pgsteal_khugepaged 0
> [ 93.461701] pgfault 994774
> [ 93.461895] pgmajfault 1839
> [ 93.462123] pgrefill 134581
> [ 93.462315] pgactivate 32506
> [ 93.463086] pgdeactivate 0
> [ 93.463314] pglazyfree 0
> [ 93.463527] pglazyfreed 0
> [ 93.463727] zswpin 0
> [ 93.463912] zswpout 0
> [ 93.464114] zswpwb 0
> [ 93.464321] thp_fault_alloc 485
> [ 93.464963] thp_collapse_alloc 0
> [ 93.465578] thp_swpout 4
> [ 93.465815] thp_swpout_fallback 0
> [ 93.466457] Tasks state (memory values in pages):
> [ 93.467153] [ pid ] uid tgid total_vm rss rss_anon
> rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
> [ 93.467917] [ 1461] 1000 1461 1795 530 53
> 477 0 45056 0 0 kbench
> [ 93.468600] [ 4170] 1000 4170 636 321 0
> 321 0 45056 0 0 time
> [ 93.569307] [ 4171] 1000 4171 3071 810 48
> 762 0 69632 48 0 make
> [ 93.570111] [ 4172] 1000 4172 2706 827 144
> 683 0 65536 192 0 make
> [ 93.571015] [ 4951] 1000 4951 2733 791 144
> 647 0 61440 192 0 make
> [ 93.571747] [ 4956] 1000 4956 2560 852 144
> 708 0 69632 0 0 make
> [ 93.572478] [ 4957] 1000 4957 2541 803 96
> 707 0 61440 96 0 make
> [ 93.573244] [ 4958] 1000 4958 2541 750 96
> 654 0 53248 48 0 make
> [ 93.574016] [ 4960] 1000 4960 2565 753 96
> 657 0 65536 48 0 make
> [ 93.674651] [ 4961] 1000 4961 2538 837 144
> 693 0 53248 0 0 make
> [ 93.675446] [ 4962] 1000 4962 2569 845 192
> 653 0 69632 0 0 make
> [ 93.676220] [ 4963] 1000 4963 2567 852 192
> 660 0 57344 0 0 make
> [ 93.676946] [ 4964] 1000 4964 2536 901 192
> 709 0 65536 0 0 make
> [ 93.677679] [ 4965] 1000 4965 2540 887 192
> 695 0 61440 0 0 make
> [ 93.678377] [ 4967] 1000 4967 2563 853 144
> 709 0 61440 48 0 make
> [ 93.679168] [ 4969] 1000 4969 2538 836 144
> 692 0 57344 48 0 make
> [ 93.679937] [ 4973] 1000 4973 2535 827 144
> 683 0 61440 48 0 make
> [ 93.680628] [ 4976] 1000 4976 2571 878 192
> 686 0 57344 0 0 make
> [ 93.681397] [ 4977] 1000 4977 2534 850 192
> 658 0 53248 0 0 make
> [ 93.682121] [ 4978] 1000 4978 1797 766 48
> 718 0 49152 0 0 sh
> [ 93.683272] [ 4980] 1000 4980 2540 839 192
> 647 0 65536 48 0 make
> [ 93.709270] [ 4982] 1000 4982 2539 853 144
> 709 0 65536 0 0 make
> [ 93.784725] [ 4983] 1000 4983 1798 885 96
> 789 0 61440 0 0 sh
> [ 93.785895] [ 4984] 1000 4984 2539 878 192
> 686 0 57344 0 0 make
> [ 93.786661] [ 4986] 1000 4986 2537 863 192
> 671 0 61440 0 0 make
> [ 93.787378] [ 4988] 1000 4988 2540 824 144
> 680 0 61440 48 0 make
> [ 93.788060] [ 4989] 1000 4989 2538 792 144
> 648 0 65536 0 0 make
> [ 93.788873] [ 4990] 1000 4990 1282 810 48
> 762 0 45056 0 0 gcc
>
> Chris
>
> On Fri, Jun 21, 2024 at 11:48 PM <yangge1116@....com> wrote:
>>
>> From: yangge <yangge1116@....com>
>>
>> If a large number of CMA memory are configured in system (for example, the
>> CMA memory accounts for 50% of the system memory), starting a virtual
>> virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM,
>> ...) to pin memory. Normally if a page is present and in CMA area,
>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
>> area because of FOLL_LONGTERM flag. But the current code will cause the
>> migration failure due to unexpected page refcounts, and eventually cause
>> the virtual machine fail to start.
>>
>> If a page is added in LRU batch, its refcount increases one, remove the
>> page from LRU batch decreases one. Page migration requires the page is not
>> referenced by others except page mapping. Before migrating a page, we
>> should try to drain the page from LRU batch in case the page is in it,
>> however, folio_test_lru() is not sufficient to tell whether the page is
>> in LRU batch or not, if the page is in LRU batch, the migration will fail.
>>
>> To solve the problem above, we modify the logic of adding to LRU batch.
>> Before adding a page to LRU batch, we clear the LRU flag of the page so
>> that we can check whether the page is in LRU batch by folio_test_lru(page).
>> Seems making the LRU flag of the page invisible a long time is no problem,
>> because a new page is allocated from buddy and added to the lru batch,
>> its LRU flag is also not visible for a long time.
>>
>> Cc: <stable@...r.kernel.org>
>> Signed-off-by: yangge <yangge1116@....com>
>> ---
>> mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
>> 1 file changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index dc205bd..9caf6b0 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -211,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>> for (i = 0; i < folio_batch_count(fbatch); i++) {
>> struct folio *folio = fbatch->folios[i];
>>
>> - /* block memcg migration while the folio moves between lru */
>> - if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
>> - continue;
>> -
>> folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
>> move_fn(lruvec, folio);
>>
>> @@ -255,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>> void folio_rotate_reclaimable(struct folio *folio)
>> {
>> if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
>> - !folio_test_unevictable(folio) && folio_test_lru(folio)) {
>> + !folio_test_unevictable(folio)) {
>> struct folio_batch *fbatch;
>> unsigned long flags;
>>
>> folio_get(folio);
>> + if (!folio_test_clear_lru(folio)) {
>> + folio_put(folio);
>> + return;
>> + }
>> +
>> local_lock_irqsave(&lru_rotate.lock, flags);
>> fbatch = this_cpu_ptr(&lru_rotate.fbatch);
>> folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
>> @@ -352,11 +353,15 @@ static void folio_activate_drain(int cpu)
>>
>> void folio_activate(struct folio *folio)
>> {
>> - if (folio_test_lru(folio) && !folio_test_active(folio) &&
>> - !folio_test_unevictable(folio)) {
>> + if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
>> struct folio_batch *fbatch;
>>
>> folio_get(folio);
>> + if (!folio_test_clear_lru(folio)) {
>> + folio_put(folio);
>> + return;
>> + }
>> +
>> local_lock(&cpu_fbatches.lock);
>> fbatch = this_cpu_ptr(&cpu_fbatches.activate);
>> folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
>> @@ -700,6 +705,11 @@ void deactivate_file_folio(struct folio *folio)
>> return;
>>
>> folio_get(folio);
>> + if (!folio_test_clear_lru(folio)) {
>> + folio_put(folio);
>> + return;
>> + }
>> +
>> local_lock(&cpu_fbatches.lock);
>> fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
>> folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
>> @@ -716,11 +726,16 @@ void deactivate_file_folio(struct folio *folio)
>> */
>> void folio_deactivate(struct folio *folio)
>> {
>> - if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
>> - (folio_test_active(folio) || lru_gen_enabled())) {
>> + if (!folio_test_unevictable(folio) && (folio_test_active(folio) ||
>> + lru_gen_enabled())) {
>> struct folio_batch *fbatch;
>>
>> folio_get(folio);
>> + if (!folio_test_clear_lru(folio)) {
>> + folio_put(folio);
>> + return;
>> + }
>> +
>> local_lock(&cpu_fbatches.lock);
>> fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
>> folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
>> @@ -737,12 +752,16 @@ void folio_deactivate(struct folio *folio)
>> */
>> void folio_mark_lazyfree(struct folio *folio)
>> {
>> - if (folio_test_lru(folio) && folio_test_anon(folio) &&
>> - folio_test_swapbacked(folio) && !folio_test_swapcache(folio) &&
>> - !folio_test_unevictable(folio)) {
>> + if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
>> + !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) {
>> struct folio_batch *fbatch;
>>
>> folio_get(folio);
>> + if (!folio_test_clear_lru(folio)) {
>> + folio_put(folio);
>> + return;
>> + }
>> +
>> local_lock(&cpu_fbatches.lock);
>> fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree);
>> folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn);
>> --
>> 2.7.4
>>
>>
Powered by blists - more mailing lists