Message-ID: <386d7d5f-a57d-f5b1-acee-131ce23d35ec@linux.alibaba.com>
Date: Wed, 4 Mar 2020 10:34:03 +0800
From: Shile Zhang <shile.zhang@...ux.alibaba.com>
To: Kirill Tkhai <ktkhai@...tuozzo.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Pavel Tatashin <pasha.tatashin@...een.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Vlastimil Babka <vbabka@...e.cz>,
Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH v2 1/1] mm: fix interrupt disabled long time inside
deferred_init_memmap()
Hi Kirill,
Thanks for your quick reply!
On 2020/3/4 00:52, Kirill Tkhai wrote:
> On 03.03.2020 19:15, Shile Zhang wrote:
>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, the 'pgdatinit' kthread
>> initialises the deferred pages with local interrupts disabled. This was
>> introduced by commit 3a2d7fa8a3d5 ("mm: disable interrupts while
>> initializing deferred pages").
>>
>> Local interrupts stay disabled for a long time inside
>> deferred_init_memmap(), depending on memory size.
>> On machines with NCPUS <= 2, the 'pgdatinit' kthread can be pinned on the
>> boot CPU; the tick timer then stalls for a long time, which makes the
>> system wall time inaccurate.
>>
>> For example, dmesg showed:
>>
>> [ 0.197975] node 0 initialised, 32170688 pages in 1ms
>>
>> Obviously, 1ms is unreasonable.
>> Now, fix it by letting pending interrupts in inside the while loop.
>> With the fix, dmesg shows a reasonable time:
>>
>> [ 1.069306] node 0 initialised, 32203456 pages in 894ms
> The way I understand the original problem that Pavel fixed:
>
> we need to disable irqs in deferred_init_memmap() since this function may be called
> in parallel with deferred_grow_zone() called from an interrupt handler. So, Pavel
> added the lock to fix the race.
>
> If we temporarily unlock the lock, interrupts are still possible,
> so my previous proposition brings the problem back.
>
> Thinking about it again, I think we just have to add:
>
> pgdat_resize_unlock(pgdat, &flags);
> pgdat_resize_lock(pgdat, &flags);
>
> instead of just releasing interrupts, since if we release them with the lock held,
> a call of interrupt->deferred_grow_zone() brings us to a deadlock.
>
> So, unlocking the lock is a must.
Yes, you're right! I missed this point.
Thanks for your comment!
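
For readers following the thread, here is a minimal sketch of the deadlock
Kirill describes, based on the mainline code at the time (pgdat_resize_lock()
and pgdat_resize_unlock() wrap spin_lock_irqsave()/spin_unlock_irqrestore()
on pgdat->node_size_lock); the comments are illustrative only, not a patch:

	/* deferred_init_memmap(), v2 approach, node resize lock held: */
	pgdat_resize_lock(pgdat, &flags);	/* irqs off, lock taken */
	while (spfn < epfn) {
		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
		local_irq_restore(flags);	/* irqs back on, lock still held */
		/*
		 * If an interrupt arrives here and its handler allocates
		 * pages, it can fall into deferred_grow_zone(), which also
		 * takes pgdat_resize_lock(pgdat, ...) -- on the same CPU
		 * that already holds the lock, so it spins forever.
		 * Hence the lock itself must be dropped, not just the
		 * interrupt flags.
		 */
		local_irq_save(flags);
	}
	pgdat_resize_unlock(pgdat, &flags);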
>
>> Signed-off-by: Shile Zhang <shile.zhang@...ux.alibaba.com>
>> ---
>> mm/page_alloc.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3c4eb750a199..d3f337f2e089 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1809,8 +1809,12 @@ static int __init deferred_init_memmap(void *data)
>> * that we can avoid introducing any issues with the buddy
>> * allocator.
>> */
>> - while (spfn < epfn)
>> + while (spfn < epfn) {
>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>> + /* let in any pending interrupts */
>> + local_irq_restore(flags);
>> + local_irq_save(flags);
>> + }
>> zone_empty:
>> pgdat_resize_unlock(pgdat, &flags);
> I think we need something like this here (untested):
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 79e950d76ffc..323afa9a4db5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
> {
> pg_data_t *pgdat = data;
> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> + unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
> unsigned long first_init_pfn, flags;
> unsigned long start = jiffies;
> struct zone *zone;
> @@ -1869,8 +1869,18 @@ static int __init deferred_init_memmap(void *data)
> * that we can avoid introducing any issues with the buddy
> * allocator.
> */
> - while (spfn < epfn)
> + while (spfn < epfn) {
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + /*
> + * Release interrupts every 1GB to give the timer
> + * a chance to advance jiffies.
> + */
> + if (nr_pages - prev_nr_pages > (1UL << (30 - PAGE_SHIFT))) {
> + prev_nr_pages = nr_pages;
> + pgdat_resize_unlock(pgdat, &flags);
> + pgdat_resize_lock(pgdat, &flags);
> + }
> + }
> zone_empty:
> pgdat_resize_unlock(pgdat, &flags);
>
>
> (I believe the comment could be improved further.)
Yeah, your patch is better!
I tested your code and it works!
But it seems that 1GB still keeps interrupts off too long, about 40ms in my
environment (Intel(R) Xeon(R) 2.5GHz). I tried other sizes, and 1024 pages
(4MB) works fine, which is what Andrew suggested before.
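
For reference, a rough back-of-envelope calculation (assuming 4KB pages,
i.e. PAGE_SHIFT == 12, and that the ~40ms measured per 1GB chunk scales
roughly linearly with the chunk size):

	1GB threshold:  1UL << (30 - PAGE_SHIFT) = 1 << 18 = 262144 pages
	                -> ~40ms with interrupts off per chunk (measured above)
	1024 pages:     1/256 of that, i.e. ~40ms / 256 ~= 0.16ms per chunk
	                -> well under one tick even at HZ=1000 (1ms per tick)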
Could you please help to review it again?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..5def66d3ffcd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1768,7 +1768,7 @@ static int __init deferred_init_memmap(void *data)
{
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
- unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+ unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
@@ -1809,8 +1809,18 @@ static int __init deferred_init_memmap(void *data)
* that we can avoid introducing any issues with the buddy
* allocator.
*/
- while (spfn < epfn)
+ while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ /*
+ * Restore pending interrupts every 1024 pages to give
+ * the tick timer a chance to advance jiffies.
+ */
+ if (nr_pages - prev_nr_pages > 1024) {
+ prev_nr_pages = nr_pages;
+ pgdat_resize_unlock(pgdat, &flags);
+ pgdat_resize_lock(pgdat, &flags);
+ }
+ }
zone_empty:
pgdat_resize_unlock(pgdat, &flags);
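
As an aside, the batching idea is not kernel-specific. Below is a minimal,
self-contained userspace analogue (purely hypothetical demo code, not part
of the patch): a worker holds a mutex while processing a long array, but
drops and re-takes it every BATCH items, so another thread (standing in for
the interrupt/deferred_grow_zone() path) is never starved for long.

	/* batch_unlock_demo.c -- build with: cc -pthread batch_unlock_demo.c */
	#include <pthread.h>
	#include <stdio.h>

	#define NR_ITEMS (1 << 20)
	#define BATCH    1024		/* same granularity as the patch above */

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static long data[NR_ITEMS];
	static long processed;

	static void *worker(void *arg)
	{
		long batch = 0;

		(void)arg;
		pthread_mutex_lock(&lock);
		for (long i = 0; i < NR_ITEMS; i++) {
			data[i] = i;	/* stands in for deferred_init_maxorder() */
			processed++;
			if (++batch >= BATCH) {
				batch = 0;
				/* Let any waiter in, then carry on. */
				pthread_mutex_unlock(&lock);
				pthread_mutex_lock(&lock);
			}
		}
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	static void *waiter(void *arg)
	{
		/* Stands in for the path that must not be blocked for long. */
		(void)arg;
		pthread_mutex_lock(&lock);
		printf("waiter got the lock after %ld items\n", processed);
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t w, v;

		pthread_create(&w, NULL, worker, NULL);
		pthread_create(&v, NULL, waiter, NULL);
		pthread_join(w, NULL);
		pthread_join(v, NULL);
		printf("processed %ld items\n", processed);
		return 0;
	}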