Message-ID: <386d7d5f-a57d-f5b1-acee-131ce23d35ec@linux.alibaba.com>
Date: Wed, 4 Mar 2020 10:34:03 +0800
From: Shile Zhang <shile.zhang@...ux.alibaba.com>
To: Kirill Tkhai <ktkhai@...tuozzo.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Pavel Tatashin <pasha.tatashin@...een.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Vlastimil Babka <vbabka@...e.cz>,
Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH v2 1/1] mm: fix interrupt disabled long time inside
deferred_init_memmap()
Hi Kirill,
Thanks for your quick reply!
On 2020/3/4 00:52, Kirill Tkhai wrote:
> On 03.03.2020 19:15, Shile Zhang wrote:
>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, the 'pgdatinit' kthread
>> initialises the deferred pages with local interrupts disabled. This was
>> introduced by commit 3a2d7fa8a3d5 ("mm: disable interrupts while
>> initializing deferred pages").
>>
>> Local interrupts stay disabled for a long time inside
>> deferred_init_memmap(), depending on memory size.
>> On machines with NCPUS <= 2, the 'pgdatinit' kthread can be pinned on the
>> boot CPU; the tick timer then stalls for a long time, which makes the
>> system wall time inaccurate.
>>
>> For example, dmesg showed:
>>
>> [ 0.197975] node 0 initialised, 32170688 pages in 1ms
>>
>> Obviously, 1ms is unreasonable.
>> Now, fix it by letting pending interrupts in inside the while loop.
>> With the fix, dmesg shows a reasonable time:
>>
>> [ 1.069306] node 0 initialised, 32203456 pages in 894ms
> The way I understand the original problem that Pavel fixed:
>
> we need to disable irqs in deferred_init_memmap() since this function may be called
> in parallel with deferred_grow_zone() called from an interrupt handler. So, Pavel
> added the lock to fix the race.
>
> If we temporarily unlock the lock, interrupts are still possible,
> so my previous proposition brings the problem back.
>
> Thinking about it again, I think we just have to add:
>
> pgdat_resize_unlock(pgdat, &flags);
> pgdat_resize_lock(pgdat, &flags);
>
> instead of just releasing interrupts, since if we release them with the lock held,
> a call of interrupt->deferred_grow_zone() brings us to a deadlock.
>
> So, unlocking the lock is a must.
Yes, you're right! I missed this point.
Thanks for your comment!
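
For readers following the thread, here is a minimal sketch of the deadlock
Kirill describes, based on the mainline code at the time (pgdat_resize_lock()
and pgdat_resize_unlock() wrap spin_lock_irqsave()/spin_unlock_irqrestore()
on pgdat->node_size_lock); the comments are illustrative only, not a patch:

	/* deferred_init_memmap(), v2 approach, node resize lock held: */
	pgdat_resize_lock(pgdat, &flags);	/* irqs off, lock taken */
	while (spfn < epfn) {
		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
		local_irq_restore(flags);	/* irqs back on, lock still held */
		/*
		 * If an interrupt arrives here and its handler allocates
		 * pages, it can fall into deferred_grow_zone(), which also
		 * takes pgdat_resize_lock(pgdat, ...) -- on the same CPU
		 * that already holds the lock, so it spins forever.
		 * Hence the lock itself must be dropped, not just the
		 * interrupt flags.
		 */
		local_irq_save(flags);
	}
	pgdat_resize_unlock(pgdat, &flags);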
>
>> Signed-off-by: Shile Zhang <shile.zhang@...ux.alibaba.com>
>> ---
>> mm/page_alloc.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3c4eb750a199..d3f337f2e089 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1809,8 +1809,12 @@ static int __init deferred_init_memmap(void *data)
>> * that we can avoid introducing any issues with the buddy
>> * allocator.
>> */
>> - while (spfn < epfn)
>> + while (spfn < epfn) {
>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>> + /* let in any pending interrupts */
>> + local_irq_restore(flags);
>> + local_irq_save(flags);
>> + }
>> zone_empty:
>> pgdat_resize_unlock(pgdat, &flags);
> I think we need something like this here (untested):
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 79e950d76ffc..323afa9a4db5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
> {
> pg_data_t *pgdat = data;
> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> + unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
> unsigned long first_init_pfn, flags;
> unsigned long start = jiffies;
> struct zone *zone;
> @@ -1869,8 +1869,18 @@ static int __init deferred_init_memmap(void *data)
> * that we can avoid introducing any issues with the buddy
> * allocator.
> */
> - while (spfn < epfn)
> + while (spfn < epfn) {
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + /*
> + * Release interrupts every 1GB to give the timer
> + * a chance to advance jiffies.
> + */
> + if (nr_pages - prev_nr_pages > (1UL << (30 - PAGE_SHIFT))) {
> + prev_nr_pages = nr_pages;
> + pgdat_resize_unlock(pgdat, &flags);
> + pgdat_resize_lock(pgdat, &flags);
> + }
> + }
> zone_empty:
> pgdat_resize_unlock(pgdat, &flags);
>
>
> (I believe the comment could be improved further.)
Yeah, your patch is better!
I tested your code and it works!
But it seems that 1GB still keeps interrupts off too long, about 40ms in my
environment (Intel(R) Xeon(R) 2.5GHz). I tried other sizes, and 1024 pages
(4MB) works fine, which is what Andrew suggested before.
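
For reference, a rough back-of-envelope calculation (assuming 4KB pages,
i.e. PAGE_SHIFT == 12, and that the ~40ms measured per 1GB chunk scales
roughly linearly with the chunk size):

	1GB threshold:  1UL << (30 - PAGE_SHIFT) = 1 << 18 = 262144 pages
	                -> ~40ms with interrupts off per chunk (measured above)
	1024 pages:     1/256 of that, i.e. ~40ms / 256 ~= 0.16ms per chunk
	                -> well under one tick even at HZ=1000 (1ms per tick)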
Could you please help to review it again?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..5def66d3ffcd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1768,7 +1768,7 @@ static int __init deferred_init_memmap(void *data)
{
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
- unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+ unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
@@ -1809,8 +1809,18 @@ static int __init deferred_init_memmap(void *data)
* that we can avoid introducing any issues with the buddy
* allocator.
*/
- while (spfn < epfn)
+ while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ /*
+ * Restore pending interrupts every 1024 pages to give
+ * the tick timer a chance to advance jiffies.
+ */
+ if (nr_pages - prev_nr_pages > 1024) {
+ prev_nr_pages = nr_pages;
+ pgdat_resize_unlock(pgdat, &flags);
+ pgdat_resize_lock(pgdat, &flags);
+ }
+ }
zone_empty:
pgdat_resize_unlock(pgdat, &flags);
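
As an aside, the batching idea is not kernel-specific. Below is a minimal,
self-contained userspace analogue (purely hypothetical demo code, not part
of the patch): a worker holds a mutex while processing a long array, but
drops and re-takes it every BATCH items, so another thread (standing in for
the interrupt/deferred_grow_zone() path) is never starved for long.

	/* batch_unlock_demo.c -- build with: cc -pthread batch_unlock_demo.c */
	#include <pthread.h>
	#include <stdio.h>

	#define NR_ITEMS (1 << 20)
	#define BATCH    1024		/* same granularity as the patch above */

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static long data[NR_ITEMS];
	static long processed;

	static void *worker(void *arg)
	{
		long batch = 0;

		(void)arg;
		pthread_mutex_lock(&lock);
		for (long i = 0; i < NR_ITEMS; i++) {
			data[i] = i;	/* stands in for deferred_init_maxorder() */
			processed++;
			if (++batch >= BATCH) {
				batch = 0;
				/* Let any waiter in, then carry on. */
				pthread_mutex_unlock(&lock);
				pthread_mutex_lock(&lock);
			}
		}
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	static void *waiter(void *arg)
	{
		/* Stands in for the path that must not be blocked for long. */
		(void)arg;
		pthread_mutex_lock(&lock);
		printf("waiter got the lock after %ld items\n", processed);
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t w, v;

		pthread_create(&w, NULL, worker, NULL);
		pthread_create(&v, NULL, waiter, NULL);
		pthread_join(w, NULL);
		pthread_join(v, NULL);
		printf("processed %ld items\n", processed);
		return 0;
	}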