Message-ID: <e059d5be-79fe-b2c3-0d77-c33da4e7bce0@virtuozzo.com>
Date: Wed, 11 Mar 2020 15:04:50 +0300
From: Kirill Tkhai <ktkhai@...tuozzo.com>
To: Shile Zhang <shile.zhang@...ux.alibaba.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Pavel Tatashin <pasha.tatashin@...een.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Vlastimil Babka <vbabka@...e.cz>,
Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH v2 1/1] mm: fix interrupt disabled long time inside
deferred_init_memmap()
On 11.03.2020 15:00, Shile Zhang wrote:
>
>
> On 2020/3/11 19:42, Kirill Tkhai wrote:
>> On 11.03.2020 04:44, Shile Zhang wrote:
>>> Hi Kirill,
>>>
>>> Sorry for the late reply!
>>> I don't fully understand the whole deferred page init mechanism, so I
>>> focused on the jiffies update issue itself.
>>>
>>> Maybe I'm on the wrong path, but deferred page init seems to make no sense on a
>>> 1-CPU system, since it cannot initialize memory in parallel there.
>> Yes, it can't be initialized in parallel. But the scheduler treats the deferred thread as a separate
>> task, so CPU time will be shared between that thread and the rest of the tasks. This should still
>> make boot faster.
>
> Thanks for your quickly reply!
>
> Sorry, I don't think the CPU time can be shared with the rest of the tasks, since the CPU is blocked
> until the deferred pages are all initialized, as in the following code:
> 8<------
> #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>
> /* There will be num_node_state(N_MEMORY) threads */
> atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
> for_each_node_state(nid, N_MEMORY) {
> kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
> }
>
> /* Block until all are initialised */
> wait_for_completion(&pgdat_init_all_done_comp);
> 8<-------
Yes, sure. Then there won't be any runtime profit.
Anyway, we get a maintenance profit when all the code is written in a generic way.
No new corner cases are welcome.
> Maybe I misunderstood this point.
>
> @Pavel,
> Could you please give any advice on this point? Thanks!
>
>>> It might be better to disable deferred page init in 'deferred_init' in the case of 1 CPU
>>> (or only one memory node).
>>>
>>> In other words, it seems the better way to solve this issue is to not bind the 'pgdatinit'
>>> thread to the boot CPU.
>>>
>>> I also refactored the patch based on your comments; please help check it, thanks!
>>>
>>>
>>> On 2020/3/4 18:47, Kirill Tkhai wrote:
>>>> On 04.03.2020 05:34, Shile Zhang wrote:
>>>>> Hi Kirill,
>>>>>
>>>>> Thanks for your quickly reply!
>>>>>
>>>>> On 2020/3/4 00:52, Kirill Tkhai wrote:
>>>>>> On 03.03.2020 19:15, Shile Zhang wrote:
>>>>>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthread will
>>>>>>> initialise the deferred pages with local interrupts disabled. It is
>>>>>>> introduced by commit 3a2d7fa8a3d5 ("mm: disable interrupts while
>>>>>>> initializing deferred pages").
>>>>>>>
>>>>>>> Local interrupts will be disabled for a long time inside
>>>>>>> deferred_init_memmap(), depending on the memory size.
>>>>>>> On machines with NCPUS <= 2, the 'pgdatinit' kthread could be pinned on the
>>>>>>> boot CPU; then the tick timer will be stuck for a long time, which makes the
>>>>>>> system wall time inaccurate.
>>>>>>>
>>>>>>> For example, the dmesg shown that:
>>>>>>>
>>>>>>> [ 0.197975] node 0 initialised, 32170688 pages in 1ms
>>>>>>>
>>>>>>> Obviously, 1ms is unreasonable.
>>>>>>> Now, fix it by restoring pending interrupts inside the while loop.
>>>>>>> The dmesg then shows a reasonable time:
>>>>>>>
>>>>>>> [ 1.069306] node 0 initialised, 32203456 pages in 894ms
>>>>>> The way I understand the original problem that Pavel fixed:
>>>>>>
>>>>>> we need to disable irqs in deferred_init_memmap() since this function may be called
>>>>>> in parallel with deferred_grow_zone() called from an interrupt handler. So, Pavel
>>>>>> added the lock to fix the race.
>>>>>>
>>>>>> If we temporarily unlock the lock, the interrupt is still possible,
>>>>>> so my previous proposition brings the problem back.
>>>>>>
>>>>>> Thinking about it again, I believe we just have to add:
>>>>>>
>>>>>> pgdat_resize_unlock(pgdat, &flags);
>>>>>> pgdat_resize_lock(pgdat, &flags);
>>>>>>
>>>>>> instead of releasing interrupts, since if we just release them with the lock held,
>>>>>> a call of interrupt->deferred_grow_zone() brings us to a deadlock.
>>>>>>
>>>>>> So, unlocking the lock is a must.
>>>>> Yes, you're right! I missed this point.
>>>>> Thanks for your comment!
>>>>>
>>>>>>> Signed-off-by: Shile Zhang<shile.zhang@...ux.alibaba.com>
>>>>>>> ---
>>>>>>> mm/page_alloc.c | 6 +++++-
>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 3c4eb750a199..d3f337f2e089 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -1809,8 +1809,12 @@ static int __init deferred_init_memmap(void *data)
>>>>>>> * that we can avoid introducing any issues with the buddy
>>>>>>> * allocator.
>>>>>>> */
>>>>>>> - while (spfn < epfn)
>>>>>>> + while (spfn < epfn) {
>>>>>>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>>>>>> + /* let in any pending interrupts */
>>>>>>> + local_irq_restore(flags);
>>>>>>> + local_irq_save(flags);
>>>>>>> + }
>>>>>>> zone_empty:
>>>>>>> pgdat_resize_unlock(pgdat, &flags);
>>>>>> I think we need here something like below (untested):
>>>>>>
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index 79e950d76ffc..323afa9a4db5 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
>>>>>> {
>>>>>> pg_data_t *pgdat = data;
>>>>>> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>>>>>> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
>>>>>> + unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
>>>>>> unsigned long first_init_pfn, flags;
>>>>>> unsigned long start = jiffies;
>>>>>> struct zone *zone;
>>>>>> @@ -1869,8 +1869,18 @@ static int __init deferred_init_memmap(void *data)
>>>>>> * that we can avoid introducing any issues with the buddy
>>>>>> * allocator.
>>>>>> */
>>>>>> - while (spfn < epfn)
>>>>>> + while (spfn < epfn) {
>>>>>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>>>>> + /*
>>>>>> + * Release interrupts every 1Gb to give a possibility
>>>>>> + * a timer to advance jiffies.
>>>>>> + */
>>>>>> + if (nr_pages - prev_nr_pages > (1UL << (30 - PAGE_SHIFT))) {
>>>>>> + prev_nr_pages = nr_pages;
>>>>>> + pgdat_resize_unlock(pgdat, &flags);
>>>>>> + pgdat_resize_lock(pgdat, &flags);
>>>>>> + }
>>>>>> + }
>>>>>> zone_empty:
>>>>>> pgdat_resize_unlock(pgdat, &flags);
>>>>>> (I believe the comment can be improved further.)
>>>>> Yeah, your patch is better!
>>>>> I tested your code and it works!
>>>>> But it seems that 1G still holds the interrupts too long, about 40ms in my env
>>>>> with an Intel(R) Xeon(R) 2.5GHz. I tried other sizes; it is OK to use 1024 pages (4MB),
>>>>> which Andrew suggested before.
>>>>>
>>>>> Could you please help to review it again?
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 3c4eb750a199..5def66d3ffcd 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -1768,7 +1768,7 @@ static int __init deferred_init_memmap(void *data)
>>>>> {
>>>>> pg_data_t *pgdat = data;
>>>>> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>>>>> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
>>>>> + unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
>>>>> unsigned long first_init_pfn, flags;
>>>>> unsigned long start = jiffies;
>>>>> struct zone *zone;
>>>>> @@ -1809,8 +1809,17 @@ static int __init deferred_init_memmap(void *data)
>>>>> * that we can avoid introducing any issues with the buddy
>>>>> * allocator.
>>>>> */
>>>>> - while (spfn < epfn)
>>>>> + while (spfn < epfn) {
>>>>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>>>> + /*
>>>>> + * Restore pending interrupts every 1024 pages to give
>>>>> + * the chance tick timer to advance jiffies.
>>>>> + */
>>>>> + if (nr_pages - prev_nr_pages > 1024) {
>>>>> + pgdat_resize_unlock(&flags);
>>>>> + pgdat_resize_lock(&flags);
>>>> Here is a problem: prev_nr_pages must be updated.
>>>>
>>>> Anyway, releasing every 4M looks wrong to me, since you remove the fix that Pavel introduced.
>>>> He protected against big allocations made from interrupt context. But if we unlock
>>>> the lock after 4Mb, only 4Mb will be available for allocations from interrupts. pgdat->first_deferred_pfn
>>>> is updated at the start of the function, so interrupt allocations won't be able to initialize
>>>> more for themselves.
>>> Yes, you're right. I missed this point since I'm not fully understood the code before.
>>> Thanks for your advice!
>>>> If you want to release interrupts very often, you need some creativity with first_deferred_pfn.
>>>> We should update it sequentially. Something like below (untested):
>>> I got your point now, thanks!
>>>> ---
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 79e950d76ffc..be09d158baeb 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
>>>> {
>>>> pg_data_t *pgdat = data;
>>>> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>>>> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
>>>> + unsigned long spfn = 0, epfn = 0, nr_pages;
>>>> unsigned long first_init_pfn, flags;
>>>> unsigned long start = jiffies;
>>>> struct zone *zone;
>>>> @@ -1838,7 +1838,7 @@ static int __init deferred_init_memmap(void *data)
>>>> /* Bind memory initialisation thread to a local node if possible */
>>>> if (!cpumask_empty(cpumask))
>>>> set_cpus_allowed_ptr(current, cpumask);
>>>> -
>>>> +again:
>>>> pgdat_resize_lock(pgdat, &flags);
>>>> first_init_pfn = pgdat->first_deferred_pfn;
>>>> if (first_init_pfn == ULONG_MAX) {
>>>> @@ -1850,7 +1850,6 @@ static int __init deferred_init_memmap(void *data)
>>>> /* Sanity check boundaries */
>>>> BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
>>>> BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
>>>> - pgdat->first_deferred_pfn = ULONG_MAX;
>>>> /* Only the highest zone is deferred so find it */
>>>> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>>>> @@ -1864,14 +1863,30 @@ static int __init deferred_init_memmap(void *data)
>>>> first_init_pfn))
>>>> goto zone_empty;
>>>> + nr_pages = 0;
>>> 'nr_pages' was used to count the total initialized pages before, so it cannot be zeroed each round.
>>> It seems we need one more counter for the pages initialized each round.
>>>
>>>> +
>>>> /*
>>>> * Initialize and free pages in MAX_ORDER sized increments so
>>>> * that we can avoid introducing any issues with the buddy
>>>> * allocator.
>>>> + * Final iteration marker is: spfn=ULONG_MAX and epfn=0.
>>>> */
>>>> - while (spfn < epfn)
>>>> + while (spfn < epfn) {
>>>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>>> + if (!epfn)
>>>> + break;
>>> It seems 'epfn' never goes to 0, since it is the "end page frame number", right?
>>> So this is needless.
>>>> + pgdat->first_deferred_pfn = epfn;
>>> I think first_deferred_pfn is updated with the wrong value here; it seems it should be spfn, the start pfn, right?
>>>> + /*
>>>> + * Restore pending interrupts every 128Mb to give
>>>> + * the chance tick timer to advance jiffies.
>>>> + */
>>>> + if (nr_pages > (1UL << 27 - PAGE_SHIFT)) {
>>>> + pgdat_resize_unlock(pgdat, &flags);
>>>> + goto again;
>>>> + }
>>>> + }
>>>> zone_empty:
>>>> + pgdat->first_deferred_pfn = ULONG_MAX;
>>>> pgdat_resize_unlock(pgdat, &flags);
>>>> /* Sanity check that the next zone really is unpopulated */
>>>>
>>>>
>>> I updated the patch based on your comments, and it passed the test.
>>> Could you please help review it again? Thanks!
>> The patch is OK for me. It may be sent to the mailing list with an appropriate description.
>>
>> Please don't forget to add my:
>>
>> Co-developed-by: Kirill Tkhai <ktkhai@...tuozzo.com>
>
> Yeah, sure!
> Many thanks for your kind help on this issue!
> I'll send out v3 later for further review.
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3c4eb750a199..841c902d4509 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1763,12 +1763,17 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
>>> return nr_pages;
>>> }
>>>
>>> +/*
>>> + * Release interrupts every TICK_PAGE_COUNT pages to let the tick timer run.
>>> + */
>>> +#define TICK_PAGE_COUNT (32 * 1024)
>>> +
>>> /* Initialise remaining memory on a node */
>>> static int __init deferred_init_memmap(void *data)
>>> {
>>> pg_data_t *pgdat = data;
>>> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>>> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
>>> + unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
>>> unsigned long first_init_pfn, flags;
>>> unsigned long start = jiffies;
>>> struct zone *zone;
>>> @@ -1779,6 +1784,7 @@ static int __init deferred_init_memmap(void *data)
>>> if (!cpumask_empty(cpumask))
>>> set_cpus_allowed_ptr(current, cpumask);
>>>
>>> +again:
>>> pgdat_resize_lock(pgdat, &flags);
>>> first_init_pfn = pgdat->first_deferred_pfn;
>>> if (first_init_pfn == ULONG_MAX) {
>>> @@ -1790,7 +1796,6 @@ static int __init deferred_init_memmap(void *data)
>>> /* Sanity check boundaries */
>>> BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
>>> BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
>>> - pgdat->first_deferred_pfn = ULONG_MAX;
>>>
>>> /* Only the highest zone is deferred so find it */
>>> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>>> @@ -1809,9 +1814,23 @@ static int __init deferred_init_memmap(void *data)
>>> * that we can avoid introducing any issues with the buddy
>>> * allocator.
>>> */
>>> - while (spfn < epfn)
>>> + while (spfn < epfn) {
>>> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>> + /*
>>> + * Release the interrupts every TICK_PAGE_COUNT pages
>>> + * (128MB) to give the tick timer a chance to advance
>>> + * jiffies.
>>> + */
>>> + if ((nr_pages - prev_nr_pages) > TICK_PAGE_COUNT) {
>>> + prev_nr_pages = nr_pages;
>>> + pgdat->first_deferred_pfn = spfn;
>>> + pgdat_resize_unlock(pgdat, &flags);
>>> + goto again;
>>> + }
>>> + }
>>> +
>>> zone_empty:
>>> + pgdat->first_deferred_pfn = ULONG_MAX;
>>> pgdat_resize_unlock(pgdat, &flags);
>>>
>>> /* Sanity check that the next zone really is unpopulated */
>>>
>