Message-ID: <54361E7E.5090006@jp.fujitsu.com>
Date: Thu, 9 Oct 2014 14:34:54 +0900
From: Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>
To: Wanpeng Li <kernellwp@...il.com>, <mingo@...hat.com>,
<peterz@...radead.org>
CC: <linux-kernel@...r.kernel.org>, <riel@...hat.com>,
<tkhai@...dex.ru>
Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
(2014/10/08 20:51), Wanpeng Li wrote:
>
> On 10/8/14, 2:43 PM, Yasuaki Ishimatsu wrote:
>> While offlining a node by hot-removing memory, the following divide
>> error occurs:
>>
>> divide error: 0000 [#1] SMP
>> [...]
>> Call Trace:
>> [...] handle_mm_fault
>> [...] ? try_to_wake_up
>> [...] ? wake_up_state
>> [...] __do_page_fault
>> [...] ? do_futex
>> [...] ? put_prev_entity
>> [...] ? __switch_to
>> [...] do_page_fault
>> [...] page_fault
>> [...]
>> RIP [<ffffffff810a7081>] task_numa_fault
>> RSP <ffff88084eb2bcb0>
>>
>> The issue occurs as follows:
>> 1. When a page fault occurs and the page is allocated from node 1,
>>    the node 1 entry of task_struct->numa_faults_buffer_memory[] is
>>    incremented and p->numa_faults_locality[] is also incremented,
>>    as follows:
>>
>>      o numa_faults_buffer_memory[]              o numa_faults_locality[]
>>                NR_NUMA_HINT_FAULT_TYPES
>>               |     0     |     1     |
>>      ----------------------------------         ----------------------
>>      node 0   |     0     |     0     |         | remote |    0    |
>>      node 1   |     0     |     1     |         | local  |    1    |
>>      ----------------------------------         ----------------------
>>
>> 2. node 1 is offlined by hot removing memory.
>>
>> 3. When the next page fault occurs, fault_types[] is calculated in
>>    task_numa_placement() from p->numa_faults_buffer_memory[] of all
>>    online nodes. But node 1 was offlined in step 2, so fault_types[]
>>    is calculated from p->numa_faults_buffer_memory[] of node 0 only,
>>    and both entries of fault_types[] end up 0.
>>
>> 4. The values (0) of fault_types[] are passed to
>>    update_task_scan_period().
>>
>> 5. numa_faults_locality[1] is 1, so the following division is
>>    performed:
>>
>>    static void update_task_scan_period(struct task_struct *p,
>>                        unsigned long shared, unsigned long private)
>>    {
>>        ...
>>        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>>    }
>>
>> 6. But both private and shared are 0, so a divide error occurs here.
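>>
>>    (For illustration: DIV_ROUND_UP(n, d) is defined in
>>    include/linux/kernel.h as (((n) + (d) - 1) / (d)), so with
>>    private == 0 and shared == 0 the statement above expands to
>>
>>        ratio = ((0 * NUMA_PERIOD_SLOTS) + (0 + 0) - 1) / (0 + 0);
>>
>>    an integer division by zero, which raises the #DE exception seen
>>    in the trace above.)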
>>
>> The divide error is a rare case because the trigger is node offlining.
>> With this patch, when both private and shared are 0, diff is simply
>> set to 0 and the division is not performed.
>>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>
>> ---
>> kernel/sched/fair.c | 30 +++++++++++++++++++-----------
>> 1 file changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bfa3c86..fb7dc3f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>> slot = 1;
>> diff = slot * period_slot;
>> } else {
>> - diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>> + if (unlikely((private + shared) == 0))
>> + /*
>> + * This is a rare case. The trigger is node offline.
>> + */
>> + diff = 0;
>> + else {
>> + diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>>
>> - /*
>> - * Scale scan rate increases based on sharing. There is an
>> - * inverse relationship between the degree of sharing and
>> - * the adjustment made to the scanning period. Broadly
>> - * speaking the intent is that there is little point
>> - * scanning faster if shared accesses dominate as it may
>> - * simply bounce migrations uselessly
>> - */
>> - ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>> - diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> + /*
>> + * Scale scan rate increases based on sharing. There is
>> + * an inverse relationship between the degree of sharing
>> + * and the adjustment made to the scanning period.
>> + * Broadly speaking the intent is that there is little
>> + * point scanning faster if shared accesses dominate as
>> + * it may simply bounce migrations uselessly
>> + */
>> + ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
>> + (private + shared));
>> + diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> + }
>> }
>
> How about just
>
> ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
Thank you for providing sample code. Rik also provided another idea,
so I am confirming which one is better.
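
For reference, here is a minimal user-space sketch (not kernel code;
DIV_ROUND_UP and NUMA_PERIOD_SLOTS are local stand-ins, and ratio_skip()
/ ratio_bias() are hypothetical helpers named only for this comparison)
of how the two proposals behave when both fault counts are zero:

    #include <stdio.h>

    /* Local stand-ins for the kernel definitions, for illustration only. */
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
    #define NUMA_PERIOD_SLOTS   10

    /* Approach 1 (this patch): skip the division when both counts are 0. */
    static unsigned long ratio_skip(unsigned long private, unsigned long shared)
    {
            if (private + shared == 0)
                    return 0;
            return DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared);
    }

    /* Approach 2 (suggestion above): bias the denominator by one. */
    static unsigned long ratio_bias(unsigned long private, unsigned long shared)
    {
            return DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared + 1);
    }

    int main(void)
    {
            /* Both avoid the 0/0 division that crashes the unpatched code. */
            printf("skip(0,0) = %lu\n", ratio_skip(0, 0));  /* 0 */
            printf("bias(0,0) = %lu\n", ratio_bias(0, 0));  /* 0 */

            /* With non-zero counts, the extra +1 changes the rounding. */
            printf("skip(1,0) = %lu\n", ratio_skip(1, 0));  /* 10 */
            printf("bias(1,0) = %lu\n", ratio_bias(1, 0));  /* 5 */
            return 0;
    }

Both return 0 for the zero-count case; they differ only in how ratio
rounds when the counts are non-zero, because of the extra +1 in the
denominator.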
Thanks,
Yasuaki Ishimatsu
>
>
> Regards,
> Wanpeng Li
>
>> p->numa_scan_period = clamp(p->numa_scan_period + diff,
>