Date:	Mon, 20 Oct 2014 16:47:31 +0900
From:	Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>
To:	<mingo@...hat.com>, <peterz@...radead.org>
CC:	<kernellwp@...il.com>, <riel@...hat.com>, <tkhai@...dex.ru>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] sched/fair: Care divide error in update_task_scan_period()

Could you review this patch?

(2014/10/16 18:48), Yasuaki Ishimatsu wrote:
> While offlining a node by hot-removing memory, the following divide
> error occurs:
> 
>    divide error: 0000 [#1] SMP
>    [...]
>    Call Trace:
>     [...] handle_mm_fault
>     [...] ? try_to_wake_up
>     [...] ? wake_up_state
>     [...] __do_page_fault
>     [...] ? do_futex
>     [...] ? put_prev_entity
>     [...] ? __switch_to
>     [...] do_page_fault
>     [...] page_fault
>    [...]
>    RIP  [<ffffffff810a7081>] task_numa_fault
>     RSP <ffff88084eb2bcb0>
> 
> The issue occurs as follows:
>    1. When a page fault occurs and the page is allocated from node 1,
>       task_struct->numa_faults_buffer_memory[] of node 1 is
>       incremented, and p->numa_faults_locality[] is incremented as
>       well:
> 
>       o numa_faults_buffer_memory[]       o numa_faults_locality[]
>                NR_NUMA_HINT_FAULT_TYPES
>               |      0     |     1     |
>       ----------------------------------  ----------------------
>        node 0 |      0     |     0     |   remote |      0     |
>        node 1 |      0     |     1     |   local  |      1     |
>       ----------------------------------  ----------------------
> 
>    2. Node 1 is offlined by hot-removing memory.
> 
>    3. When the next page fault occurs, task_numa_placement() calculates
>       fault_types[] from p->numa_faults_buffer_memory[] of all online
>       nodes. Because node 1 went offline in step 2, only the
>       p->numa_faults_buffer_memory[] entries of node 0 are used, so
>       both elements of fault_types[] are 0.
> 
>    4. The fault_types[] values (both 0) are passed to
>       update_task_scan_period().
> 
>    5. numa_faults_locality[1] is 1, so the following division is
>       evaluated:
> 
>          static void update_task_scan_period(struct task_struct *p,
>                                  unsigned long shared, unsigned long private){
>          ...
>                  ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>          }
> 
>    6. But both private and shared are 0, so the divide error occurs
>       here.
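
Steps 3 to 6 can be reproduced in a small userspace sketch. This is only an illustration, not the kernel code: NR_NODES, NR_FAULT_TYPES, the online[] array and sum_online_faults() are hypothetical names, and the mapping of index 0 to shared and index 1 to private is an assumption taken from the order of the arguments to update_task_scan_period().

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES        2   /* hypothetical: the two-node machine from the example */
#define NR_FAULT_TYPES  2   /* assumed: index 0 = shared, index 1 = private */

/*
 * Sum the per-node fault counters over online nodes only, the way
 * step 3 describes. Counters that belong to an offline node are skipped,
 * so they never reach fault_types[].
 */
static void sum_online_faults(unsigned long faults[NR_NODES][NR_FAULT_TYPES],
                              const bool online[NR_NODES],
                              unsigned long fault_types[NR_FAULT_TYPES])
{
    for (size_t t = 0; t < NR_FAULT_TYPES; t++)
        fault_types[t] = 0;

    for (size_t nid = 0; nid < NR_NODES; nid++) {
        if (!online[nid])
            continue;   /* node was hot-removed: its faults are ignored */
        for (size_t t = 0; t < NR_FAULT_TYPES; t++)
            fault_types[t] += faults[nid][t];
    }
}
```

With the table from step 1 and node 1 marked offline, both sums come out as 0, so shared + private, the denominator in step 5, is 0 as well.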
> 
> The divide error is rare because it is triggered only by node
> offlining. With this patch, when both private and shared are 0, the
> denominator is set to 1 to avoid the divide error.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>
> CC: Wanpeng Li <kernellwp@...il.com>
> CC: Rik van Riel <riel@...hat.com>
> CC: Peter Zijlstra <peterz@...radead.org>
> ---
>   kernel/sched/fair.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..580fc74 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1466,6 +1466,7 @@ static void update_task_scan_period(struct task_struct *p,
> 
>   	unsigned long remote = p->numa_faults_locality[0];
>   	unsigned long local = p->numa_faults_locality[1];
> +	unsigned long total_faults = shared + private;
> 
>   	/*
>   	 * If there were no record hinting faults then either the task is
> @@ -1496,6 +1497,14 @@ static void update_task_scan_period(struct task_struct *p,
>   			slot = 1;
>   		diff = slot * period_slot;
>   	} else {
> +		/*
> +		 * This is a rare case. total_faults might become 0 after
> +		 * offlining node. In this case, total_faults is set to 1
> +		 * for avoiding divide error.
> +		 */
> +		if (unlikely(total_faults == 0))
> +			total_faults = 1;
> +
>   		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> 
>   		/*
> @@ -1506,7 +1515,7 @@ static void update_task_scan_period(struct task_struct *p,
>   		 * scanning faster if shared accesses dominate as it may
>   		 * simply bounce migrations uselessly
>   		 */
> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> +		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (total_faults));
>   		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>   	}
> 
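
The effect of the guard can be checked outside the kernel with a minimal sketch. The helper name private_ratio() is hypothetical, and the two macros are redefined locally on the assumption that they match the kernel's definitions (NUMA_PERIOD_SLOTS as 10, DIV_ROUND_UP as the usual round-up division).

```c
/* Local stand-ins, assumed to match the kernel's definitions. */
#define NUMA_PERIOD_SLOTS 10
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper mirroring the guarded division from the patch:
 * a zero denominator is replaced by 1 before dividing.
 */
static unsigned long private_ratio(unsigned long shared,
                                   unsigned long private_faults)
{
    unsigned long total_faults = shared + private_faults;

    /* Rare case: all recorded faults belonged to a node that went offline. */
    if (total_faults == 0)
        total_faults = 1;

    return DIV_ROUND_UP(private_faults * NUMA_PERIOD_SLOTS, total_faults);
}
```

For shared == 0 and private_faults == 0 the helper returns 0 instead of faulting; for any nonzero total the result is the same as with the unguarded expression.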

