linux-kernel - Re: high latency in CFS when disable autogroup, bug or not?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Tue, 24 Oct 2017 09:48:51 +0800
From:   Li RongQing <roy.qing.li@...il.com>
To:     linux-kernel@...r.kernel.org
Subject: Re: high latency in CFS when disable autogroup, bug or not?

I find the root cause, the delayed process run a very long time
before, and its vruntime become very large, and must wait all process;
so it is delayed

the reason is below:

1. there is a low weight load process A (nice=19, weight=15)
2. there is a process B which is doing IO
3. there is a process C whose nice is 0

the running status likes below:

Step 1: when C is running, B wakeup, and preempt C;  B start to run
Step 2: when B sleeps, vruntime of A is min, but A cannot preempt
C(cfq->last); then C restore to run; then repeat Step 1;


A can not preempt B and C, since vruntime of B and C is not larger
that 4ms*1024/15 [sched_wakeup_granularity_ns *1024/wight of nice 19]


but this condition will block other all process(about 500 processes)
running in  4ms*1024/15;


so I think we should consider more when doing preempt

/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *se = __pick_first_entity(cfs_rq);
    struct sched_entity *left = se;

    /*
     * Avoid running the skip buddy, if running something else can
     * be done without getting too unfair.
     */
    if (cfs_rq->skip == se) {
        struct sched_entity *second = __pick_next_entity(se);
        if (second && wakeup_preempt_entity(second, left) < 1)
            se = second;
    }

    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    /*
     * Someone really wants this to run. If it's not unfair, run it.
     */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);

    return se;
}


On Fri, Sep 1, 2017 at 11:28 AM, Li RongQing <roy.qing.li@...il.com> wrote:
> I have a programe to test CFS latency or fairness on high load,
> the test shows CFS has high latency when disable autogroup
>
> the program is in
> https://gist.github.com/anonymous/af21ae289cfa5c6310eeac73b7a478ff.
> the programe creates many threads, every thread does thing like:
>
> do {
>     now =gettime()
>     loop to spend 10 microseconds
>     delta = gettime()-now
>
>     if (delta > max_latency) {
>          max_latency=delta
>          printf( max_latency);
>     }
> } while(1)
>
>
> I run this program on a machine with 48 processor,  this programe
> created 2000 thread,
> then every processor has about 40 pthread,  every thread should finish
> its 10ms computation in 400ms= 40pthread*10ms
> but test result show some thread takes very long time
>
> # ./pipe_test -n 2000
> eat cpu with 2000 threads, delay time 1000 ms
> 648 delay 1037 ms
> 48923 delay 1038 ms
> 1810 delay 1349 ms
> 49142 delay 1482 ms
> 1728 delay 1574 ms
> 1518 delay 1713 ms
> 808 delay 1714 ms
> 1702 delay 1733 ms
> 49004 delay 1783 ms
> 48821 delay 1785 ms
> 451 delay 1865 ms
> 990 delay 1910 ms
> 1626 delay 1957 ms
> 537 delay 2420 ms
> 2021 delay 3242 ms
> 763 delay 3488 ms
> 622 delay 3614 ms
> 1887 delay 4474 ms
> 1267 delay 4924 ms
> 721 delay 5406 ms
> 1810 delay 5873 ms
> 1470 delay 5969 ms
> 1702 delay 7834 ms
> 48821 delay 8063 ms
> ^C
>
> The kernel version is 4.9.23 and I disable autogroup;   if autogroup
> is enabled, no this issue
>
>
> ftrace result:
>
>            <...>-48821 [003] d... 64151.635476: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17118 next_prio=120
>            <...>-48821 [003] d... 64151.636531: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17118 next_prio=120
>            <...>-48821 [003] d... 64151.639570: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17275 next_prio=120
>            <...>-48821 [003] d... 64159.703051: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R+ ==>
> next_comm=kworker/u97:0 next_pid=36929 next_prio=120
>            <...>-48821 [003] d... 64159.703091: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=kworker/u97:0 next_pid=36929 next_prio=120
>            <...>-48821 [003] d... 64159.703978: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17275 next_prio=120
>            <...>-48821 [003] d... 64159.705542: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=16879 next_prio=120
>
>
> # grep sched_migrate_task trace|grep 48821
>            <...>-688   [019] d... 64161.828654: sched_migrate_task:
> comm=a.out pid=48821 prio=120 orig_cpu=3 dest_cpu=19
>            <...>-48821 [019] d... 64161.828862: sched_migrate_task:
> comm=a.out pid=49053 prio=120 orig_cpu=43 dest_cpu=19
> #