linux-kernel - Re: [PATCH] Wait to remove active timer when rescheduling hrtimer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <AANLkTinG5RmVERMKWFaJRNki+1NGbLLWudc2jtw=igPV@mail.gmail.com>
Date:	Fri, 21 Jan 2011 15:06:12 -0800
From:	Vernon Mauery <vernux@...ibm.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Wait to remove active timer when rescheduling hrtimer

On Fri, Jan 21, 2011 at 1:32 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>
>
> On Fri, 21 Jan 2011, Vernon Mauery wrote:
>
>> In the current hrtimer implementation, when explicitly
>> removing a timer, the hrtimer_cancel function waits
>> until the timer is not currently running before actually
>> calling remove_hrtimer.  But in the __hrtimer_start_range_ns
>> function (called by hrtimer_start, among others), it
>> calls remove_hrtimer without checking to see if the
>> timer is currently running.  This doesn't seem to cause
>> much trouble in the vanilla kernel, but in the -RT
>> kernel, this causes a silent hang when exercising the
>> sch_htb module on a 10GbE interface.
>>
>> Signed-off-by: Vernon Mauery <vernux@...ibm.com>
>> ---
>>  kernel/hrtimer.c |    5 +++++
>>  1 files changed, 5 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
>> index 0c8d7c0..c7e8ba0 100644
>> --- a/kernel/hrtimer.c
>> +++ b/kernel/hrtimer.c
>> @@ -935,6 +935,11 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>>       base = lock_hrtimer_base(timer, &flags);
>>
>>       /* Remove an active timer from the queue: */
>> +     while (hrtimer_callback_running(timer)) {
>> +             unlock_hrtimer_base(timer, &flags);
>> +             hrtimer_wait_for_timer(timer);
>> +             base = lock_hrtimer_base(timer, &flags);
>> +     }
>>       ret = remove_hrtimer(timer, base);
>>
>>       /* Switch the timer base, if necessary: */
>
> Hmm, why does it hang ?

It hangs when I set up QoS on a 10GbE interface and run traffic over
it.  Here is what I do:
#!/bin/bash

if [ -z "$1" ]; then
        ETH=eth2
else
        ETH="$1"
fi
IP=`ip addr show $ETH | grep 'inet\>' | sed 's/.*inet \([^/]*\)\/.*/\1/'`
IP=`echo $IP | sed 's/\([^.]*\)\.\([^.]*\)\.\([^.]*\)\.\([^.]*\)/\1.\2.\3.95/'`

SPEED=`ethtool $ETH | grep Speed | sed 's/[^0-9]*\([0-9]*\).*/\1/'`
case $SPEED in
   10000) ZEROS=00 ;;
    1000) ZEROS=0 ;;
 default) ZEROS='' ;;
esac

tc qdisc del dev $ETH root >&/dev/null || :
[ "$2" == "stop" ] && exit 0
set -x
tc qdisc add dev $ETH root handle 1: htb default 30 r2q 600$ZEROS
tc class add dev $ETH parent 1: classid 1:1 htb rate 30${ZEROS}mbit
tc class add dev $ETH parent 1:1 classid 1:10 htb rate 5${ZEROS}mbit prio 1
tc class add dev $ETH parent 1:1 classid 1:20 htb rate 5${ZEROS}mbit prio 2
tc class add dev $ETH parent 1:1 classid 1:30 htb rate 8${ZEROS}mbit
set +x

echo "netperf -l 2000 -H 10.1.1.94 -t UDP_STREAM -- -m 65505"
----------------------8< snip 8<-------------------------------

After running the script, I run a netperf command similar to the one
that the script prints, that runs traffic over the interface that we
just set up QoS on.  Within 10-15 seconds, the machine goes silent.
(this is on a 2.6.33.7-rt29 kernel)

My guess would be that without this patch, it is possible to have a
timer currently in the tree, not delete it and then schedule it again.
 Not on this machine or on this kernel, I saw a similar problem
(caused by the same thing: using sch_htb on -rt kernel), but instead
of a silent hang, it gave me an oops that looked like this:

Unable to handle kernel NULL pointer dereference at 0000000000000010
 RIP: [<ffffffff8113b38c>] rb_erase+0x1f3/0x2b1
PGD 14e150067 PUD 142d34067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 2
Modules linked in: sch_htb pktgen nfs nfsd lockd nfs_acl auth_rpcgss exportfs
ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6 autofs4 i2c_dev i2c_core hidp
rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath scsi_dh dm_mod video
output sbs sbshc battery ac parport_pc lp parport sg bnx2 button netxen_nic
serio_raw amd64_edac edac_core pcspkr shpchp mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 38, comm: sirq-hrtimer/2 Not tainted 2.6.24.7-139.el5rt #1
RIP: 0010:[<ffffffff8113b38c>]  [<ffffffff8113b38c>] rb_erase+0x1f3/0x2b1
RSP: 0018:ffff81014f16fe50  EFLAGS: 00010082
RAX: 0000000000000000 RBX: ffff81014640bac8 RCX: ffff810001085780
RDX: 0000000000000000 RSI: ffff8100010076a8 RDI: 0000000000000000
RBP: ffff81014f16fe60 R08: ffff81033f15dac8 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff8100010076a8
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000080
FS:  00007ff9960016e0(0000) GS:ffff81014fc09cc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000014e188000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sirq-hrtimer/2 (pid: 38, threadinfo ffff81014f16e000, task
ffff81014f16c300)
Stack:  ffff81013a5e67d0 ffff810001007698 ffff81014f16fe90 ffffffff81054dfc
 ffffffff81227401 ffff81013a5e67d0 ffff810001085640 0000000000000002
 ffff81014f16fec0 ffffffff81055cbb 0000000000000002 ffffffff815005e8
Call Trace:
 [<ffffffff81054dfc>] __remove_hrtimer+0x6e/0x7b
 [<ffffffff81227401>] ? qdisc_watchdog+0x0/0x23
 [<ffffffff81055cbb>] run_hrtimer_softirq+0x7a/0x14e
 [<ffffffff81043d26>] ksoftirqd+0x16a/0x26f
 [<ffffffff81043bbc>] ? ksoftirqd+0x0/0x26f
 [<ffffffff81043bbc>] ? ksoftirqd+0x0/0x26f
 [<ffffffff8105261c>] kthread+0x49/0x79
 [<ffffffff8100d088>] child_rip+0xa/0x12
 [<ffffffff810525d3>] ? kthread+0x0/0x79
 [<ffffffff8100d07e>] ? child_rip+0x0/0x12


My most common experience is to just experience a silent hang, which
allows me to do very little in the way of debugging -- unless I catch
it before it goes completely silent, not even sysrq works.  I have
managed to crash the machine on occasion, but the crash kernel never
seems to get very far before it dies off too... Very odd behavior.  I
tried to capture a sysrq+T, but it only gave me the names of the
processes, not the backtraces, and then the per CPU info.  I didn't
see anything that stuck out in the per-CPU info, but right after the
dump, the machine went completely silent again.

Sample from sysrq+T:

sshd          S ffff880417e65ea8     0  3621   3204 0x00000080

Call Trace:
bash          S ffff880417dd0200     0  3636   3621 0x00000080

Call Trace:
sshd          S 0000000000000000     0  3637   3204 0x00000080

Call Trace:
bash          S ffff880417ddc440     0  3662   3637 0x00000080

Call Trace:
top           S ffff88002838af48     0  3686   3662 0x00000080

Call Trace:
netperf       R  running task        0  3708   3636 0x00000080

Call Trace:
Sched Debug Version: v0.09, 2.6.33.7-rt29.35.el5rt #1
now at 134835.593881 msecs
  .jiffies                                 : 4294787328
  .sysctl_sched_latency                    : 20.000000
  .sysctl_sched_min_granularity            : 4.000000
  .sysctl_sched_wakeup_granularity         : 4.000000
  .sysctl_sched_child_runs_first           : 0.000000
  .sysctl_sched_features                   : 7917179
  .sysctl_sched_tunable_scaling            : 1 (logaritmic)

cpu#0, 2833.945 MHz
  .nr_running                    : 1
  .load                          : 0
  .nr_switches                   : 1030470
  .nr_load_updates               : 120033
  .nr_uninterruptible            : 141
  .next_balance                  : 4294.787429
  .curr->pid                     : 3708
  .clock                         : 122477.541494
  .cpu_load[0]                   : 0
  .cpu_load[1]                   : 4
  .cpu_load[2]                   : 56
  .cpu_load[3]                   : 122
  .cpu_load[4]                   : 168
  .rt.rt_nr_running              : 1
  .rt.rt_nr_uninterruptible      : 26920
  .rto_schedule                  : 534494
  .rto_schedule_tail             : 0
  .rto_wakeup                    : 0
  .rto_pulled                    : 0
  .rto_pushed                    : 0
  .yld_count                     : 0
  .sched_switch                  : 0
  .sched_count                   : 1070624
  .sched_goidle                  : 279923
  .avg_idle                      : 93170
  .ttwu_count                    : 640746
  .ttwu_local                    : 613469
  .bkl_count                     : 299

cfs_rq[0]:
  .exec_clock                    : 4267.411040
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 11634.164614
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 0
  .load                          : 0
  .nr_spread_over                : 12978

rt_rq[0]:
  .rt_nr_running                 : 1
  .rt_throttled                  : 0
  .rt_time                       : 7.163839
  .rt_runtime                    : 850.000000

runnable tasks:
            task   PID         tree-key  switches  prio
exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R        netperf  3708         0.000000    206988     9
0.000000      3237.342882      2155.844073



Any ideas on this?  The patch I sent fixes the problem.  The idea came
from a proposed patch a long time ago that just added a hrtimer_cancel
call just before the hrtimer_start call in the sch_api.c watchdog
code.  I figured if the explicit cancel before the start fixed the
problem, there was something that the cancel did that the start alone
didn't.  What I found was that the cancel waits if the timer is
currently running, while start just brute-force cancels the timer.

--Vernon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/