Message-ID: <BANLkTi=x+BEeGtpBe2reG4erNxyeZweAQA@mail.gmail.com>
Date: Wed, 11 May 2011 08:58:53 +0200
From: Stefan Majer <stefan.majer@...il.com>
To: Sage Weil <sage@...dream.net>
Cc: Yehuda Sadeh Weinraub <yehudasa@...il.com>,
linux-kernel@...r.kernel.org, ceph-devel@...r.kernel.org
Subject: Re: Kernel 2.6.38.6 page allocation failure (ixgbe)
Hi Sage,
we were running rados bench like this:
# rados -p data bench 60 write -t 128
Maintaining 128 concurrent writes of 4194304 bytes for at least 60 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1     128       296       168   671.847       672  0.051857  0.131839
     2     127       537       410   819.838       968  0.052679  0.115476
     3     128       772       644   858.516       936  0.043241  0.114372
     4     128       943       815   814.865       684  0.799326  0.121142
     5     128      1114       986   788.673       684  0.082748   0.13059
     6     128      1428      1300   866.526      1256  0.065376  0.119083
     7     127      1716      1589   907.859      1156  0.037958   0.11151
     8     127      1986      1859    929.36      1080  0.063171   0.11077
     9     128      2130      2002   889.645       572  0.048705  0.109477
    10     127      2333      2206   882.269       816  0.062555  0.115842
    11     127      2466      2339   850.419       532  0.051618  0.117356
    12     128      2602      2474   824.545       540   0.06113  0.124453
    13     128      2807      2679   824.187       820  0.075126  0.125108
    14     127      2897      2770   791.312       364  0.077479  0.125009
    15     127      2955      2828   754.023       232  0.084222  0.123814
    16     127      2973      2846   711.393        72  0.078568  0.123562
    17     127      2975      2848   670.011         8  0.923208  0.124123
As you can see, the transfer rate suddenly drops to 8 MB/s and even to 0.
Memory consumption during this is low:
top - 08:52:24 up 18:12, 1 user, load average: 0.64, 3.35, 4.17
Tasks: 203 total, 1 running, 202 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24731008k total, 24550172k used, 180836k free, 79136k buffers
Swap: 0k total, 0k used, 0k free, 22574812k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22203 root      20   0  581m 284m 2232 S  0.0  1.2  0:44.34  cosd
21922 root      20   0  577m 281m 2148 S  0.0  1.2  0:39.91  cosd
22788 root      20   0  576m 213m 2084 S  0.0  0.9  0:44.10  cosd
22476 root      20   0  509m 204m 2156 S  0.0  0.8  0:33.92  cosd
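If it helps, during the next run I can sample the cosd RSS continuously instead
of just taking one top snapshot, along these lines (plain ps loop, untested):

# while sleep 1; do date; ps -o pid,rss,vsz,pcpu,comm -C cosd; done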
And after we hit this, ceph -w still reports a clean state and all cosd
daemons are still running.
We have no clue :-(
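One more thing we could capture when the rate collapses, in case it is related
to the order:2 page allocation failures from my first mail, is the memory
fragmentation state on the nodes, e.g.:

# cat /proc/buddyinfo
# egrep 'MemFree|Dirty|Writeback' /proc/meminfo

(just an idea on our side, nothing captured yet)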
Greetings
Stefan Majer
On Tue, May 10, 2011 at 6:06 PM, Stefan Majer <stefan.majer@...il.com> wrote:
> Hi Sage,
>
>
> On Tue, May 10, 2011 at 6:02 PM, Sage Weil <sage@...dream.net> wrote:
>> Hi Stefan,
>>
>> On Tue, 10 May 2011, Stefan Majer wrote:
>>> Hi,
>>>
>>> On Tue, May 10, 2011 at 4:20 PM, Yehuda Sadeh Weinraub
>>> <yehudasa@...il.com> wrote:
>>> > On Tue, May 10, 2011 at 7:04 AM, Stefan Majer <stefan.majer@...il.com> wrote:
>>> >> Hi,
>>> >>
>>> >> I'm running 4 nodes with ceph on top of btrfs, each with a dual-port Intel
>>> >> X520 10Gb Ethernet card using the latest 3.3.9 ixgbe driver.
>>> >> During benchmarks I get the stack trace below.
>>> >> I can easily reproduce this by simply running rados bench from a fast
>>> >> machine against these 4 nodes as the ceph cluster.
>>> >> We saw this with the stock ixgbe driver from 2.6.38.6 and with the latest
>>> >> 3.3.9 ixgbe.
>>> >> This kernel is tainted because we use fusion-io iodrives as journal
>>> >> devices for btrfs.
>>> >>
>>> >> Any hints to nail this down are welcome.
>>> >>
>>> >> Greetings Stefan Majer
>>> >>
>>> >> May 10 15:26:40 os02 kernel: [ 3652.485219] cosd: page allocation
>>> >> failure. order:2, mode:0x4020
>>> >
>>> > It looks like the machine running the cosd is crashing, is that the case?
>>>
>>> No, the machine is still running. Even the cosd is still there.
>>
>> How much memory is (was?) cosd using? Is it possible for you to watch RSS
>> under load when the errors trigger?
>
> I will look on this tomorrow
> just for the record:
> each machine has 24GB of RAM and 4 cosd, each with 1 btrfs-formatted disk,
> which is a RAID5 over 3 2TB spindles.
>
> The rados bench reaches a constant rate of about 1000 MB/s!
>
> Greetings
>
> Stefan
>> The osd throttles incoming client bandwidth, but it doesn't throttle
>> inter-osd traffic yet because it's not obvious how to avoid deadlock.
>> It's possible that one node is getting significantly behind the
>> others on the replicated writes and that is blowing up its memory
>> footprint. There are a few ways we can address that, but I'd like to make
>> sure we understand the problem first.
>>
>> Thanks!
>> sage
>>
>>
>>
>>> > Are you running the ceph kernel module on the same machine as well, by any
>>> > chance? If not, it could be some other fs bug (e.g., in the underlying
>>> > btrfs). Also, the stack here is quite deep, so there's a chance of a
>>> > stack overflow.
>>>
>>> There is only the cosd running on these machines. We have 3 separate
>>> mons, and the clients use qemu-rbd.
>>>
>>>
>>> > Thanks,
>>> > Yehuda
>>> >
>>>
>>>
>>> Greetings
>>> --
>>> Stefan Majer
>>>
>>
>
>
>
> --
> Stefan Majer
>
--
Stefan Majer