Message-ID: <BANLkTinjqK6pkjYig-NyWuJ2s2Tq3AHdnw@mail.gmail.com>
Date: Wed, 11 May 2011 09:36:42 +0200
From: Stefan Majer <stefan.majer@...il.com>
To: Sage Weil <sage@...dream.net>
Cc: Yehuda Sadeh Weinraub <yehudasa@...il.com>,
linux-kernel@...r.kernel.org, ceph-devel@...r.kernel.org
Subject: Re: Kernel 2.6.38.6 page allocation failure (ixgbe)
Hi Sage,
After some digging we set
sysctl -w vm.min_free_kbytes=262144
(the default was around 16000).
This solved our problem, and rados bench survived a 5-minute torture run
without a single failure:
min lat: 0.036177 max lat: 299.924 avg lat: 0.553904
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 40 61736 61696 822.498 1312 299.602 0.553904
Total time run: 300.421378
Total writes made: 61736
Write size: 4194304
Bandwidth (MB/sec): 821.992
Average Latency: 0.621895
Max latency: 300.362
Min latency: 0.036177
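
To make the change persist across reboots we will probably drop it into
/etc/sysctl.conf, roughly like this (untested sketch, the exact file
location may differ per distro):

# echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf
# sysctl -p

and verify afterwards with sysctl vm.min_free_kbytes.
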
Sorry for the noise, but I think you should mention this sysctl
modification in the ceph wiki (at least for 10GbE deployments).
Thanks,
Stefan Majer
On Wed, May 11, 2011 at 8:58 AM, Stefan Majer <stefan.majer@...il.com> wrote:
> Hi Sage,
>
> we were running rados bench like this:
> # rados -p data bench 60 write -t 128
> Maintaining 128 concurrent writes of 4194304 bytes for at least 60 seconds.
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 0 0 0 0 0 0 - 0
> 1 128 296 168 671.847 672 0.051857 0.131839
> 2 127 537 410 819.838 968 0.052679 0.115476
> 3 128 772 644 858.516 936 0.043241 0.114372
> 4 128 943 815 814.865 684 0.799326 0.121142
> 5 128 1114 986 788.673 684 0.082748 0.13059
> 6 128 1428 1300 866.526 1256 0.065376 0.119083
> 7 127 1716 1589 907.859 1156 0.037958 0.11151
> 8 127 1986 1859 929.36 1080 0.063171 0.11077
> 9 128 2130 2002 889.645 572 0.048705 0.109477
> 10 127 2333 2206 882.269 816 0.062555 0.115842
> 11 127 2466 2339 850.419 532 0.051618 0.117356
> 12 128 2602 2474 824.545 540 0.06113 0.124453
> 13 128 2807 2679 824.187 820 0.075126 0.125108
> 14 127 2897 2770 791.312 364 0.077479 0.125009
> 15 127 2955 2828 754.023 232 0.084222 0.123814
> 16 127 2973 2846 711.393 72 0.078568 0.123562
> 17 127 2975 2848 670.011 8 0.923208 0.124123
>
> As you can see, the transfer rate suddenly drops to 8 MB/s and even to 0.
>
> Memory consumption during this is low:
>
> top - 08:52:24 up 18:12, 1 user, load average: 0.64, 3.35, 4.17
> Tasks: 203 total, 1 running, 202 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 24731008k total, 24550172k used, 180836k free, 79136k buffers
> Swap: 0k total, 0k used, 0k free, 22574812k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 22203 root 20 0 581m 284m 2232 S 0.0 1.2 0:44.34 cosd
> 21922 root 20 0 577m 281m 2148 S 0.0 1.2 0:39.91 cosd
> 22788 root 20 0 576m 213m 2084 S 0.0 0.9 0:44.10 cosd
> 22476 root 20 0 509m 204m 2156 S 0.0 0.8 0:33.92 cosd
>
> And after we hit this, ceph -w still reports a clean state, and all cosd are
> still running.
>
> We have no clue :-(
>
> Greetings
> Stefan Majer
>
>
> On Tue, May 10, 2011 at 6:06 PM, Stefan Majer <stefan.majer@...il.com> wrote:
>> Hi Sage,
>>
>>
>> On Tue, May 10, 2011 at 6:02 PM, Sage Weil <sage@...dream.net> wrote:
>>> Hi Stefan,
>>>
>>> On Tue, 10 May 2011, Stefan Majer wrote:
>>>> Hi,
>>>>
>>>> On Tue, May 10, 2011 at 4:20 PM, Yehuda Sadeh Weinraub
>>>> <yehudasa@...il.com> wrote:
>>>> > On Tue, May 10, 2011 at 7:04 AM, Stefan Majer <stefan.majer@...il.com> wrote:
>>>> >> Hi,
>>>> >>
>>>> >> I'm running 4 nodes with ceph on top of btrfs with a dual-port Intel
>>>> >> X520 10Gb Ethernet card with the latest 3.3.9 ixgbe driver.
>>>> >> During benchmarks I get the following stack trace.
>>>> >> I can easily reproduce this by simply running rados bench from a fast
>>>> >> machine, using these 4 nodes as the ceph cluster.
>>>> >> We saw this with the stock ixgbe driver from 2.6.38.6 and with the latest
>>>> >> 3.3.9 ixgbe.
>>>> >> This kernel is tainted because we use fusion-io iodrives as journal
>>>> >> devices for btrfs.
>>>> >>
>>>> >> Any hints to nail this down are welcome.
>>>> >>
>>>> >> Greetings Stefan Majer
>>>> >>
>>>> >> May 10 15:26:40 os02 kernel: [ 3652.485219] cosd: page allocation
>>>> >> failure. order:2, mode:0x4020
>>>> >
>>>> > It looks like the machine running the cosd is crashing; is that the case?
>>>>
>>>> No, the machine is still running. Even the cosd is still there.
>>>
>>> How much memory is (was?) cosd using? Is it possible for you to watch RSS
>>> under load when the errors trigger?
>>
>> I will look into this tomorrow.
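>> Probably with something like this to sample the cosd RSS every few
>> seconds while the benchmark runs (untested sketch):
>>
>> # while sleep 5; do date; ps -C cosd -o pid,rss,vsz,cmd; done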
>> Just for the record:
>> each machine has 24GB of RAM and 4 cosd, each with one btrfs-formatted
>> disk, which is a RAID5 over three 2TB spindles.
>>
>> The rados bench reaches a constant rate of about 1000 MB/sec!
>>
>> Greetings
>>
>> Stefan
>>> The osd throttles incoming client bandwidth, but it doesn't throttle
>>> inter-osd traffic yet because it's not obvious how to avoid deadlock.
>>> It's possible that one node is getting significantly behind the
>>> others on the replicated writes and that is blowing up its memory
>>> footprint. There are a few ways we can address that, but I'd like to make
>>> sure we understand the problem first.
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>>
>>>> > Are you also running the ceph kernel module on the same machine, by any
>>>> > chance? If not, it could be some other fs bug (e.g., the underlying
>>>> > btrfs). Also, the stack here is quite deep; there's a chance of a
>>>> > stack overflow.
>>>>
>>>> There is only the cosd running on these machines. We have 3 separate
>>>> mons and clients that use qemu-rbd.
>>>>
>>>>
>>>> > Thanks,
>>>> > Yehuda
>>>> >
>>>>
>>>>
>>>> Greetings
>>>> --
>>>> Stefan Majer
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Majer
>>
>
>
>
> --
> Stefan Majer
>
--
Stefan Majer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/