Message-ID: <BANLkTinjqK6pkjYig-NyWuJ2s2Tq3AHdnw@mail.gmail.com>
Date: Wed, 11 May 2011 09:36:42 +0200
From: Stefan Majer <stefan.majer@...il.com>
To: Sage Weil <sage@...dream.net>
Cc: Yehuda Sadeh Weinraub <yehudasa@...il.com>,
linux-kernel@...r.kernel.org, ceph-devel@...r.kernel.org
Subject: Re: Kernel 2.6.38.6 page allocation failure (ixgbe)
Hi Sage,
After some digging we set
sysctl -w vm.min_free_kbytes=262144
(the default was around 16000).
This solved our problem, and rados bench survived a 5-minute torture run
without a single failure:
min lat: 0.036177 max lat: 299.924 avg lat: 0.553904
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 40 61736 61696 822.498 1312 299.602 0.553904
Total time run: 300.421378
Total writes made: 61736
Write size: 4194304
Bandwidth (MB/sec): 821.992
Average Latency: 0.621895
Max latency: 300.362
Min latency: 0.036177
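
To make the change persist across reboots we will probably drop it into
/etc/sysctl.conf, roughly like this (untested sketch, the exact file
location may differ per distro):

# echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf
# sysctl -p

and verify afterwards with sysctl vm.min_free_kbytes.
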
Sorry for the noise, but I think you should mention this sysctl
modification in the ceph wiki (at least for 10GbE deployments).
Thanks,
Stefan Majer
On Wed, May 11, 2011 at 8:58 AM, Stefan Majer <stefan.majer@...il.com> wrote:
> Hi Sage,
>
> we were running rados bench like this:
> # rados -p data bench 60 write -t 128
> Maintaining 128 concurrent writes of 4194304 bytes for at least 60 seconds.
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 0 0 0 0 0 0 - 0
> 1 128 296 168 671.847 672 0.051857 0.131839
> 2 127 537 410 819.838 968 0.052679 0.115476
> 3 128 772 644 858.516 936 0.043241 0.114372
> 4 128 943 815 814.865 684 0.799326 0.121142
> 5 128 1114 986 788.673 684 0.082748 0.13059
> 6 128 1428 1300 866.526 1256 0.065376 0.119083
> 7 127 1716 1589 907.859 1156 0.037958 0.11151
> 8 127 1986 1859 929.36 1080 0.063171 0.11077
> 9 128 2130 2002 889.645 572 0.048705 0.109477
> 10 127 2333 2206 882.269 816 0.062555 0.115842
> 11 127 2466 2339 850.419 532 0.051618 0.117356
> 12 128 2602 2474 824.545 540 0.06113 0.124453
> 13 128 2807 2679 824.187 820 0.075126 0.125108
> 14 127 2897 2770 791.312 364 0.077479 0.125009
> 15 127 2955 2828 754.023 232 0.084222 0.123814
> 16 127 2973 2846 711.393 72 0.078568 0.123562
> 17 127 2975 2848 670.011 8 0.923208 0.124123
>
> As you can see, the transfer rate suddenly drops to 8 MB/s and even to 0.
>
> Memory consumption during this is low:
>
> top - 08:52:24 up 18:12, 1 user, load average: 0.64, 3.35, 4.17
> Tasks: 203 total, 1 running, 202 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 24731008k total, 24550172k used, 180836k free, 79136k buffers
> Swap: 0k total, 0k used, 0k free, 22574812k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 22203 root 20 0 581m 284m 2232 S 0.0 1.2 0:44.34 cosd
> 21922 root 20 0 577m 281m 2148 S 0.0 1.2 0:39.91 cosd
> 22788 root 20 0 576m 213m 2084 S 0.0 0.9 0:44.10 cosd
> 22476 root 20 0 509m 204m 2156 S 0.0 0.8 0:33.92 cosd
>
> And after we hit this, ceph -w still reports a clean state, and all cosd are
> still running.
>
> We have no clue :-(
>
> Greetings
> Stefan Majer
>
>
> On Tue, May 10, 2011 at 6:06 PM, Stefan Majer <stefan.majer@...il.com> wrote:
>> Hi Sage,
>>
>>
>> On Tue, May 10, 2011 at 6:02 PM, Sage Weil <sage@...dream.net> wrote:
>>> Hi Stefan,
>>>
>>> On Tue, 10 May 2011, Stefan Majer wrote:
>>>> Hi,
>>>>
>>>> On Tue, May 10, 2011 at 4:20 PM, Yehuda Sadeh Weinraub
>>>> <yehudasa@...il.com> wrote:
>>>> > On Tue, May 10, 2011 at 7:04 AM, Stefan Majer <stefan.majer@...il.com> wrote:
>>>> >> Hi,
>>>> >>
>>>> >> I'm running 4 nodes with ceph on top of btrfs with a dual-port Intel
>>>> >> X520 10Gb Ethernet card with the latest 3.3.9 ixgbe driver.
>>>> >> During benchmarks I get the following stack trace.
>>>> >> I can easily reproduce this by simply running rados bench from a fast
>>>> >> machine, using these 4 nodes as the ceph cluster.
>>>> >> We saw this with the stock ixgbe driver from 2.6.38.6 and with the latest
>>>> >> 3.3.9 ixgbe.
>>>> >> This kernel is tainted because we use fusion-io iodrives as journal
>>>> >> devices for btrfs.
>>>> >>
>>>> >> Any hints to nail this down are welcome.
>>>> >>
>>>> >> Greetings Stefan Majer
>>>> >>
>>>> >> May 10 15:26:40 os02 kernel: [ 3652.485219] cosd: page allocation
>>>> >> failure. order:2, mode:0x4020
>>>> >
>>>> > It looks like the machine running the cosd is crashing; is that the case?
>>>>
>>>> No, the machine is still running. Even the cosd is still there.
>>>
>>> How much memory is (was?) cosd using? Is it possible for you to watch RSS
>>> under load when the errors trigger?
>>
>> I will look into this tomorrow.
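>> Probably with something like this to sample the cosd RSS every few
>> seconds while the benchmark runs (untested sketch):
>>
>> # while sleep 5; do date; ps -C cosd -o pid,rss,vsz,cmd; done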
>> Just for the record:
>> each machine has 24GB of RAM and 4 cosd, each with one btrfs-formatted
>> disk, which is a RAID5 over three 2TB spindles.
>>
>> The rados bench reaches a constant rate of about 1000 MB/sec!
>>
>> Greetings
>>
>> Stefan
>>> The osd throttles incoming client bandwidth, but it doesn't throttle
>>> inter-osd traffic yet because it's not obvious how to avoid deadlock.
>>> It's possible that one node is getting significantly behind the
>>> others on the replicated writes and that is blowing up its memory
>>> footprint. There are a few ways we can address that, but I'd like to make
>>> sure we understand the problem first.
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>>
>>>> > Are you also running the ceph kernel module on the same machine, by any
>>>> > chance? If not, it could be some other fs bug (e.g., the underlying
>>>> > btrfs). Also, the stack here is quite deep; there's a chance of a
>>>> > stack overflow.
>>>>
>>>> There is only the cosd running on these machines. We have 3 separate
>>>> mons and clients that use qemu-rbd.
>>>>
>>>>
>>>> > Thanks,
>>>> > Yehuda
>>>> >
>>>>
>>>>
>>>> Greetings
>>>> --
>>>> Stefan Majer
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Majer
>>
>
>
>
> --
> Stefan Majer
>
--
Stefan Majer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/