Date:	Sat, 3 Nov 2012 16:24:42 +0100
From:	Miroslav Kratochvil <exa.exa@...il.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org
Subject: Re: iptables/tc: page allocation failures question

Hi,

Thanks for the patch! I think it will fix the problem. I have patched one
of the production boxes and will see whether it breaks again; the failure
usually shows up after a day or two.

Anyway, more questions:

- the problem sometimes happens even when no big xt_recent allocations
are taking place (just TC/HFSC). Therefore:

  1] Is it possible that something similarly big gets allocated in
HFSC? I didn't actually find anything that would, so...

  2] Could fragmentation caused by kmalloc/kfree lead to similar problems?
(There are 10k filters + 10k classes + the filter hash table
infrastructure, and the management software keeps rewriting and
restructuring all of it...)

- is there a decent way to fix this without manually patching all
production kernels? A magic kernel parameter that would turn failing
kmalloc calls into vmalloc? A sysctl to prevent exhausting the memory?
Or, at least, something that would bring the failing machine's memory
back to a state other than "everything fails"?

Sorry for asking so many questions, but I feel it would be unwise to let
it behave this way... :]

Thanks,
-mk


On Sat, Nov 3, 2012 at 12:34 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> On Sat, 2012-11-03 at 11:27 +0100, Miroslav Kratochvil wrote:
>> Hello everyone,
>>
> >> I've got several Linux boxes that do mostly routing and traffic
> >> shaping. The load isn't dramatic: it's around 100Mbit of traffic
> >> shaped by an HFSC qdisc with ~10k classes/filters.
>>
>> Recently I started seeing messages like this in dmesg:
>>
>> iptables: page allocation failure: order:9, mode:0xc0d0
>>
>> tc: page allocation failure (....)
>>
>> (full messages are attached below)
>>
> >> I understand this means the kernel couldn't allocate memory while
> >> executing the given command; it is usually triggered by commands like
> >> 'tc class add' or 'iptables -A something'.
>>
> >> The boxes, on the other hand, still have plenty of free memory
> >> (allocations+buffers+cache take around 400MB of the 2GB available;
> >> swap is empty). I guess the problem is that the allocation is
> >> constrained by something (like GFP_ATOMIC, or being limited to low
> >> memory). Is this true? If so, is there some way to avoid such a
> >> constraint?
>>
> >> What also worries me is that once the box starts failing memory
> >> allocations, I've been unable to make it stop. Even if I delete all
> >> qdiscs/iptables entries, clear every cache I know about and restart
> >> most of userspace (which should free a good amount of memory),
> >> nothing can be added back.
>>
> >> I'm attaching the dmesg of the failure below. Could anyone comment on
> >> this, or point me to what can cause this behavior? Is there any better
> >> debug output that could clarify this?
>>
>> Thanks in advance,
>> Mirek Kratochvil
>
> You apparently load the xt_recent module with a big ip_list_tot value
> (the default is 100), so kzalloc() needs an order-9 page (2MB of
> contiguous RAM), and that allocation fails.
>
> I guess the following patch should solve your problem:
>
> diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
> index 4635c9b..ceebd8b 100644
> --- a/net/netfilter/xt_recent.c
> +++ b/net/netfilter/xt_recent.c
> @@ -29,6 +29,7 @@
>  #include <linux/skbuff.h>
>  #include <linux/inet.h>
>  #include <linux/slab.h>
> +#include <linux/vmalloc.h>
>  #include <net/net_namespace.h>
>  #include <net/netns/generic.h>
>
> @@ -310,6 +311,14 @@ out:
>         return ret;
>  }
>
> +static void recent_table_free(void *addr)
> +{
> +       if (is_vmalloc_addr(addr))
> +               vfree(addr);
> +       else
> +               kfree(addr);
> +}
> +
>  static int recent_mt_check(const struct xt_mtchk_param *par,
>                            const struct xt_recent_mtinfo_v1 *info)
>  {
> @@ -322,6 +331,7 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
>  #endif
>         unsigned int i;
>         int ret = -EINVAL;
> +       size_t sz;
>
>         if (unlikely(!hash_rnd_inited)) {
>                 get_random_bytes(&hash_rnd, sizeof(hash_rnd));
> @@ -360,8 +370,11 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
>                 goto out;
>         }
>
> -       t = kzalloc(sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size,
> -                   GFP_KERNEL);
> +       sz = sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size;
> +       if (sz <= PAGE_SIZE)
> +               t = kzalloc(sz, GFP_KERNEL);
> +       else
> +               t = vzalloc(sz);
>         if (t == NULL) {
>                 ret = -ENOMEM;
>                 goto out;
> @@ -377,14 +390,14 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
>         uid = make_kuid(&init_user_ns, ip_list_uid);
>         gid = make_kgid(&init_user_ns, ip_list_gid);
>         if (!uid_valid(uid) || !gid_valid(gid)) {
> -               kfree(t);
> +               recent_table_free(t);
>                 ret = -EINVAL;
>                 goto out;
>         }
>         pde = proc_create_data(t->name, ip_list_perms, recent_net->xt_recent,
>                   &recent_mt_fops, t);
>         if (pde == NULL) {
> -               kfree(t);
> +               recent_table_free(t);
>                 ret = -ENOMEM;
>                 goto out;
>         }
> @@ -434,7 +447,7 @@ static void recent_mt_destroy(const struct xt_mtdtor_param *par)
>                 remove_proc_entry(t->name, recent_net->xt_recent);
>  #endif
>                 recent_table_flush(t);
> -               kfree(t);
> +               recent_table_free(t);
>         }
>         mutex_unlock(&recent_mutex);
>  }
>
>