Message-ID:
 <BYAPR12MB3205A7903D8EF06EFF8F575AD5DF2@BYAPR12MB3205.namprd12.prod.outlook.com>
Date: Mon, 17 Mar 2025 09:41:35 +0000
From: Stephen Eta Zhou <stephen.eta.zhou@...look.com>
To: David Disseldorp <ddiss@...e.de>
CC: "jsperbeck@...gle.com" <jsperbeck@...gle.com>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "gregkh@...uxfoundation.org"
	<gregkh@...uxfoundation.org>, "lukas@...ner.de" <lukas@...ner.de>,
	"wufan@...ux.microsoft.com" <wufan@...ux.microsoft.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: [RFC PATCH] initramfs: Add size validation to prevent tmpfs
 exhaustion

Hi David,

> On Fri, 14 Mar 2025 05:04:58 +0000, Stephen Eta Zhou wrote:
>
> > From 3499daeb5caf934f08a485027b5411f9ef82d6be Mon Sep 17 00:00:00 2001
> > From: Stephen Eta Zhou <stephen.eta.zhou@...look.com>
> > Date: Fri, 14 Mar 2025 12:32:59 +0800
> > Subject: [PATCH] initramfs: Add size validation to prevent tmpfs exhaustion
> >
> > When initramfs is loaded into a small memory environment, if its size
> > exceeds the tmpfs max blocks limit, the loading will fail. Additionally,
> > if the required blocks are close to the tmpfs max blocks boundary,
> > subsequent drivers or subsystems using tmpfs may fail to initialize.
> >
> > To prevent this, the size limit is set to half of tmpfs max blocks.
> > This ensures that initramfs can complete its mission without exhausting
> > tmpfs resources, as user-space programs may also rely on tmpfs after boot.
> >
> > This patch adds a validation mechanism to check the decompressed size
> > of initramfs based on its compression type and ratio. If the required
> > blocks exceed half of the tmpfs max blocks limit, the loading will be
> > aborted with an appropriate error message, exposing the issue early
> > and preventing further escalation.



> This behaviour appears fragile and quite arbitrary. I don't think
> initramfs should be responsible for making any of these decisions.
>
> Why can't the init binary make the decision of whether or not the amount
> of free memory remaining is sufficient for user-space, instead of this
> magic 50% limit?
>
> What are you trying to achieve by failing in this way before initramfs
> extraction instead of during / after?

Before the init process runs, the initramfs has to be decompressed into tmpfs to become the root file system (rootfs). If tmpfs space is insufficient after decompression, init may not be able to run at all, causing the system to crash or panic.

Letting the init process decide whether memory is sufficient means the initramfs must be decompressed first; by that point tmpfs may already be full, leaving the whole system unusable instead of failing through a controlled error path.

This problem is more obvious in extreme cases, for example:

1. After initramfs is decompressed, only a small amount of tmpfs space is left, so early-user-space tasks such as mount and udevadm fail, breaking device initialization.

2. On embedded devices, tmpfs is usually configured small, and the space shortage only shows up after decompression, leading directly to boot failure.

The check is performed before decompression in order to expose the problem early and avoid the passive failure mode of running out of space after decompression. Calculating the theoretically required tmpfs resources up front also avoids unnecessary I/O and produces a clearer error report, helping users adjust the initramfs size or the tmpfs configuration. My idea is to surface problems as early as possible; failures at runtime are harder to troubleshoot and carry unnecessary risk.

As for why 50% of the available tmpfs space is reserved: initramfs is only a temporary root file system during the boot phase, and tmpfs still needs to provide enough space for user-space processes, such as udev device management, the console, serial drivers, logs, and other user-mode runtime data. If initramfs is too large and occupies too much tmpfs space, the system may reach user space successfully and then still fail through tmpfs exhaustion. 50% is a conservative but reasonable upper limit, ensuring that enough tmpfs resources remain to support early user space after initramfs is decompressed.

Of course, the 50% figure can be adjusted; this is a patch for discussion. If the community is willing to accept it, a config option can be added to make the threshold configurable.
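For example, the threshold could become a Kconfig knob instead of a hard-coded 50% (option name and wording below are invented for illustration only):

```kconfig
config INITRAMFS_TMPFS_RESERVE_PERCENT
	int "Percentage of tmpfs blocks reserved for early user space"
	depends on TMPFS
	range 0 90
	default 50
	help
	  Abort initramfs extraction if the estimated decompressed size
	  would leave less than this percentage of the tmpfs block limit
	  free for early-user-space tasks (udev, mounts, logging).
```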


> > Signed-off-by: Stephen Eta Zhou <stephen.eta.zhou@...look.com>
> > ---
> >  init/initramfs.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 162 insertions(+)
> >
> > diff --git a/init/initramfs.c b/init/initramfs.c
> > index b2f7583bb1f5..dadda0a42b48 100644
> > --- a/init/initramfs.c
> > +++ b/init/initramfs.c
> > @@ -497,6 +497,157 @@ static unsigned long my_inptr __initdata; /* index of next byte to be processed
> >  
> >  #include <linux/decompress/generic.h>
> >  
> > +#ifdef CONFIG_TMPFS
> > +/*
> > + * struct compress_info - Describes a compression method.
> > + *
> > + * @magic: Magic numbers to identify the compression method (e.g., GZIP, XZ, etc.).
> > + *         Each magic number is a byte array of maximum length 256.
> > + *         The first dimension (2) represents the number of possible magic numbers.
> > + * @rate: Compression ratio, calculated as R = (compressed size / original size) * 100.
> > + *        The value is in percentage (0-100).
> > + * @mark: Name of the compression scheme (e.g., "GZIP", "XZ").
> > + * @len: Length of each magic byte array. Used for comparison with memcmp.
> > + *       The first dimension (2) corresponds to the number of magic numbers.
> > + * @magic_max: Maximum number of magic numbers supported (used when multiple magics are possible).
> > + */
> > +struct compress_info {
> > +     unsigned char magic[2][256];
> > +     unsigned long rate;
> > +     char *mark;
> > +     size_t len[2];
> > +     size_t magic_max;
> > +};



> initramfs doesn't have much knowledge of underlying compression details.
> This seems like a pretty significant layering violation...

I wrote this part of the code to demonstrate that the patch works. Later we could instead derive the size from the decompress code, or record it in a variable at build time so that initramfs can read it directly without any calculation.

> > +
> > +static struct compress_info cfm[] __initdata = {
> > +     {
> > +           .mark = "Gzip",
> > +           .magic = { { 0x1F, 0x8B } },
> > +           .len = { 2 },
> > +           .rate = 43,
> > +           .magic_max = 1,
> > +     },
> > +     {
> > +           .mark = "Bzip2",
> > +           .magic = { { 0x42, 0x5A, 0x68 } },
> > +           .len = { 3 },
> > +           .rate = 22,
> > +           .magic_max = 1,
> > +     },
> > +     {
> > +           .mark = "LZMA",
> > +           .magic = { { 0x5D, 0x00, 0x00 }, { 0xFF, 0x5D, 0x00 } },
> > +           .len = { 3, 3 },
> > +           .rate = 5,
> > +           .magic_max = 2,
> > +     },
> > +     {
> > +           .mark = "XZ",
> > +           .magic = { { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 } },
> > +           .len = { 6 },
> > +           .rate = 7,
> > +           .magic_max = 1,
> > +     },
> > +     {
> > +           .mark = "LZO",
> > +           .magic = { { 0x89, 0x4C, 0x5A, 0x4F, 0x00, 0x0D, 0x0A, 0x1A, 0x0A } },
> > +           .len = { 9 },
> > +           .rate = 47,
> > +           .magic_max = 1,
> > +     },
> > +     {
> > +           .mark = "LZ4",
> > +           .magic = {
> > +                             { 0x04, 0x22, 0x4D, 0x18 },
> > +                             { 0x02, 0x21, 0x4C, 0x18 }
> > +                        },
> > +           .len = { 4 },
> > +           .rate = 52,
> > +           .magic_max = 2,
> > +     },
> > +     {
> > +           .mark = "ZSTD",
> > +           .magic = { { 0x28, 0xB5, 0x2F, 0xFD } },
> > +           .len = { 4 },
> > +           .rate = 7,
> > +           .magic_max = 1,
> > +     },
> > +     {
> > +           .mark = "None",
> > +           .magic = {
> > +                             { 0x30, 0x37, 0x30, 0x37, 0x30, 0x31 },
> > +                             { 0x30, 0x37, 0x30, 0x37, 0x30, 0x32 }
> > +                        },
> > +           .len = { 6, 6 },
> > +           .rate = 0,



> This will trigger a divide by zero below.

Thanks for catching that; if this patch is worth continuing, I will fix it in v2.

> > +           .magic_max = 2,
> > +     },
> > +};

> > +
> > +static int __init validate_rootfs_size(char *buf, unsigned long len)
> > +{
> > +     unsigned long i, j, result, quotient, half_tmpfs_blocks;
> > +
> > +     /*
> > +      * Calculate how many blocks are needed to decompress
> > +      * and check if they are within a reasonable range.
> > +      */
> > +     for (i = 0; i < ARRAY_SIZE(cfm); ++i) {
> > +           for (j = 0; j < cfm[i].magic_max; ++j) {
> > +                 if (memcmp(buf, cfm[i].magic[j], cfm[i].len[j]) == 0) {
> > +                       pr_debug("Compression method: %s\n", cfm[i].mark);
> > +                       /*
> > +                        * The calculation is divided into three steps:
> > +                        * 1. Calculate the decompressed size based on the ratio.
> > +                        * 2. Check for potential overflow risks and ensure that
> > +                        *    the temporary decompressed
> > +                        *    initramfs does not exceed the maximum range of 2^(32/64),
> > +                        *    ensuring that the initramfs size does not approach the
> > +                        *    memory addressing limit (this cannot be fully guaranteed).
> > +                        * 3. Determine whether the required page size exceeds 1/4 of
> > +                        *    the total memory pages, restricting it from using excessively
> > +                        *    large amounts of memory pages.
> > +                        *
> > +                        * Note1: Here, `len` cannot be directly multiplied by 100,
> > +                        *        as it may cause overflow.
> > +                        *        Dividing by `rate` first and then multiplying by 100
> > +                        *        can effectively reduce the risk of overflow.
> > +                        *
> > +                        * Note2: Due to integer division and rounding,
> > +                        *        the calculated size may deviate by a few MB.
> > +                        */
> > +                       quotient = len / cfm[i].rate;
> > +
> > +                       if (quotient > ULONG_MAX / 100)
> > +                             goto err_overflow;
> > +                       else
> > +                             result = (quotient * 100) / PAGE_SIZE;
> > +
> > +                       /*
> > +                        * totalram_pages() / 2 = tmpfs max blocks
> > +                        */
> > +                       half_tmpfs_blocks = (totalram_pages() / 2) / 2;
> > +                       if (result > half_tmpfs_blocks)
> > +                             goto err_nomem;



> See Documentation/driver-api/early-userspace/buffer-format.rst .
> Initramfs images can be made up of several concatenated cpio archives,
> which would throw off these calculations.

Thank you for the reminder. As I said before, if the community is willing to discuss and accept this patch, I will add scanning logic for concatenated archives in v2.


Thanks,
Stephen
________________________________________
From: David Disseldorp
Sent: Monday, March 17, 2025 15:21
To: Stephen Eta Zhou
Cc: jsperbeck@...gle.com; akpm@...ux-foundation.org; gregkh@...uxfoundation.org; lukas@...ner.de; wufan@...ux.microsoft.com; linux-kernel@...r.kernel.org; linux-fsdevel@...r.kernel.org
Subject: Re: [RFC PATCH] initramfs: Add size validation to prevent tmpfs exhaustion


[cc'ing fsdevel]

Hi,

On Fri, 14 Mar 2025 05:04:58 +0000, Stephen Eta Zhou wrote:

> From 3499daeb5caf934f08a485027b5411f9ef82d6be Mon Sep 17 00:00:00 2001
> From: Stephen Eta Zhou <stephen.eta.zhou@...look.com>
> Date: Fri, 14 Mar 2025 12:32:59 +0800
> Subject: [PATCH] initramfs: Add size validation to prevent tmpfs exhaustion
>
> When initramfs is loaded into a small memory environment, if its size
> exceeds the tmpfs max blocks limit, the loading will fail. Additionally,
> if the required blocks are close to the tmpfs max blocks boundary,
> subsequent drivers or subsystems using tmpfs may fail to initialize.
>
> To prevent this, the size limit is set to half of tmpfs max blocks.
> This ensures that initramfs can complete its mission without exhausting
> tmpfs resources, as user-space programs may also rely on tmpfs after boot.
>
> This patch adds a validation mechanism to check the decompressed size
> of initramfs based on its compression type and ratio. If the required
> blocks exceed half of the tmpfs max blocks limit, the loading will be
> aborted with an appropriate error message, exposing the issue early
> and preventing further escalation.



This behaviour appears fragile and quite arbitrary. I don't think
initramfs should be responsible for making any of these decisions.

Why can't the init binary make the decision of whether or not the amount
of free memory remaining is sufficient for user-space, instead of this
magic 50% limit?

What are you trying to achieve by failing in this way before initramfs
extraction instead of during / after?



> Signed-off-by: Stephen Eta Zhou <stephen.eta.zhou@...look.com>
> ---
>  init/initramfs.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 162 insertions(+)
>
> diff --git a/init/initramfs.c b/init/initramfs.c
> index b2f7583bb1f5..dadda0a42b48 100644
> --- a/init/initramfs.c
> +++ b/init/initramfs.c
> @@ -497,6 +497,157 @@ static unsigned long my_inptr __initdata; /* index of next byte to be processed
>  
>  #include <linux/decompress/generic.h>
>  
> +#ifdef CONFIG_TMPFS
> +/*
> + * struct compress_info - Describes a compression method.
> + *
> + * @magic: Magic numbers to identify the compression method (e.g., GZIP, XZ, etc.).
> + *         Each magic number is a byte array of maximum length 256.
> + *         The first dimension (2) represents the number of possible magic numbers.
> + * @rate: Compression ratio, calculated as R = (compressed size / original size) * 100.
> + *        The value is in percentage (0-100).
> + * @mark: Name of the compression scheme (e.g., "GZIP", "XZ").
> + * @len: Length of each magic byte array. Used for comparison with memcmp.
> + *       The first dimension (2) corresponds to the number of magic numbers.
> + * @magic_max: Maximum number of magic numbers supported (used when multiple magics are possible).
> + */
> +struct compress_info {
> +     unsigned char magic[2][256];
> +     unsigned long rate;
> +     char *mark;
> +     size_t len[2];
> +     size_t magic_max;
> +};



initramfs doesn't have much knowledge of underlying compression details.
This seems like a pretty significant layering violation...



> +
> +static struct compress_info cfm[] __initdata = {
> +     {
> +           .mark = "Gzip",
> +           .magic = { { 0x1F, 0x8B } },
> +           .len = { 2 },
> +           .rate = 43,
> +           .magic_max = 1,
> +     },
> +     {
> +           .mark = "Bzip2",
> +           .magic = { { 0x42, 0x5A, 0x68 } },
> +           .len = { 3 },
> +           .rate = 22,
> +           .magic_max = 1,
> +     },
> +     {
> +           .mark = "LZMA",
> +           .magic = { { 0x5D, 0x00, 0x00 }, { 0xFF, 0x5D, 0x00 } },
> +           .len = { 3, 3 },
> +           .rate = 5,
> +           .magic_max = 2,
> +     },
> +     {
> +           .mark = "XZ",
> +           .magic = { { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 } },
> +           .len = { 6 },
> +           .rate = 7,
> +           .magic_max = 1,
> +     },
> +     {
> +           .mark = "LZO",
> +           .magic = { { 0x89, 0x4C, 0x5A, 0x4F, 0x00, 0x0D, 0x0A, 0x1A, 0x0A } },
> +           .len = { 9 },
> +           .rate = 47,
> +           .magic_max = 1,
> +     },
> +     {
> +           .mark = "LZ4",
> +           .magic = {
> +                             { 0x04, 0x22, 0x4D, 0x18 },
> +                             { 0x02, 0x21, 0x4C, 0x18 }
> +                        },
> +           .len = { 4 },
> +           .rate = 52,
> +           .magic_max = 2,
> +     },
> +     {
> +           .mark = "ZSTD",
> +           .magic = { { 0x28, 0xB5, 0x2F, 0xFD } },
> +           .len = { 4 },
> +           .rate = 7,
> +           .magic_max = 1,
> +     },
> +     {
> +           .mark = "None",
> +           .magic = {
> +                             { 0x30, 0x37, 0x30, 0x37, 0x30, 0x31 },
> +                             { 0x30, 0x37, 0x30, 0x37, 0x30, 0x32 }
> +                        },
> +           .len = { 6, 6 },
> +           .rate = 0,



This will trigger a divide by zero below.



> +           .magic_max = 2,
> +     },
> +};
> +
> +static int __init validate_rootfs_size(char *buf, unsigned long len)
> +{
> +     unsigned long i, j, result, quotient, half_tmpfs_blocks;
> +
> +     /*
> +      * Calculate how many blocks are needed to decompress
> +      * and check if they are within a reasonable range.
> +      */
> +     for (i = 0; i < ARRAY_SIZE(cfm); ++i) {
> +           for (j = 0; j < cfm[i].magic_max; ++j) {
> +                 if (memcmp(buf, cfm[i].magic[j], cfm[i].len[j]) == 0) {
> +                       pr_debug("Compression method: %s\n", cfm[i].mark);
> +                       /*
> +                        * The calculation is divided into three steps:
> +                        * 1. Calculate the decompressed size based on the ratio.
> +                        * 2. Check for potential overflow risks and ensure that
> +                        *    the temporary decompressed
> +                        *    initramfs does not exceed the maximum range of 2^(32/64),
> +                        *    ensuring that the initramfs size does not approach the
> +                        *    memory addressing limit (this cannot be fully guaranteed).
> +                        * 3. Determine whether the required page size exceeds 1/4 of
> +                        *    the total memory pages, restricting it from using excessively
> +                        *    large amounts of memory pages.
> +                        *
> +                        * Note1: Here, `len` cannot be directly multiplied by 100,
> +                        *        as it may cause overflow.
> +                        *        Dividing by `rate` first and then multiplying by 100
> +                        *        can effectively reduce the risk of overflow.
> +                        *
> +                        * Note2: Due to integer division and rounding,
> +                        *        the calculated size may deviate by a few MB.
> +                        */
> +                       quotient = len / cfm[i].rate;
> +
> +                       if (quotient > ULONG_MAX / 100)
> +                             goto err_overflow;
> +                       else
> +                             result = (quotient * 100) / PAGE_SIZE;
> +
> +                       /*
> +                        * totalram_pages() / 2 = tmpfs max blocks
> +                        */
> +                       half_tmpfs_blocks = (totalram_pages() / 2) / 2;
> +                       if (result > half_tmpfs_blocks)
> +                             goto err_nomem;



See Documentation/driver-api/early-userspace/buffer-format.rst .
Initramfs images can be made up of several concatenated cpio archives,
which would throw off these calculations.



> +
> +                       return 0;
> +                 }
> +           }
> +     }
> +
> +     pr_err("This compression format is not supported.\n");
> +     return -EOPNOTSUPP;
> +
> +err_overflow:
> +     pr_err("Decompressed size overflow!\n");
> +     return -ERANGE;
> +err_nomem:
> +     pr_err("Decompressed size exceeds tmpfs max blocks limit!\n");
> +     return -ENOMEM;
> +
> +}
> +#endif
> +
>  static char * __init unpack_to_rootfs(char *buf, unsigned long len)
>  {
>       long written;
> @@ -504,6 +655,17 @@ static char * __init unpack_to_rootfs(char *buf, unsigned long len)
>       const char *compress_name;
>       static __initdata char msg_buf[64];
>  
> +#ifdef CONFIG_TMPFS
> +     int ret = validate_rootfs_size(buf, len);
> +
> +     if (ret) {
> +           snprintf(msg_buf, sizeof(msg_buf),
> +                       "Rootfs does not comply with the rules, error code: %d", ret);
> +           message = msg_buf;
> +           return message;
> +     }
> +#endif
> +
>       header_buf = kmalloc(110, GFP_KERNEL);
>       symlink_buf = kmalloc(PATH_MAX + N_ALIGN(PATH_MAX) + 1, GFP_KERNEL);
>       name_buf = kmalloc(N_ALIGN(PATH_MAX), GFP_KERNEL);
> -- 
> 2.25.1
>


