lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200305190253.GA28787@avx2>
Date:   Thu, 5 Mar 2020 22:02:53 +0300
From:   Alexey Dobriyan <adobriyan@...il.com>
To:     torvalds@...ux-foundation.org, corbet@....net
Cc:     linux-kernel@...r.kernel.org, linux-arch@...r.kernel,
        linux-doc@...r.kernel.org, x86@...nel.org
Subject: [PATCH] doc: code generation style

I wonder if it would be useful to have something like this in tree.

It states trivial things for anyone who looked at disassembly few times
but still...

Signed-off-by: Alexey Dobriyan <adobriyan@...il.com>
---

 Documentation/process/code-generation.rst |  196 ++++++++++++++++++++++++++++++
 1 file changed, 196 insertions(+)

new file mode 100644
--- /dev/null
+++ b/Documentation/process/code-generation.rst
@@ -0,0 +1,196 @@
+Code generation
+===============
+
+1) Generic techniques
+---------------------
+
+### a) Inlining/uninlining function calls ###
+
+External function call is serious business from code generation point of view.
+ABIs require that specific arguments are placed into specific registers before
+doing the call forcing spilling and register shuffling to accomodate ABI rules.
+Clobbered registers which aren't used by a function are wasted. Declaring
+function as ``static inline`` in a header gives compiler more information
+to work with.
+
+However, excessing inlining often leads to code bloat for no measurable
+performance gains. In such case it is probably better to save on generated code
+for icache, disk I/O and network bandwidth costs.
+
+Use ``noinline`` attribute to prevent inlining inside translation unit and
+see what happens:
+
+.. code-block:: c
+
+        noinline
+        int f()
+        {
+                ...
+        }
+
+It is hard to advice any more than that as modern compilers generate code in
+mysterious ways.
+
+
+### b) Appending arguments ###
+
+Some functions are thin wrappers appending an argument or two to another
+function which actually does the job:
+
+.. code-block:: c
+
+        int g(int, int, flag_t);
+        int f(int a, int b)
+        {
+                return g(a, b, FLAG_C);
+        }
+
+Appending an argument at the end adds minimum amount of code:
+
+.. code-block:: none
+
+        f:
+                mov     edx, FLAG_C
+                jmp     g
+
+Appending an argument in the middle or in the beginning will generate
+reshuffle sequence:
+
+.. code-block:: none
+
+        f:
+                mov     edx, esi
+                mov     esi, edi
+                mov     edi, FLAG_C
+                jmp     g
+
+Do not enforce this rule religiously as there may be other reasons for
+specific argument order most notably keeping related arguments together
+at source level.
+
+
+2) Architecture specific issues (i386/x86_64)
+---------------------------------------------
+
+### a) Member placement ###
+
+First member of any structure is very special on i386/x86_64: compiler will
+use ``[r32]`` or ``[r64]`` addressing mode which has the shortest encoding.
+After laying out members of a structure into cachelines for performance,
+move most often used member of the first cacheline to the very beginning.
+
+Done that, pay attention to bytes 1--127. Members placed there will be encoded
+with ``[r64+disp8]`` encoding (or ``[r32+disp8]`` on i386). This is only 1 byte
+longer than encoding used for the first member but 3 bytes _shorter_ than
+``[r64+disp32]`` used for all other members. Try to shift more often used
+members into first 2 cachelines.
+
+"Refugee" members living in byte 128 and beyond can be placed in any order.
+
+
+### b) Implicit 32/64-bit casts
+
+Avoid casts which change signedness and/or bitness of a value.
+
+If some piece of data appears in the code it generally should be kept in its
+original type unless there are specific reasons to do otherwise (packing, etc).
+With C's seemingly arcane implicit and explicit casting rules this is good advice
+from programming language point of view as well.
+
+Given the code:
+
+.. code-block:: c
+
+        void f(size_t);
+
+        int len = strlen(s);
+        f(len);
+
+if compiler doesn't or can't maintain value ranges through casts it will have
+no choice but to assume that all "size_t" values are possible and emit MOVSX
+instruction:
+
+.. code-block:: none
+
+        mov     rdi, ...
+        call    strlen
+        movsx   rdi, eax
+        call    f
+
+MOVSX by itself it not a problem but it a) may be 1 byte longer than MOV
+instruction with same arguments and b) it won't be handled by register renaming,
+increasing dependency chains by 1 instruction.
+
+
+### c) 64-bitness ###
+
+64-bit instruction are 1-byte longer than corresponding 32-bit equivalents
+on x86_64.
+
+There is one big 64-bit enabler which is dynamic memory allocation: all
+kmalloc variant accept ``size_t`` and ``sizeof`` operator returns ``size_t``.
+
+Do not use 64-bit/``size_t`` unless strictly necessary (pointer-to-integer
+conversion, syscall ABI interfaces, integers which can be genuinely big on
+big machines, statistics).
+
+Use 32-bit ``unsigned int``. Kernel simply doesn't to individual 4+ GB
+allocations and if it does it probably goes via page allocator. Such huge
+amounts of memory simply aren't needed: network doesn't do gigabyte packets,
+VFS caps IO at 2 GB minus a little and interating with userspace via
+``copy_from_user``/``copy_to_user`` is capped at ``INT_MAX`` as well.
+
+.. code-block:: c
+
+        #define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
+
+The only exceptional case is ``size_t`` value being passed directly into
+a standard function accepting ``size_t`` (``memset``, ``memcpy``, ...).
+Truncating value to 32-bit won't do anything useful in this case.
+
+
+### d) 16-bitness ###
+
+16-bit instructions will generate 1-byte operand size override prefix (66)
+which again bloats an instruction by 1 byte. Unlike REX prefixes, this is
+unavoidable.
+
+It is better to use 16-bit types at ABI/protocol/memory level, convert
+to plain ``int``/``unsigned int`` as soon as possible and work with that.
+
+Preferred order of bitness on x86_64 is:
+
+        32/8-bit > 64-bit > 16-bit.
+
+3) Architecture specific issues (arm/arm64)
+-------------------------------------------
+
+### Constant flags value selection ###
+
+"Tight" constants can be loaded into a register in 1 instruction on arm and
+other RISC architectures:
+
+.. code-block:: c
+
+        int f()
+        {
+                return 1;
+        }
+
+.. code-block:: none
+
+        00000000 <f>:
+           0:   e3a00001        mov     r0, #1
+           4:   e12fff1e        bx      lr
+
+Constants which don't fit into 12-bit window on arm will be loaded from memory
+or constructed with 2 loads:
+
+.. code-block:: none
+
+        00000000 <f>:
+           0:   e59f0000        ldr     r0, [pc]        ; 8 <f+0x8>
+           4:   e12fff1e        bx      lr
+           8:   00000801        .word   0x00000801      ; <=== 2049
+
+After settling on flags/constants push often used values together bitwise.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ