Performance tuning

When compiled with -prod, V's generated C code usually performs well. However, in specialized scenarios, additional compiler flags and attributes can further optimize the executable for performance, memory usage, or size.

[!NOTE] These are rarely needed, and should not be used unless you profile your code, and then see that there are significant benefits for them. To cite GCC's documentation: "Programmers are notoriously bad at predicting how their programs actually perform".

Tuning Operation Benefits Drawbacks
[inline] Performance Increased executable size
[direct_array_access] Performance Safety risks
[packed] Memory usage Potential performance loss
[minify] Performance, Memory usage May break binary serialization/reflection
_likely_/_unlikely_ Performance Risk of negative performance impact
-skip-unused Performance, Compile time, Size Potential instability
-fast-math Performance Risk of incorrect mathematical operations results
-d no_segfault_handler Compile time, Size Loss of segfault trace
-cflags -march=native Performance Risk of reduced CPU compatibility
PGO Performance, Size Usage complexity

Tuning operations details

[inline]

You can tag functions with [inline], so the C compiler will try to inline them, which in some cases, may be beneficial for performance, but may impact the size of your executable.

When to Use

  • Functions that are called frequently in performance-critical loops.

When to Avoid

  • Large functions, as it might cause code bloat and actually decrease performance.
  • Large functions in if expressions - may have negative impact on instructions cache.

[direct_array_access]

In functions tagged with [direct_array_access] the compiler will translate array operations directly into C array operations - omitting bounds checking. This may save a lot of time in a function that iterates over an array but at the cost of making the function unsafe - unless the boundaries will be checked by the user.

When to Use

  • In tight loops that access array elements, where bounds have been manually verified or you are sure that the access index will be valid.

When to Avoid

  • Everywhere else.

[packed]

The @[packed] attribute can be applied to a structure to create an unaligned memory layout, which decreases the overall memory footprint of the structure. Using the [packed] attribute may negatively impact performance or even be prohibited on certain CPU architectures.

When to Use

  • When memory usage is more critical than performance, e.g., in embedded systems.

When to Avoid

  • On CPU architectures that do not support unaligned memory access or when high-speed memory access is needed.

[aligned]

The @[aligned] attribute can be applied to a structure or union to specify a minimum alignment (in bytes) for variables of that type. Using the @[aligned] attribute you can only increase the default alignment. Use @[packed] if you want to decrease it. The alignment of any struct or union, should be at least a perfect multiple of the lowest common multiple of the alignments of all of the members of the struct or union.

Example:

// Each u16 in the `data` field below, takes 2 bytes, and we have 3 of them = 6 bytes. // The smallest power of 2, bigger than 6 is 8, i.e. with `@[aligned]`, the alignment // for the entire struct U16s, will be 8: @[aligned] struct U16s { data [3]u16 }

When to Use

  • Only if the instances of your types, will be used in performance critical sections, or with specialised machine instructions, that do require a specific alignment to work.

When to Avoid

  • On CPU architectures, that do not support unaligned memory access. If you are not working on performance critical algorithms, you do not really need it, since the proper minimum alignment is CPU specific, and the compiler already usually will choose a good default for you.

[!NOTE] You can leave out the alignment factor, i.e. use just @[aligned], in which case the compiler will align a type to the maximum useful alignment for the target machine you are compiling for, i.e. the alignment will be the largest alignment which is ever used for any data type on the target machine. Doing this can often make copy operations more efficient, because the compiler can choose whatever instructions copy the biggest chunks of memory, when performing copies to or from the variables which have types that you have aligned this way.

See also "What Every Programmer Should Know About Memory", by Ulrich Drepper .

[minify]

The [minify] attribute can be added to a struct, allowing the compiler to reorder the fields in a way that minimizes internal gaps while maintaining alignment. Using the [minify] attribute may cause issues with binary serialization or reflection. Be mindful of these potential side effects when using this attribute.

When to Use

  • When you want to minimize memory usage and you're not using binary serialization or reflection.

When to Avoid

  • When using binary serialization or reflection, as it may cause unexpected behavior.

_likely_/_unlikely_

if _likely_(bool expression) { - hints to the C compiler, that the passed boolean expression is very likely to be true, so it can generate assembly code, with less chance of branch misprediction. In the JS backend, that does nothing.

if _unlikely_(bool expression) { is similar to _likely_(x), but it hints that the boolean expression is highly improbable. In the JS backend, that does nothing.

When to Use

  • In conditional statements where one branch is clearly more frequently executed than the other.

When to Avoid

  • When the prediction can be wrong, as it might cause a performance penalty due to branch misprediction.

-skip-unused

This flag tells the V compiler to omit code that is not needed in the final executable to run your program correctly. This will remove unneeded const arrays allocations and unused functions from the code in the generated executable.

This flag will be on by default in the future when its implementation will be stabilized and all severe bugs will be found.

When to Use

  • For production builds where you want to reduce the executable size and improve runtime performance.

When to Avoid

  • Where it doesn't work for you.

-fast-math

This flag enables optimizations that disregard strict compliance with the IEEE standard for floating-point arithmetic. While this could lead to faster code, it may produce incorrect or less accurate mathematical results.

The full specter of math operations that -fast-math affects can be found here.

When to Use

  • In applications where performance is more critical than precision, like certain graphics rendering tasks.

When to Avoid

  • In applications requiring strict mathematical accuracy, such as scientific simulations or financial calculations.

-d no_segfault_handler

Using this flag omits the segfault handler, reducing the executable size and potentially improving compile time. However, in the case of a segmentation fault, the output will not contain stack trace information, making debugging more challenging.

When to Use

  • In small, well-tested utilities where a stack trace is not essential for debugging.

When to Avoid

  • In large-scale, complex applications where robust debugging is required.

-cflags -march=native

This flag directs the C compiler to generate instructions optimized for the host CPU. This can improve performance but will produce an executable incompatible with other/older CPUs.

When to Use

  • When the software is intended to run only on the build machine or in a controlled environment with identical hardware.

When to Avoid

  • When distributing the software to users with potentially older CPUs.

PGO (Profile-Guided Optimization)

PGO allows the compiler to optimize code based on its behavior during sample runs. This can improve performance and reduce the size of the output executable, but it adds complexity to the build process.

When to Use

  • For performance-critical applications where the added build complexity is justifiable.

When to Avoid

  • For small, short-lived, or rapidly-changing projects where the added build complexity isn't justified.

PGO with Clang

This is an example bash script you can use to optimize your CLI V program without user interactions. In most cases, you will need to change this script to make it suitable for your particular program.

#!/usr/bin/env bash

# Get the full path to the current directory
CUR_DIR=$(pwd)

# Remove existing PGO data
rm -f *.profraw
rm -f default.profdata

# Initial build with PGO instrumentation
v -cc clang -skip-unused -prod -cflags -fprofile-generate -o pgo_gen .

# Run the instrumented executable 10 times
for i in {1..10}; do
    ./pgo_gen
done

# Merge the collected data
llvm-profdata merge -o default.profdata *.profraw

# Compile the optimized version using the PGO data
v -cc clang -skip-unused -prod -cflags "-fprofile-use=${CUR_DIR}/default.profdata" -o optimized_program .

# Remove PGO data and instrumented executable
rm *.profraw
rm pgo_gen