Accelerate Color Conversion Functions using NASM
Introduction
While developing certain features of our software, we came to the conclusion that it was time to think about acceleration and optimization. But what do you do when it seems that everything which could be accelerated is already accelerated, and everything that needed optimizing is already optimized?
By developing color-conversion functions, we achieved good performance thanks to the simplicity of the conversions. Still, we wondered whether there were any external methods we could use to accelerate our functions further. In search of ways to speed up the color-conversion process, we decided to try Assembler. Even though Assembler is an unpopular and complex programming language, it remains indispensable for a diverse range of problems. There are several assembler translators available for Intel processors, such as FASM, MASM, YASM, and NASM (the list is not exhaustive). In this article we use the NASM translator because it is multi-platform (available for Linux, Windows, and macOS), has a powerful macro system, and has a relatively large user base.
Our article will show how to download NASM and hook the Assembler up on Linux (CLion) and Windows (Visual Studio). We will show how to use Assembler in your projects and share the results we obtained, so that you can decide whether or not you need to include Assembler in your projects.
A summary of NASM:
NASM (Netwide Assembler) is a free and open-source assembler that supports several platforms and object file formats. Its syntax is more explicit and consistent than that of MASM (the Microsoft Macro Assembler), and NASM is regularly updated with new instruction sets. It is one of the most capable multi-platform assemblers available and is widely used by assembly developers.
NASM also contains a powerful macro processor that supports conditional assembly, multi-level file inclusion, two forms of macro definitions (single-line and multi-line), and a "context stack" mechanism that extends macro definitions. All preprocessor directives start with a % sign.
Install NASM
We would like to show you how to install and use NASM on your machine, covering both Windows and Linux.
On a Linux system, install the nasm package with:
- apt-get install nasm (Ubuntu)
- yum install nasm (CentOS)
If you want to build NASM from the source, go to https://www.nasm.us and click DOWNLOAD.
Choose a version. We used the latest NASM version at the time of writing, 2.15.05.
There are directories with installers for each platform and archives with the source code.
You can install NASM in different ways. The most basic and useful one is to download the source archive, such as nasm-2.15.05.tar.gz, extract it, and run the ./configure shell script from the unpacked directory. Once NASM is configured, build it with the make command and install it with make install. The default installation directory is /usr/local/bin. Note that the INSTALL file contains more installation information; it may help you find the instructions required for your system.
On Windows systems, download the archive that matches your system, either win32 or win64. Next, unzip the archive to any folder you like and run the .exe installer, making sure NASM ends up in your PATH variable.
Next, let's check that Visual Studio picks up the NASM translator. To do this, create an empty C++ project.
Now create an .asm file. To do this, choose the C++ File (.cpp) template and change the file extension to .asm.
After creating the file, right-click the solution, go to Build Dependencies, and click Build Customizations, as shown in the picture below.
The final step is to make sure that Visual Studio recognizes the assembler file and selects the right translator; otherwise, you must check the nasm boxes manually.
After these preparations, NASM is set up in your development environment and ready for use.
How to use NASM in a project?
First, we should point out that function arguments are passed in registers, and the conventions differ between platforms. On Windows we use __fastcall and on Linux __attribute__((regparm(3))), both hidden behind the DEF_FASTCALL macro. The __fastcall keyword indicates that the first four arguments are passed in registers (RCX, RDX, R8, R9) and the remaining arguments are passed on the stack. The regparm(number) attribute allows passing number arguments in registers, with the rest of the arguments passed on the stack.
#if PLATFORM_WINDOWS
#define DEF_FASTCALL __fastcall
#else
#define DEF_FASTCALL __attribute__((__regparm__(3)))
#endif
Then we declare the color-conversion functions. extern "C" indicates that the functions use the C calling and naming convention. There are two pairs of functions: the first pair converts from YUV420P to P010, implemented in plain C++ (PlainC) and in AVX2 assembly (asm); the second pair converts from YUV420P to UYVY422.
extern "C" bool DEF_FASTCALL convertYUV420PToP010_PlainC(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t * dst_y, int32_t const dst_stride_y,
uint8_t * dst_uv, int32_t const dst_stride_uv,
int32_t const width, int32_t const height);
extern "C" bool DEF_FASTCALL convertYUV420PToP010_AVX2(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t * dst_y, int32_t const dst_stride_y,
uint8_t * dst_uv, int32_t const dst_stride_uv,
int32_t const width, int32_t const height);
extern "C" bool DEF_FASTCALL convertYUV420PToUYVY422_PlainC(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t * dst, int32_t const dst_stride,
int32_t const width, int32_t const height);
extern "C" bool DEF_FASTCALL convertYUV420PToUYVY422_AVX2(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t * dst, int32_t const dst_stride,
int32_t const width, int32_t const height);
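These declarations come in PlainC/AVX2 pairs so that the caller can choose an implementation at run time. The article does not show its dispatch code; the following is only a minimal sketch of how such a choice could be made, assuming the declarations above are visible and using __builtin_cpu_supports on GCC/Clang and the __cpuidex intrinsic on MSVC (the helper and dispatcher names here are our own, hypothetical ones):
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
static bool cpuHasAVX2()
{
    int regs[4] = {0, 0, 0, 0};
    __cpuidex(regs, 7, 0);              // CPUID leaf 7, subleaf 0: structured extended features
    return (regs[1] & (1 << 5)) != 0;   // EBX bit 5 = AVX2 (a full check would also verify OS XSAVE support)
}
#else
static bool cpuHasAVX2()
{
    return __builtin_cpu_supports("avx2");
}
#endif

// Hypothetical dispatcher: picks the AVX2 path when available, otherwise falls back to PlainC.
bool convertYUV420PToP010(
    uint8_t const* src_y, int32_t const src_stride_y,
    uint8_t const* src_u, int32_t const src_stride_u,
    uint8_t const* src_v, int32_t const src_stride_v,
    uint8_t* dst_y, int32_t const dst_stride_y,
    uint8_t* dst_uv, int32_t const dst_stride_uv,
    int32_t const width, int32_t const height)
{
    static const bool has_avx2 = cpuHasAVX2();
    if (has_avx2)
        return convertYUV420PToP010_AVX2(src_y, src_stride_y, src_u, src_stride_u,
                                         src_v, src_stride_v, dst_y, dst_stride_y,
                                         dst_uv, dst_stride_uv, width, height);
    return convertYUV420PToP010_PlainC(src_y, src_stride_y, src_u, src_stride_u,
                                       src_v, src_stride_v, dst_y, dst_stride_y,
                                       dst_uv, dst_stride_uv, width, height);
}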
Note that the PlainC functions are implemented in a .cpp file and the AVX2 functions in an .asm file. The PlainC implementation of the YUV420P to P010 conversion is the following:
bool convertYUV420PToP010_PlainC(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t* dst_y, int32_t const dst_stride_y,
uint8_t* dst_uv, int32_t const dst_stride_uv,
int32_t const width, int32_t const height){
    for (int32_t y = 0; y < height; y++)
    {
        const auto p_src_y = (src_y + y * src_stride_y);
        uint16_t* p_dst_y = reinterpret_cast<uint16_t*>(dst_y + y * dst_stride_y);
        for (int32_t x = 0; x < width; x++)
        {
            // expand the 8-bit luma sample to a 16-bit P010 sample (data in the high bits)
            p_dst_y[x] = static_cast<uint16_t>(p_src_y[x]) << 8;
        }
    }
    for (int32_t y = 0; y < height / 2; y++)
    {
        const auto p_src_u = (src_u + (y * src_stride_u));
        const auto p_src_v = (src_v + (y * src_stride_v));
        uint16_t* p_dst_uv = reinterpret_cast<uint16_t*>(dst_uv + y * dst_stride_uv);
        for (int32_t x = 0; x < width / 2; x++)
        {
            // interleave U and V into the P010 UV plane
            p_dst_uv[0] = static_cast<uint16_t>(p_src_u[x]) << 8;
            p_dst_uv[1] = static_cast<uint16_t>(p_src_v[x]) << 8;
            p_dst_uv += 2;
        }
    }
    return true;
}
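The UYVY422 pair follows the same pattern. Its PlainC body is not reproduced in the article, but a minimal sketch, assuming the standard UYVY byte order (U0 Y0 V0 Y1 for every two pixels), could look like this (the actual implementation may differ in details):
bool convertYUV420PToUYVY422_PlainC(
    uint8_t const* src_y, int32_t const src_stride_y,
    uint8_t const* src_u, int32_t const src_stride_u,
    uint8_t const* src_v, int32_t const src_stride_v,
    uint8_t* dst, int32_t const dst_stride,
    int32_t const width, int32_t const height)
{
    for (int32_t y = 0; y < height; y++)
    {
        const uint8_t* p_src_y = src_y + y * src_stride_y;
        const uint8_t* p_src_u = src_u + (y / 2) * src_stride_u; // one chroma row serves two luma rows
        const uint8_t* p_src_v = src_v + (y / 2) * src_stride_v;
        uint8_t* p_dst = dst + y * dst_stride;
        for (int32_t x = 0; x < width / 2; x++)
        {
            // UYVY packs one U/V pair with two luma samples: U0 Y0 V0 Y1
            p_dst[0] = p_src_u[x];
            p_dst[1] = p_src_y[2 * x];
            p_dst[2] = p_src_v[x];
            p_dst[3] = p_src_y[2 * x + 1];
            p_dst += 4;
        }
    }
    return true;
}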
For the AVX2 functions, we make the function symbols visible externally with the global directive:
global convertYUV420PToUYVY422_AVX2
global convertYUV420PToP010_AVX2
Arguments are passed in registers differently on Windows and Linux.
How it works on Windows, according to the calling convention: look at the part of the color conversion function implementation below. We can use sixteen 64-bit registers: %rax, %rcx, %rdx, %r8, %r9, %r10, %r11, %r12, %r13, %r14, %r15, %rdi, %rsi, %rbx, %rbp, %rsp. According to the calling convention, the four registers %rcx, %rdx, %r8, and %r9 are used by default, which means the first four function arguments are passed in them; the remaining arguments are passed on the stack. If the function returns a value, it is returned in the %rax register. Note also the nonvolatile register group: %rbx, %rbp, %rdi, %rsi, %rsp, %r12, %r13, %r14, %r15. These registers must be saved and restored by the called function if it uses them, whereas volatile registers such as %r10 and %r11 must be preserved by the caller if their values are needed across a call. %rsp is used as the stack pointer, and %rbp may be used as a frame pointer.
;---------------------------------------------------------------------------------------------------------------
; bool __fastcall convertYUV420PToP010_AVX2(uint8_t const* src_y, int32_t const src_stride_y, uint8_t const* src_u, int32_t const src_stride_u, uint8_t const* src_v, int32_t const src_stride_v, uint8_t* dst_y, int32_t const dst_stride_y, uint8_t* dst_uv, int32_t const dst_stride_uv, int32_t const width, int32_t const height);
;---------------------------------------------------------------------------------------------------------------
ALIGN 32
convertYUV420PToP010_AVX2:
%assign STACK_PARAMS_BASE_OFFSET 60h ; return address (8) + shadow space (32) + 7 saved registers (56) = 96 bytes
push r12 ; save registers
push r13
push r14
push r15
push rdi
push rsi
push rbx
%xdefine src_y rcx ; redefine argument alias to register
%xdefine src_stride_y rdx
%xdefine src_u r8
%xdefine src_stride_u r9
%xdefine src_v r10
mov r10, QWORD[rsp+STACK_PARAMS_BASE_OFFSET] ; argument passed via stack
%xdefine src_stride_v r11
mov r11d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 8]
%xdefine dst_y rax
mov rax, QWORD[rsp+STACK_PARAMS_BASE_OFFSET + 16]
%xdefine dst_stride_y r12
mov r12d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 24]
%xdefine dst_uv r13
mov r13, QWORD[rsp+STACK_PARAMS_BASE_OFFSET + 32]
%xdefine dst_stride_uv r14
mov r14d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 40]
%xdefine width r15
mov r15d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 48]
%xdefine height rdi
mov edi, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 56]
%xdefine p_src rsi ; define local variables
%xdefine p_dst rbx
What about register usage on Linux? We can use sixteen 64-bit registers: %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rsp, %rbp, %r8, %r9, %r10, %r11, %r12, %r13, %r14, %r15. The return value, if any, is placed in %rax. %rsp points to the top element of the stack. The %rbx, %r12, %r13, %r14, and %r15 registers are preserved across function calls. The %rdi, %rsi, %rdx, %rcx, %r8, and %r9 registers hold the first six function parameters; the remaining parameters are passed on the stack.
;---------------------------------------------------------------------------------------------------------------
; bool __fastcall convertYUV420PToP010_AVX2(uint8_t const* src_y, int32_t const src_stride_y, uint8_t const* src_u, int32_t const src_stride_u, uint8_t const* src_v, int32_t const src_stride_v, uint8_t* dst_y, int32_t const dst_stride_y, uint8_t* dst_uv, int32_t const dst_stride_uv, int32_t const width, int32_t const height);
;---------------------------------------------------------------------------------------------------------------
ALIGN 32
convertYUV420PToP010_AVX2:
%assign STACK_PARAMS_BASE_OFFSET 30h ; return address (8) + 5 saved registers (40) = 48 bytes
push rbx ; save registers
push r12
push r13
push r14
push r15
%xdefine src_y rdi ; redefine argument alias to register
%xdefine src_stride_y rsi
%xdefine src_u rdx
%xdefine src_stride_u rcx
%xdefine src_v r8
%xdefine src_stride_v r9
%xdefine dst_y rax
mov rax, QWORD[rsp+STACK_PARAMS_BASE_OFFSET] ; argument passed via stack
%xdefine dst_stride_y r10
mov r10d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 8]
%xdefine dst_uv r11
mov r11, QWORD[rsp+STACK_PARAMS_BASE_OFFSET + 16]
%xdefine dst_stride_uv rbx
mov ebx, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 24]
%xdefine width r12
mov r12d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 32]
%xdefine height r13
mov r13d, DWORD[rsp+STACK_PARAMS_BASE_OFFSET + 40]
%xdefine p_src r14 ; define local variables
%xdefine p_dst r15
Alternatively, you can use __cdecl, the default calling convention for C and C++ programs. The main difference from __fastcall is that all arguments are passed on the stack. On Windows we use __cdecl and on Linux __attribute__((cdecl)), both hidden behind the DEF_CDECL macro:
#if PLATFORM_WINDOWS
#define DEF_CDECL __cdecl
#else
#define DEF_CDECL __attribute__((cdecl))
#endif
The functions are declared in the same way as in the __fastcall case. Here is one pair as an example:
extern "C" bool DEF_CDECL convertYUV420PToP010_PlainC(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t * dst_y, int32_t const dst_stride_y,
uint8_t * dst_uv, int32_t const dst_stride_uv,
int32_t const width, int32_t const height);
extern "C" bool DEF_CDECL convertYUV420PToP010_AVX2(
uint8_t const* src_y, int32_t const src_stride_y,
uint8_t const* src_u, int32_t const src_stride_u,
uint8_t const* src_v, int32_t const src_stride_v,
uint8_t * dst_y, int32_t const dst_stride_y,
uint8_t * dst_uv, int32_t const dst_stride_uv,
int32_t const width, int32_t const height);
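To give an idea of how these functions are driven from the C++ side, here is a minimal usage sketch. The frame size, the tightly packed strides, and the single-call timing are illustrative assumptions only, not the benchmark harness we actually used:
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Assumes the convertYUV420PToP010_PlainC / _AVX2 declarations above are visible.
int main()
{
    const int32_t width = 1920, height = 1080;
    std::vector<uint8_t> src_y(width * height, 128);            // YUV420P planes, tightly packed
    std::vector<uint8_t> src_u((width / 2) * (height / 2), 128);
    std::vector<uint8_t> src_v((width / 2) * (height / 2), 128);
    std::vector<uint8_t> dst_y(width * height * 2);              // P010 luma: 16 bits per sample
    std::vector<uint8_t> dst_uv(width * height);                 // P010 interleaved UV plane (half height)

    auto run = [&](auto&& convert, const char* name) {
        const auto start = std::chrono::steady_clock::now();
        convert(src_y.data(), width, src_u.data(), width / 2, src_v.data(), width / 2,
                dst_y.data(), width * 2, dst_uv.data(), width * 2, width, height);
        const auto stop = std::chrono::steady_clock::now();
        std::printf("%s: %lld us\n", name, static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()));
    };

    run(convertYUV420PToP010_PlainC, "PlainC");
    run(convertYUV420PToP010_AVX2, "AVX2");
    return 0;
}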
Results
As an example, we benchmarked the two color conversion functions, YUV420P to P010 and YUV420P to UYVY422, on 1, 4, and 8 threads. The performance results on Windows are shown in the figure below.
The performance results for Linux are shown below. The comparison table shows a clear performance increase for the YUV420P to UYVY422 color conversion written in Assembler, which is explained by the complexity of that conversion algorithm. The conversion from YUV420P to P010 is more elementary and shows only an insignificant performance increase for the AVX2 implementation on one thread.
If we compare PlainC vs. AVX2 and PlainC vs. AVX2 on 4 threads for the YUV420P to UYVY422 conversion, we observe that AVX2 on one thread works faster than on 4 or 8 threads.
We can see from the results that the acceleration of the YUV420P to P010 conversion with Assembler is not as large as that of the YUV420P to UYVY422 conversion. To explain this, we will look at the PlainC versions of these functions and examine the code the compiler generated for them.
The pictures below show the color conversion from YUV420P to P010: on the left is what the compiler produced, and on the right is our handwritten AVX2 version.
We can see that the compiler also uses vector instructions. Since the function is quite simple, the manually written version gives only an insignificant speed gain, as can be seen from the results. From this we can conclude that optimizing simple functions with Assembler will not bring much benefit. The situation with more complex functions is different: the compiler can lose to a manually written implementation, and, as the results show, the increase was significant in the YUV420P to UYVY422 case.
Intel® Architecture Code Analyzer
We decided to check the performance results against the throughput report we obtained with the Intel® Architecture Code Analyzer (IACA), a tool that can analyze the throughput of a block of code. Look at the Code Analyzer output below.
If the throughput bottleneck is the front end, it means the analyzed block has no limiting data dependencies.
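For reference, IACA analyzes the machine code between special marker sequences placed around the block of interest. The snippet below is only a minimal sketch, assuming the iacaMarks.h header that ships with the tool and its IACA_START/IACA_END macros (the GCC/Clang form; 64-bit MSVC uses IACA_VC64_START/IACA_VC64_END); it is not the code we actually analyzed, and the loop body is purely illustrative:
#include <cstdint>
#include "iacaMarks.h" // ships with the IACA download and defines the marker macros

void analyzed_loop(uint16_t* dst, const uint8_t* src, int n)
{
    for (int i = 0; i < n; i++)
    {
        IACA_START // marks the beginning of the block IACA should analyze
        dst[i] = static_cast<uint16_t>(src[i]) << 8; // illustrative loop body
    }
    IACA_END // marks the end of the analyzed block
}
The compiled object file is then fed to the iaca executable, which produces the throughput report; a binary built with the markers is intended for analysis only, not for normal execution.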
Conclusion
Studying Assembler offers many benefits in and of itself:
- you will learn how the CPU and the RAM work;
- you will learn how the computer and the operating system work together;
- you will understand how high-level language compilers generate machine code, and this knowledge can help you write more efficient code;
- you will have a more effective means of analyzing errors in programs.
But, as we have shown in this article, sometimes using Assembler is reasonable, and sometimes it can cost you a lot of nerves and even slow your code down. In the case of relatively simple programs, rewriting the code in assembly language is unlikely to bring any advantage over C++. However, in more complex programs, using Assembler is likely to give a significant speed advantage. A critical observation was that AVX2 instructions work poorly with virtual threads.
If our article has inspired you to get into Assembler, the best place to start is the Intel Reference Guide, which you can find at the link below:
https://software.intel.com/en-us/articles/intel-sdm
Also, a helpful source of information can be found here:
https://www.felixcloutier.com/x86/
This site provides a list of all instructions with brief descriptions of their usage.