Image Blending in Windows – an Assembly Language Approach
To C, or Not to C?
Quality code from a C++ perspective could well be horribly wasteful and inefficient from the CPU’s viewpoint. If you’re not playing directly in the CPU’s arena, you’re not likely to come anywhere near maximizing its potential. With high-demand tasks such as image blending, you can’t rely on the Windows API – especially GDI functions – to get the job done efficiently. Windows is designed to be all things to all developers; it places ritual far ahead of performance. If you’re going to eliminate bottlenecks efficiently, you can’t go wrong by turning to assembly language.
The caveat here is that intrinsics are not a safe alternative. To squeeze every last iota of performance out of the CPU, you can’t write generic code that doesn’t know or care whether it’s running on a lightspeed hundredth generation i9 or a clunky lightweight ARM processor. In short, you can’t milk quality out of shortcuts. To get the best performance, sometimes you have to work for it: you must tailor your work to the hardware it’s running on. In the end, that work pays off in spades.
The Pitfalls of BitBlt
The first thing to note is that when you create a bitmap with CreateCompatibleBitmap, the very limited kernel memory is used to store that image. If Microsoft documents this fact at all, that discussion is well hidden from prying eyes. If such documentation exists, I never saw it. (And it may well exist; maybe I have no excuse for never having seen it!) When creating bitmaps on the fly, I’ve hit the “insufficient memory” wall exactly once; that experience was what led me to begin scouring development forums to track down where I was going wrong. That was when I learned about which bitmaps go into which memory areas, and in the course of adjusting for that, that particular problem never returned.
CreateDIBSection uses memory off the current process’ heap; making that switch solved my resource problem once and for all. I developed my own local function, CreateLocalBitmap, which takes the same parameters as CreateCompatibleBitmap (with one extra added) but creates a DIB section instead – and its functionality includes returning the location of the bitmap data. That little data pointer is worked with directly, across the board, in the image blending process discussed in this article.
My locally declared function CreateLocalBitmap always creates a 32-bit bitmap. This is done because 4 bytes per pixel work exceedingly well for processing through XMM registers. With a 24-bit format, you just can’t do it. You’ll have to convert the 24-bit format to a 32-bit layout first – a taxing process that is wholly unnecessary, given a little preparation that creates 32-bit bitmaps from the outset.
[NOTE: the following paragraph was accurate at the time it was written. It was inserted as a direct result of experiencing a forced change from 32 to 24 bits in a bitmap I was working with. However in the course of creating the sample images for this article, the effect appears to have vanished. I used BitBlt to copy from a LoadImage 24-bit format to a CreateDIBSection 32-bit format, and the 32-bit format remained. So take the following paragraph with a grain of salt – in true Windows tradition, it could happen but doesn’t always. Let your own experience guide you.]
BitBlt likes 24-bit formats, and it’s quite aggressive about forcing its views on every bitmap passing through it. Even if you’re blitting to a bitmap or DIB section that was explicitly created as a 32-bit image, BitBlt will overwrite the 32-bit data with 24-bit values. The amount of allocated bitmap memory remains unchanged from its initial size, but the expected 32-bit data format does not survive a call to BitBlt. The function is completely unware of alpha channels and will not hesitate to perform genocide on every byte of bitmap data referring to one.
Instead, AlphaBlend, designed to process alpha channels, respects the sovereignty of alpha channel bytes and leaves bitmaps formatted as 32-bit entities. Ideally, if you can get the function to work at all under Windows 10, it’s called with an opacity value of 255 (fully opaque) to emulate BitBlt without destroying the format of the bitmap data. What happens when a 24-bit image is AlphaBlended onto a 32-bit destination is a question I can’t answer – my last attempt to use that function resulted in an error that cannot and should not occur, and that was the end of my toying with it. After 22 years of developing under Windows (almost exclusively in assembly language), my intolerance for Windows API failures has grown to the point where I will no longer research most bugs and inexcusably recalcitrant functions – I simply go straight to a workaround. For example, in the case of filling a bitmap with a solid color, the silly and quite stupid (my opinion) process of creating a brush, calling FillRect, and deleting the brush, not to mention creating a memory DC to hold the bitmap while this is done, selecting the bitmap into it, then deselecting the bitmap and finally deleting the DC – is all so far over the line of sheer insanity that I couldn’t even begin to explain the thinking behind it. Instead, seven simple assembly language instructions fill the bitmap. No DC is required, no brush, no objects to select, deselect, or delete. As long as you have that bitmap data pointer, all is well.
lea rdi, BitmapData mov rax, BitmapWidth mov rcx, BitmapHeight mul rcx mov rcx, rax mov rax, FillColor rep stosd
Done … and … done. It’s just that simple. There’s no reason to make it any more complicated than that.
This article outlines a process for blending two bitmaps of 32-bit format. Alpha values are not used (premultiplied or otherwise); the blend is the final result of merging two images. The source opacity is passed as an integer percent value times one hundred, which is converted within the blend function to a single-precision float; the destination opacity is presumed to be one minus that value. For example, declaring a 20% opacity for the source bitmap would place the destination opacity at 80%. I have not found a way to do this with GDI (everything seems to be additive), and I won’t go beyond that to get the job done. DirectX requires a relatively enormous setup time that I find difficult to justify in most apps that I create. Moving beyond that to something like DirectComposition or Direct2D is for me unthinkable – I liken it to bringing a full-sized aircraft carrier to the grocery store just to haul your groceries home, when a little red wagon would do just as well. That doesn’t mean those API’s are that powerful – they’re just that needlessly complex (and embarrassingly bloated).
For the process outlined here, the focus is on speed, and to maximize speed, two main rules must be followed:
- Avoid memory access.
- Use XMM registers to process all color component values in parallel – this results in one multiplication per pixel, not three.
YMM registers don’t make sense in this kind of application because they use 64-bit values. For the process of blending images, 32-bit values are ultimately moved down to 8 bits for each color channel, and four 32-bit values processed by XMM registers is the low limit of SSE/AVX granularity.
Thinking it Through
Forethought is the precursor of performance. The impact of tailoring the task at hand for the specifics of the processor that task will run on cannot be overstated. A little analysis and planning can go a long way in creating unprecedented speed improvements; customizing code for the hardware it will run on has no parallel in its virtue.
In the actual blend loop, for optimal performance, there should only three occasions when memory is accessed: to read each pixel from image 1, to read each pixel from image 2, and to write the final blended result. Complicating the process is the inherent need to separate each pixel into color channels before performing the blend. After the blend is complete, the data again needs to be converted back from float values, reassembled into a single 32-bit value, and stored at its destination.
Logically, the process is so simple that most developers (if they don’t hand the task off to Windows) simply run a loop to execute it. The problem is that in C++ or any other language, the simplicity of the algorithm is usually mistaken as equating to speed in execution, with the particulars of actual implementation not being worth much attention.
The function created in this article will take two incoming bitmaps – one and two – and write the final blend data over the first bitmap’s incoming data. Since everything transpires across the CPU’s registers, no memory needs to be allocated (or subsequently freed) to achieve the blend.
The real key to the process lies in breaking up the pixel data into color channels, then getting that data into an XMM register where the required multiplications can be executed.
On that note, assume a blend of 20% for the first bitmap and 80% for the second. The multiplier values will not change across the life of the loop, so they should be placed into XMM registers directly from the registers they’re passed to the blend function in, then left alone.
On entry into the function:
RCX > bitmap one bits RDX > bitmap two bits R8 = bitmap one blend %, multiplied by one hundred to allow an integer value to be passed R9 = bitmap width [rsp+40] = bitmap height
The reason for passing the blend % as an integer is that 64-bit calling convention becomes confusing and convoluted when trying to pass a float. It can certainly be done, but being the only float value, the blend factor will come into the function in XMM0. You can do it if you want to; the code in this article won’t. As a general rule, I’ve found that a lot of mistakes and debugging time are saved by utilizing XMM registers in some kind of sensible order where they’re not so easily lost track of. There are sixteen of the things and it’s quite easy to forget who’s doing what and why if you implement them at random. In the end, the entire argument is purely aesthetic, but I personally find it far preferable to doing things any other way: parameter 3, R8, holds the source bitmap blend percent times a hundred.
Given color values ranging only from 0 to 255, there is no danger of overflow occurring in any registers with 100x magnitude applied to the blend factors.
On entry to the function, immediately point RDI at the first bitmap’s data, and RSI at the second. The assembly STOSD instruction writes to the location where RDI is pointing, and bitmap one is where the blended output is written, so it may as well be set up as the first order of business.
mov rdi, rcx ; On entry, RCX points to bitmap one data – copy that pointer to RDI, the output mov rsi, rdx ; On entry, RDX points to bitmap two data
With this done, it’s time to decide on XMM register usage. XMM 0 will get data from bitmap one; XMM1 will hold the blend factor for bitmap one. XMM2 will get data from bitmap two, and XMM3 will hold bitmap two’s blend factor.
There’s little left to do beyond setting up XMM1 and XMM3 with the blend factor values before diving into the process loop. Of critical importance is one final note: it would be insane to divide every output pixel by 100.0. After moving the incoming R8 into XMM0, it’s divided by 100.0 immediately so that the division – the slowest instruction on Earth – only needs to be performed once.
The following variables are declared as data:
one_hundred real4 100.0
Then the entry code for the function:
mov rdi, rcx ; Set write pointer @ bitmap 1 mov rsi, rdx ; Set read pointer @ bitmap 2 movd xmm1, r8 ; Move the incoming bitmap 1 blend factor cvtdq2ps xmm1, xmm1 ; Convert value to floating point divss xmm1, real4 ptr one_hundred ; Divide the blend factor by 100 shufps xmm1, xmm1, 0 ; Copy the low dword of XMM0 into all four XMM0 dwords
Setting XMM3 to (1 – XMM1) requires just a little creativity. There are no instructions for moving immediate data (data that’s encoded with an instruction) into an XMM register, and loading 1.0 from memory is last choice.
The CMPPS instruction compares whatever random data is in XMM3 with itself, checking for XMM3 = XMM3. The result must be true, regardless of what the register holds. All bits in XMM3 will then be set, as is done when a compare is true. Shifting those bits right by 31 bits will leave a value of 1.0 in each 32-bit division of XMM3.
cmpps xmm3, xmm3, 0 ; Compare XMM3 = XMM3 to set all bits of register psrld xmm3, 31 ; Right align bit 31, shift out all other bits cvtdq2ps xmm3, xmm3 ; Convert to floating point value subps xmm3, xmm1 ; Subtract bitmap one blend factor from 1.0
The blend factors are now correctly set as floating point values in XMM1 for image one, and XMM3 for image two.
Breaking Down and Building Up
The next parts of the process needing special attention are separating and recombining the color channel values. Fortunately, SSE will handle most of this for you.
Things would be much simpler if the blend factor didn’t come into play. All operations could be performed with SSE integer instructions, and there would be no need for ever converting values to and from floating point. But the blend factor is the entire reason for the function, and it has to be stored as a floating point value since it’s always <= 1.0. So there’s little choice in the matter but to convert the color data to floats before multiplying.
The following instruction loads the source data into XMM0 (bitmap one) and XMM2 (bitmap two):
pmovzxbd xmm0, dword ptr [ rdi ] ; Load the three color channels & the alpha channel: bitmap one cvtdq2ps xmm0, xmm0 ; Convert values to floating point pmovzxbd xmm2, dword ptr [ rsi ] ; Load the three color channels & the alpha channel: bitmap two cvtdq2ps xmm2, xmm2 ; Convert values to floating point
The above instructions separate each byte from the dword value loaded and conveniently place the data into consecutive dword locations within an XMM register. This saves oodles of execution time over having to do it manually.
Next, the multiplication occurs:
mulps xmm0, xmm1 ; Multiply color data by bitmap one blend factor mulps xmm2, xmm3 ; Multiply color data by bitmap two blend factor
Add the two values together:
addps xmm0, xmm2 ; Add the two values
Finally, build the output dword, store it, and advance both bitmap data pointers.
cvtps2dq xmm0, xmm0 ; Convert result back to integer values shufps xmm0, xmm0, 4Eh ; Shift result 2 slots or 64 bits (either direction, same result) movd ebx, xmm0 ; Get the red channel value shufps xmm0, xmm0, 93h ; Shift result 1 slot left movd eax, xmm0 ; Get the green channel value shl ebx, 8 ; Shift the accumulator left 1 byte mov bl, al ; Set green in BL shufps xmm0, xmm0, 93h ; Shift result 1 slot left movd eax, xmm0 ; Get the blue channel value shl ebx, 8 ; Shift the accumulator left 1 byte or eax, ebx ; OR the red and green with the blue currently in AL stosd ; Store the final result and advance write pointer (bitmap 1) 4 bytes
The above code comprises the entirety of the merging process.
No memory access occurs inside the loop, other than to load the source values and store the merged result. The only math performed is the multiplication of each color channel by the blend factor, and even this operation is performed simultaneously on all color channels via the XMM registers.
The complete function is shown below. Note that the images passed to this function are assumed to be the same size, and are merged in their entirety. A more sophisticated function that allowed specified areas to be selected out of each image would be considerably more complicated as far as loop controls – individual rows would need to be looped through, then columns within each row, while properly tracking the data pointers through both loops. This article focuses on utilizing assembly language, particularly XMM registers, to perform the actual blend.
The required data for the function:
align 16 ; Required for XMM access all_255 real4 4 dup ( 255.0 ) ; Compare destination for overflow check one_hundred real4 4 dup ( 100.0 ) ; Divisor of 100
Finally, the entire function:
;********************************************************************************************************************** ; BlendImages ; ; In: 1 RCX > bitmap one bits ; 2 RDX > bitmap two bits ; 3 R8 = bitmap one blend factor (integer, % * 100; bitmap 2 is 1 minus this value) ; 4 R9 = bitmap width (same for both bitmaps) ; 5 [ RSP + 20h ] = bitmap height BlendImages2 proc ; Declare the function mov rdi, rcx ; Set write pointer @ bitmap 1 mov rsi, rdx ; Set read pointer @ bitmap 2 movd xmm1, r8 ; Move the incoming bitmap 1 blend factor cvtdq2ps xmm1, xmm1 ; Convert value to floating point divss xmm1, real4 ptr one_hundred ; Divide the blend factor by 100 shufps xmm1, xmm1, 0 ; Copy the low dword of XMM0 into all four XMM0 dwords cmpps xmm3, xmm3, 0 ; Compare XMM3 = XMM3 to set all bits of register psrld xmm3, 31 ; Right align bit 31, shift out all other bits cvtdq2ps xmm3, xmm3 ; Convert to floating point value subps xmm3, xmm1 ; Subtract bitmap one blend factor from 1.0 mov rax, [ rsp + 40 ] ; Get the bitmap width mul r9 ; Multiply by height for pixel count mov rcx, rax ; Set the count through the loop BlendLoop: pmovzxbd xmm0, dword ptr [ rdi ] ; Load the three color channels & the alpha channel: bitmap one cvtdq2ps xmm0, xmm0 ; Convert values to floating point pmovzxbd xmm2, dword ptr [ rsi ] ; Load the three color channels & the alpha channel: bitmap two cvtdq2ps xmm2, xmm2 ; Convert values to floating point mulps xmm0, xmm1 ; Multiply color data by bitmap one blend factor mulps xmm2, xmm3 ; Multiply color data by bitmap two blend factor addps xmm0, xmm2 ; Add the two values ; Build and store the final output dword cvtps2dq xmm0, xmm0 ; Convert result back to integer values shufps xmm0, xmm0, 4Eh ; Shift result 2 slots or 64 bits (either direction, same result) movd ebx, xmm0 ; Get the red channel value shufps xmm0, xmm0, 93h ; Shift result 1 slot left movd eax, xmm0 ; Get the green channel value shl ebx, 8 ; Shift the accumulator left 1 byte mov bl, al ; Set green in BL shufps xmm0, xmm0, 93h ; Shift result 1 slot left movd eax, xmm0 ; Get the blue channel value shl ebx, 8 ; Shift the accumulator left 1 byte or eax, ebx ; OR the red and green with the blue currently in AL stosd ; Store the final result and advance write pointer (bitmap 1) 4 bytes add rsi, 4 ; Advance the source pointer (bitmap two) loop BlendLoop ; Return to top of loop xor rax, rax ; Zero the return value ret ; Return from function BlendImages2 endp ; End function declaration
The function conforms to the 64-bit calling convention, allowing it to be exported from a DLL and called from any language that allows calling external functions.
Limit checking is not required because the nature of the blend process makes an overflow impossible – with 8 bit color channels, no pixel can have a value greater than 255. Even if a blend factor > 1 were passed, such as 1.25, the bitmap two blend factor would be 1 – 1.25 or -0.25. Worst case, both pixels being blended would be 255. The Bitmap 1 pixel would factor in as 255 * 1.25 or 319 (rounded up); the bitmap 2 pixel would weigh in at 255 * -0.25 or -64 (rounded up). Summing the two values would yield 255. So no limit checking is required with the implementation shown.
The code in this article has actually been compiled and executed, blending the following images at 45% image 1 (the sunset) and 55% image 2 (the forest); the third (blended) image was captured from actual output generated by this article’s source code:
This article demonstrated how drastically custom tailoring code for the hardware it’s running on can speed up an operation. Garbage in, garbage out; shortcuts are the fast path to the next bottleneck.
The function outlined here is hardly made up of thousands of lines of code, and it’s almost completely self-contained. In most (if not all) modern languages, the attendant declarations, includes, and endless compiler instructions far outweigh any savings of time in the coding process. To use an analogy, every day, more and more workers are replaced with dead weight managers. The workers (actual functions) may or may not get the job done faster, but the explosion of dead weight management makes the company as a whole larger and larger each year, while producing less and less – and what’s produced is slower and slower to execute.
A forthcoming article covers a much more complex subject: implementing Gaussian blur in assembly language. The article’s code will use an 11×11 convolution kernel and will run a single pass. The speed improvement in that code is nothing short of staggering, resulting from the application of the same logic that was presented in this article: customizing code to maximize what the CPU has to offer.