Nightmare on (Overwh)Elm Street: the 64-bit Calling Convention


The 64-bit calling convention: they call it __fastcall, but when it comes to actual implementation in Windows, there’s nothing __fast about it.  It could be fast, if the calling code wasn’t bogged down in a nightmarish level of stack manipulation just to call a function.  “Penny wise and Pound foolish” never saw a more devoted implementation.  (For those who never quite understood that phrase, think in terms of British currency.)

The general rules of the ABI (Application Binary Interface) seem simple enough on their face, but when one begins to actually work within the convention, questions begin to arise.  Critical particulars here and there are shrouded in mystery, and all too often, answers are nowhere to be found.

MSDN presents the following synopsis of the ABI:

The x64 Application Binary Interface (ABI) uses a four register fast-call calling convention by default. Space is allocated on the call stack as a shadow store for callees to save those registers. There is a strict one-to-one correspondence between the arguments to a function call and the registers used for those arguments. Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers. The x87 register stack is unused. It may be used by the callee, but must be considered volatile across function calls. All floating point operations are done using the 16 XMM registers. Integer arguments are passed in registers RCX, RDX, R8, and R9. Floating point arguments are passed in XMM0L, XMM1L, XMM2L, and XMM3L. 16-byte arguments are passed by reference. Parameter passing is described in detail in Parameter Passing. In addition to these registers, RAX, R10, R11, XMM4, and XMM5 are considered volatile. All other registers are non-volatile.

Unfortunately, the above explanation leaves some unanswered questions when it comes to actual implementation.

The information in this article was pieced together from bloody life-and-death battles with Visual Studio 2017 during my efforts to create an all-assembly Win64 application that utilized DirectX.  (It was a nuclear-level conflict that I ultimately won.)  The most pressing problem I encountered was a phenomenon I call “code mangling.”  WinDbg, as well as the VS2017 debugger (such as it is), had an apparent affinity for skipping instructions, displaying them or not depending on the phase of the moon or whatever the determinant ultimately was.  VS2017 also liked to display the same source line multiple times, which seriously confused the debugging effort.  Add to the mix a C++ .DLL (for learning purposes only) exporting functions to an all-ASM app, and the end result was more volatile than nitroglycerine.  I tried debug builds, release builds, Bob the Builder, even a duck-build platypus, all to little avail.  Most of the time I was convinced I had to be dreaming because reality could never be that bad.

I was wrong.  It was that bad.

The Devil in the Details

The first problem with calling COM (in particular, DirectX) methods from a 64-bit app that doesn’t have built-in COM support is the requirement that RCX (parameter 0) holds the interface pointer (“this”) for each call.  This bumps each additional parameter up by one position beyond what’s documented if the development environment doesn’t inherently account for it.  Worse, DirectX uses 32-bit float values across the board, except for pointers, even when it’s in 64-bit mode.  This led to the first unanswered question: are float parameters (which abound in DirectX) 32-bit single, or 64-bit double, precision?  Diving into the VS2017 debugger answered that question relatively quickly: single-precision, 32-bit.  So how are these handled within the ABI?
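To make the interface-pointer bump concrete, here is a minimal C++ sketch of what a language without COM support has to do by hand.  Everything in it (IFakeDevice, ScaleDepth, CallScaleDepth) is a made-up stand-in for illustration, not a real DirectX interface; the only point is that the object pointer goes first and every documented parameter shifts down one slot.

```cpp
#include <cstdint>

// Hypothetical stand-in for a COM interface; not a real DirectX type.
struct IFakeDevice;

// The vtable is a block of function pointers. Every method takes the
// interface pointer ("this") as its first argument.
struct IFakeDeviceVtbl {
    std::uint32_t (*AddRef)(IFakeDevice* self);
    float (*ScaleDepth)(IFakeDevice* self, float depth, float factor);
};

// A COM-style object begins with a pointer to its vtable.
struct IFakeDevice {
    const IFakeDeviceVtbl* vtbl;
    std::uint32_t refCount;
};

static std::uint32_t FakeAddRef(IFakeDevice* self) { return ++self->refCount; }
static float FakeScaleDepth(IFakeDevice*, float depth, float factor) {
    return depth * factor;
}

static const IFakeDeviceVtbl kFakeVtbl = { FakeAddRef, FakeScaleDepth };

inline IFakeDevice MakeFakeDevice() { return IFakeDevice{ &kFakeVtbl, 1 }; }

// The manual call. The documentation for a method like this would list
// (depth, factor), but at the binary level they are parameters 1 and 2,
// because the interface pointer occupies parameter 0.
inline float CallScaleDepth(IFakeDevice* dev, float depth, float factor) {
    return dev->vtbl->ScaleDepth(dev, depth, factor);
}
```

A caller would write something like CallScaleDepth(&dev, 0.5f, 4.0f); in hand-written assembly, the equivalent is loading the object pointer into RCX, fetching the method address from the vtable, and placing the remaining arguments one position later than the documentation suggests.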

Even the inline function calls (the DirectXMath library) have an undocumented parameter … sometimes.  It all depends on the function.  Typically (but not always), functions returning an XMMATRIX structure require a pointer to the output matrix destination in RCX when the call is made.  If you code the statement

                mOut = XMMFoo ( Parameter1, Parameter2 );

then what is actually coded is

                RAX = XMMFoo ( &mOut, Parameter1, Parameter2 );

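This is most likely not a compiler shenanigan so much as the ABI’s rule for return values: a structure too large to come back in RAX or XMM0 (an XMMATRIX is 64 bytes) is written through a pointer that the caller supplies as a hidden first argument, and that same pointer comes back in RAX.  Either way, it confuses the issue to no end if DirectXMath access is attempted from outside Visual Studio.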

Regarding function calls, it’s the same issue that rears its ugly head when calling methods: all the other parameters get bumped up by 1 parameter if and when that undocumented parameter is present.  Which functions do this can be discovered by perusing the inline code that defines them, but who has the time for doing that with every single function?  If your language doesn’t have built-in COM support, you won’t have a choice.
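The hidden output pointer can be sketched in C++ by writing the same function twice: once the way it appears in source, and once the way the call looks at the binary level.  Matrix4, MakeScale and MakeScaleExplicit are hypothetical names invented for this illustration (Matrix4 merely has the same 64-byte size as an XMMATRIX), and the second form is an approximation of the compiler’s transformation, not its actual output.

```cpp
// Hypothetical 64-byte matrix type standing in for XMMATRIX.
struct Matrix4 { float m[16]; };

// Source-level form: the matrix appears to be returned by value.
inline Matrix4 MakeScale(float s) {
    Matrix4 r{};                       // all 16 elements zeroed
    r.m[0] = r.m[5] = r.m[10] = s;     // scale on the diagonal
    r.m[15] = 1.0f;
    return r;
}

// Roughly what the call looks like at the binary level: the caller
// passes the destination as an extra leading argument (parameter 0),
// the real argument is bumped to parameter 1, and the destination
// pointer is handed back as the return value.
inline Matrix4* MakeScaleExplicit(Matrix4* out, float s) {
    *out = Matrix4{};
    out->m[0] = out->m[5] = out->m[10] = s;
    out->m[15] = 1.0f;
    return out;
}
```

Both forms leave the same 16 floats in the destination; the difference is only in who supplies the address and which parameter slot the scale factor lands in.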

From a perspective outside Visual Studio, the whole of DirectX (which means, very possibly, everything else COM based) is clearly a chaotic, non-uniform cluster of confusion and inconsistency.  And there was no choice but to navigate through it.  While these issues don’t directly impact the 64-bit calling convention, mentioning them underscores the sometimes radical inconsistency that can be, and often is, encountered when trying to comply with the convention in the real world.

To INT, or Not to INT?

__notsofastcall mandates that floats get put into XMM registers.  But how?  XMM0 for the first float, regardless of where it appears in the parameter list?  I was unable to find an answer to this question, although I have to concede that I didn’t read, in depth, every source that a search returned because it seemed to me that everybody likes to start blogs and articles in the middle of the learning curve, faithfully targeting a core audience that wouldn’t be reading them in the first place because those people already knew what they were doing.

As it turns out, the following table applies:

                Parameter    Int     Float
                0            RCX     XMM0
                1            RDX     XMM1
                2            R8      XMM2
                3            R9      XMM3

The table above is adhered to always for the first four parameters.  For each parameter above, the int or float columns are the only options for where to assign each parameter value.  If (for example) parameter 2 is a float and the others are not, then XMM0, XMM1 and XMM3 are left alone.  Parameter 2 goes into XMM2 and that is that; parameters 0, 1, and 3 are placed in RCX, RDX, and R9, respectively.  The data to be passed only occupies the low 64 bits of the XMM register, or possibly the low 32 bits (as is the case with DirectX and its use of single-precision floats).  If parameters 1 and 3 are floats, they go into XMM1 and XMM3; XMM0 and XMM2 are not used for the call.
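The position rule is easy to get wrong, so here is a small C++ sketch that spells it out.  ArgKind and AssignArgs are hypothetical helpers written for this article only; they model nothing more than the rule described above, and report anything past the fourth parameter simply as "stack".

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class ArgKind { Integer, Float };

// For each argument, report where it is passed. The register is chosen
// by the argument's POSITION in the list; its type only selects between
// the integer and the XMM register for that position.
inline std::vector<std::string> AssignArgs(const std::vector<ArgKind>& args) {
    static const char* const kIntRegs[4]   = { "RCX",  "RDX",  "R8",   "R9"   };
    static const char* const kFloatRegs[4] = { "XMM0", "XMM1", "XMM2", "XMM3" };

    std::vector<std::string> where;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (i < 4) {
            where.push_back(args[i] == ArgKind::Float ? kFloatRegs[i]
                                                      : kIntRegs[i]);
        } else {
            where.push_back("stack");  // fifth parameter onward
        }
    }
    return where;
}
```

Feeding it (int, int, float, int) gives RCX, RDX, XMM2, R9: the lone float lands in XMM2 because it is parameter 2, not because it is the first float.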

Stacking the Deck

Setting up the stack for a call is a time loss, all things being relative.  The stack lives in memory, and memory access costs.  That price goes way up if you try writing a 64-bit value to a location that is not properly aligned (on an 8-byte boundary).  Windows will keep the stack properly aligned, initially, but what’s placed on it (thereby modifying RSP) by the time your app begins executing is normally beyond that app’s control.

The “red zone” is a formally declared 128-byte area below RSP that’s guaranteed not to be decimated by signal and interrupt handlers.  Note, however, that it belongs to the System V AMD64 ABI used by Linux and other Unix-like systems; the Windows x64 convention defines no red zone, and anything below RSP must be treated as volatile at all times.  Where a red zone does exist, a leaf function (one that calls no other functions) can safely use it as work space without having to adjust RSP before or after use.  Since the very act of making a call precludes the caller from being a leaf function, a caller cannot feasibly take advantage of the red zone unless that space is used only between function calls and is assumed to be volatile during any function call (the CALL instruction itself writes the return address just below RSP).

When a call is made, the caller has to reserve stack space (completely redundantly for the first four parameters) for the data being passed.  The nearest rationale I could find, going by what’s written about this, says that shadowing the first four parameters on the stack “might happen” within a called function, so every call must accommodate that eventuality.  It’s not too far removed from the concept of the ever-annoying handicapped parking space that might actually be used once per decade, and when it is used, it’s usually by a driver who’s not actually handicapped but has convenient access to the vehicle with the blue permission slip hanging from the rear view mirror.  So, for all the rest of the time, nobody can park there. 

There is no stack adjustment on a return.  The caller must undo whatever changes it made to the stack after each call returns.

By the time the called function begins executing, the stack holds the call’s return address (at [RSP], pushed by the CALL instruction itself), space for parameters 0 through 3 (at [RSP+8] through [RSP+32]), and whatever parameters beyond the first four are being passed (starting at [RSP+40]).  Any parameters beyond the first four must be placed on the stack by the caller before the call is made, after leaving space for the first four.  The caller is not required to place the first four parameters on the stack, but must reserve space for them.

The figure below shows the stack layout for a 6-parameter call to function XMMFoo:

Figure 1. Stack Layout for 6-Parameter Call
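The offsets in that layout can be computed rather than memorized.  The sketch below uses two hypothetical helpers, ParamOffsetAtEntry and CallAreaBytes, written only for illustration.  One thing in it goes beyond what this article covers: the reserved area is rounded up to a multiple of 16 bytes, reflecting the convention’s requirement that RSP be 16-byte aligned at the point of a call, and it assumes RSP was already aligned before the reservation.

```cpp
#include <cstddef>

// Offset from RSP, as seen by the callee on entry, of the slot that
// belongs to parameter `index`. [RSP+0] holds the return address, so
// parameter 0's slot is at +8, parameter 1's at +16, and so on. For
// parameters 0..3 the slot is reserved space only; the values
// themselves travel in registers.
inline std::size_t ParamOffsetAtEntry(std::size_t index) {
    return 8 + 8 * index;
}

// Bytes the caller sets aside below its own frame before the CALL:
// one 8-byte slot per parameter, never fewer than four slots.
// ASSUMPTION (not from the article): rounded up to a multiple of 16 so
// that an already aligned RSP stays 16-byte aligned at the call.
inline std::size_t CallAreaBytes(std::size_t paramCount) {
    const std::size_t slots = paramCount < 4 ? 4 : paramCount;
    const std::size_t bytes = slots * 8;
    return (bytes + 15) & ~static_cast<std::size_t>(15);
}
```

For the 6-parameter call, CallAreaBytes(6) comes to 48 bytes, and the callee finds parameters 4 and 5 at [RSP+40] and [RSP+48].  A 2-parameter call still reserves 32 bytes, which is the redundancy complained about above.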

Normally, loading packed values from memory into an XMM register requires that the source memory location be 16-byte aligned; failure to do this raises an exception.  (And now it’s soapbox time: I’m not sure where the term “throw” came from regarding exceptions, but somehow it came into being as the ultimate commonly-used term.  Intel’s CPU documentation has always referred to exceptions as being “raised” (as in, raising a flag) or “generated.”  I have never heard Intel use the term “throw,” as there is no direct relation between programming and sports.  They may or may not have jumped on the bandwagon some time after this silly “throw” term came into mainstream use.  Maybe another CPU manufacturer coined the term?)  However, the x64 architecture provides specific instructions for moving unaligned (not on a 16-byte boundary) packed data into an XMM register.  These can execute slower than their aligned cousins; internally, two memory accesses may have to occur to complete the transfer.  The entire issue is moot for the purposes of this article, as only scalar (single 64- or 32-bit) values are used.  For these, the 16-byte alignment requirement goes away.  Floats go into their little nests on the stack, wherever the proper location happens to be, 16-byte aligned or not.
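For readers who want to see the aligned/unaligned distinction from the C++ side, here is a minimal sketch using the standard SSE intrinsics.  Sum4Unaligned and LoadScalar are hypothetical helper names; the intrinsics themselves (_mm_loadu_ps, _mm_store_ps, _mm_load_ss, _mm_cvtss_f32) are the ordinary ones from the compiler’s intrinsic headers, and the sketch assumes an x64 target.

```cpp
#include <immintrin.h>

// Load four packed floats from an address that need NOT be 16-byte
// aligned, then add them up. _mm_loadu_ps is the unaligned packed load;
// the aligned form, _mm_load_ps, would fault on a misaligned address.
inline float Sum4Unaligned(const float* p) {
    __m128 v = _mm_loadu_ps(p);
    alignas(16) float out[4];
    _mm_store_ps(out, v);          // aligned store: `out` is 16-byte aligned
    return out[0] + out[1] + out[2] + out[3];
}

// Scalar load: a single 32-bit float goes into the low lane of an XMM
// register. No 16-byte alignment is required for this.
inline float LoadScalar(const float* p) {
    return _mm_cvtss_f32(_mm_load_ss(p));
}
```

Given a 16-byte-aligned buffer buf, Sum4Unaligned(buf + 1) reads from an address that is off the boundary by 4 bytes and still works, which is exactly the case the aligned load would reject.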


The __fastcall convention isn’t going away any time soon, so it has to be dealt with whether you love it, hate it, or don’t care.  If you live in Visual Studio and its family of languages, or some other language that understands all the nuances of 64-bit calling, you won’t need to concern yourself with the details therein.  However, if you’re among the less fortunate who have to manually adjust every call for the sometimes wild requirement deviations of each function being called, you’ll have to look very closely at each individual function.  Don’t assume you have it right if you’re not completely sure.  Look for those undocumented parameters; Microsoft seems to love them.  Getting the parameters wrong doesn’t always result in an outright crash.  Sometimes you just get bad data back and you never directly know that anything is wrong.  I experienced this one day before creating this article, and that experience is what triggered its writing: the DirectX method to clear the depth stencil buffer is so simple that it was one of only two functions I skipped in-depth scrutiny of during a 16-hour marathon debugging session to figure out why my cubes were not rendering.  It couldn’t be that; that call is just too simple and straightforward.  It was that.  Never assume.  Verify, verify, verify.  If you’re working in an “outsider” language and/or environment, you have to double check everything; you must go poking your nose into places other developers would never bother with.

The trend of what to me is internal chaos seemed to begin with Windows 8 and, in my opinion, it accelerates daily, with bad going to worse constantly.  If you’re not a Visual Studio devotee, and you’re not using something equally well suited to working with any number of MS platforms, your development life isn’t likely to get any easier any time soon.  You have to either adapt to what’s out there, move to Visual Studio, or get out of development completely.  Eventually, a pattern to the apparent internal anarchy in MS platforms will present itself and you’ll be able to predict far more than you’ll need to research.  But you have to put in your time and build experience to get there.