The Book of Gehn

Behind memcpy/memmove: overlap, words and alignment

September 21, 2024

How do memcpy/memmove work?

I ended up asking that question a few days ago while I was implementing the copy between sections of a file for the xoz library.

memcpy and memmove deal with memory, not with files, but I imagined that I could borrow some ideas anyway.

And I did.

memcpy copies bytes from a src buffer into another one that does not overlap (both buffers are in different parts of the memory); memmove does the same thing, copying from src to dst, but carefully handling the situation where they do overlap.

Under the hood, both functions are implemented in the same way. But why does the stdlib have 2 separate functions then?

When doing the “copy” between overlapping buffers, you will end up overwriting part of the src content.

Well, quantum information cannot be copied without destroying the original one. It is called teleportation.

I guess physicists are more sci-fi inclined when picking terms. It is obvious to me that the Universe is just calling memmove.

Imagine wanting to copy src and getting src corrupted; copies don’t work like that! It makes sense to name the function memmove because it feels like the content is moved from src to dst (so you don’t care what is left in src), and to leave memcpy for when it is truly a copy.
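As a tiny illustration with the standard functions (nothing about the internals yet), shifting a few bytes within the same buffer is exactly the overlapping case where memmove is needed and memcpy is not allowed:

#include <string.h>
#include <stdio.h>

int main(void) {
    char buf[] = "abcdef";
    /* src (buf) and dst (buf + 1) overlap: memmove handles it,
       memcpy here would be undefined behaviour */
    memmove(buf + 1, buf, 4);
    printf("%s\n", buf);   /* prints "aabcdf": part of src got overwritten */
    return 0;
}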

For the rest of the post I will refer to memcpy for simplicity in the wording, even if we are going to see how to handle overlapping buffers like memmove does.

Copy one byte at a time - bytecopy

Let’s start with the most obvious algorithm: copy one byte at a time.

#include <stddef.h>   /* for size_t */

void naivecopy(char* dst, char* src, size_t sz) {
    while (sz) {
        *dst = *src;   /* copy a single byte... */
        ++dst;         /* ...and advance both pointers */
        ++src;
        --sz;
    }
}

When we copy between two non-overlapping areas, everything goes smoothly, but when they do overlap, this doesn’t work.

The problem is that when we do the write, we are overwriting data that we haven’t copied yet.

So we just need to start copying from the overlapping bytes before they are overwritten.

There are 2 cases: either dst is at a lower address than src, so we must copy forward (first byte first), or dst is at a higher address than src, so we must copy backward (last byte first).

Let’s call these bytecopy_fwd and bytecopy_bwd respectively, and bytecopy the main algorithm that calls one or the other depending on the overlap.
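A minimal sketch of the three pieces (the names follow the post; the pointer comparison goes through uintptr_t since comparing unrelated pointers directly is not strictly portable C):

#include <stddef.h>
#include <stdint.h>

void bytecopy_fwd(char* dst, const char* src, size_t sz) {
    for (size_t i = 0; i < sz; ++i)
        dst[i] = src[i];            /* first byte first: safe when dst is below src */
}

void bytecopy_bwd(char* dst, const char* src, size_t sz) {
    while (sz--)
        dst[sz] = src[sz];          /* last byte first: safe when dst is above src */
}

void bytecopy(char* dst, const char* src, size_t sz) {
    if ((uintptr_t)dst < (uintptr_t)src)
        bytecopy_fwd(dst, src, sz);
    else
        bytecopy_bwd(dst, src, sz);
}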

Copy multiple bytes at a time - wordcopy and blockcopy

The term “word” is misleading: a 64-bit architecture may support operands larger than its word size, like uint128_t. So don’t take this term too strictly.

AVX instructions can these days work with 512 bits but their usefulness for copying is not so clear according to this post. While they can copy data faster, they may degrade the performance of the system due to how they are implemented (other post).

At the end of the day it is a per-architecture analysis: just pick the largest and fastest word size.

Instead of copying byte by byte we can do it word by word. Modern computers can operate on uint32_t, uint64_t, or even uint128_t in a single instruction, so a memcpy implementation using them can go 4, 8 or even 16 times faster.
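As a sketch of the idea (assuming for now that both buffers are word-aligned, non-overlapping, and that the size is an exact multiple of the word size; wordcopy_core is my own name):

#include <stdint.h>
#include <stddef.h>

void wordcopy_core(uint64_t* dst, const uint64_t* src, size_t sz) {
    for (size_t i = 0; i < sz / sizeof(uint64_t); ++i)
        dst[i] = src[i];   /* 8 bytes per iteration instead of 1 */
}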

To be precise, glibc uses vm_copy, which maps the pages of dst onto the pages of src as copy-on-write, so vm_copy defers the copy until a page fault.
When copying large buffers with infrequent writes, this lazy copy avoids the cost of a real copy.

Reference: developer apple docs

And if supported by the architecture, we could use even larger chunks or blocks. glibc, for example, has PAGE_COPY_FWD_MAYBE, which copies entire pages: a few kilobytes in a single operation!

We may call these algorithms wordcopy and blockcopy respectively.

If we are talking about uint64_t or similar word sizes, all the implementations that I saw don’t try smaller word sizes for the remainder; they fall back immediately to bytecopy.

However, some implementations, like glibc’s memcpy, copy by page sizes (a few kilobytes), then copy by word and finally copy the last bytes one at a time.
And it makes sense: between copying a few kilobytes per operation and copying one byte at a time, there is room for a more efficient copy in between.

When dealing with input sizes that are not a multiple of the word/block size, we need to copy by smaller sizes towards the end, copying byte by byte in the last steps.
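A sketch of that tail handling, reusing the bytecopy and wordcopy_core sketches from above (still assuming aligned, non-overlapping buffers):

void wordcopy(char* dst, const char* src, size_t sz) {
    size_t body = sz - (sz % sizeof(uint64_t));   /* largest multiple of the word size */

    /* the bulk goes word by word... */
    wordcopy_core((uint64_t*)dst, (const uint64_t*)src, body);

    /* ...and the remaining few bytes, one at a time */
    bytecopy(dst + body, src + body, sz - body);
}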

Overlapping areas

To deal with overlapping areas we have the same forward/backward algorithms as before, but copying word by word, so let’s call them wordcopy_fwd and wordcopy_bwd respectively.

But…

In most architectures, reading/writing words is only possible from/to addresses aligned to the word size, so the pathological example should never happen.

However, as mentioned before, relax the term “word”. In xoz, for example, the copy between parts of a file does not require any alignment, so I did face this very-overlapped case.

When the source and destination areas are very overlapped, it may happen that the read of a word overlaps with its write (left diagram). The solution is to copy through a local variable in between (right diagram).
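A sketch of that for a single word (memcpy into a small local variable is the portable way to express an unaligned word access; the names are mine):

#include <stdint.h>
#include <string.h>

static void copy_one_word(char* dst, const char* src) {
    uint64_t tmp;
    memcpy(&tmp, src, sizeof(tmp));   /* read the whole word first... */
    memcpy(dst, &tmp, sizeof(tmp));   /* ...then write it, even if dst overlaps src */
}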

Alignment

In general, copying words/blocks requires that both the source and the destination are aligned to the word/block size.

Eventually src and dst get aligned

If we are lucky, we could have the src and dst addresses misaligned but by the same amount. This is the same as saying that both addresses share the same \(N\) lower bits, where \(N = \log_2(\mathrm{wordsize})\), or

(src & (wordsize - 1)) == (dst & (wordsize - 1))

If that is the case we can call bytecopy until both get aligned and then go full speed copying by words/blocks.
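A sketch of that, reusing the bytecopy and wordcopy sketches from before (the uintptr_t cast is only there to do arithmetic on the address):

#include <stdint.h>
#include <stddef.h>

void copy_same_misalignment(char* dst, const char* src, size_t sz) {
    const size_t wordsize = sizeof(uint64_t);
    /* bytes until dst (and src, since they share the same lower bits)
       reaches the next word boundary */
    size_t head = (wordsize - ((uintptr_t)dst & (wordsize - 1))) & (wordsize - 1);
    if (head > sz)
        head = sz;

    bytecopy(dst, src, head);                      /* align both pointers... */
    wordcopy(dst + head, src + head, sz - head);   /* ...then go full speed */
}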

Only dst eventually gets aligned

As before, we start by calling bytecopy until dst is aligned, but src is not necessarily aligned.

Consider the following:

Take the src address and compute:

int sh1 = (uintptr_t)src % wordsize;   /* how misaligned src is, in bytes */
int sh2 = wordsize - sh1;

In this example, src is at address 3, so with a wordsize of 4 bytes, sh1 is 3 and sh2 is 1.

sh1 and sh2 are how many bytes we should shift a word to align it, either towards lower addresses or towards higher addresses.

Now the trick: for writing 1 word into dst, we read 2 words from src, combine (or merge) them and write that into dst.

This is valid only for big endian architectures. For little endian, invert the shift directions.

Reference: glibc’s MERGE

The combine (also called merge) of two words takes the left word (lower address) and shifts it by sh1 to the left; takes the right word (higher address) and shifts it by sh2 to the right; and finally bitwise-ORs them and writes the result into dst.

Note how the word-aligned read is right-shifted 1 byte, making room for the single byte from the previous word; combined, the word is formed and written.

The remaining byte is not thrown away but used to build the next word to write.
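A sketch of that merge, in its big-endian form (the name merge is mine, mirroring glibc’s MERGE macro; sh1 and sh2 are the byte counts computed above, multiplied by 8 because C shifts count bits):

#include <stdint.h>

/* lo is the aligned word at the lower address, hi the next aligned word.
   Assumes 0 < sh1 < wordsize, so neither shift is by the full word width. */
static uint64_t merge(uint64_t lo, uint64_t hi, int sh1, int sh2) {
    return (lo << (sh1 * 8)) | (hi >> (sh2 * 8));   /* big-endian shift directions */
}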

The first loop iteration is slightly special because we cannot read a full word (we cannot read behind the initial src pointer).

For the rest of the loop iterations we don’t need any padding and work with entire words.

The shift and or operations are not free, so this memcpy implementation is slower than the one with aligned inputs but still much faster than a bytecopy.

The key insight is that while we are reading 2 words per write, we can hold in a local variable the word to be used in the next iteration, so we are not reading the whole src buffer twice.

Here are glibc’s implementations.

Let’s call this algorithm wordcopy_dst_aligned and, like before, it has two flavours, wordcopy_dst_aligned_fwd and wordcopy_dst_aligned_bwd, for handling the overlap.
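A rough sketch of the forward flavour using the merge from above (my own naming and simplifications, not glibc’s code): the caller deals with the first and last partial words and guarantees that one extra aligned word can be read; each iteration then does a single aligned read and keeps it around for the next one.

static void wordcopy_dst_aligned_fwd(uint64_t* dst, const uint64_t* src_w,
                                     size_t nwords, int sh1, int sh2) {
    /* src_w points to the aligned word holding the first source byte;
       assumes 0 < sh1 < sizeof(uint64_t) (big-endian shift directions) */
    uint64_t prev = *src_w++;                  /* word read before the loop */
    while (nwords--) {
        uint64_t next = *src_w++;              /* the only read of this iteration */
        *dst++ = merge(prev, next, sh1, sh2);  /* combine the 2 words, write 1 */
        prev = next;                           /* reused in the next iteration */
    }
}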

My takeaways

I didn’t know the combine / merge trick or the possibility of lazily copying entire pages.

Taking the seemingly simple memcpy and memmove functions and looking at their implementations showed me that even a plain copy can hide some interesting tricks.

What are your takeaways?

Related tags: memory, memcpy, copy, bytes, xoz

Behind `memcpy`/`memmove`: overlap, words and alignment - September 21, 2024 - Martin Di Paola