Lessons Learnt on Refactoring Without Breaking
May 25, 2025
In xoz, every single piece of content data is associated with a descriptor. The content data may not fit in the descriptor structure, in which case it is placed somewhere else in the xoz file.
This content is referenced by the descriptor through an object called Segment
that knows how to recover the descriptor’s content.
I eventually realized that a single segment would not be enough: if a descriptor wanted to have multiple content parts, it should own multiple segments as well.
“I should replace Segment segm;
by Segment segm[];
and that’s all - a simple refactor - how hard could it be?”
Hell…
- the branch ended with 69 commits, squashed in a single commit
- it was 1 month of work
- 41 files were changed
- with 3014 insertions and 1479 deletions in total
It was hard, not because the change itself was difficult but because it was really hard not to break 300 tests with a single change.
This post is a summary of a few good lessons learnt.
The price of breaking backward compatibility
Let me show exactly what the situation was before the refactor.
The descriptor is stored on disk as follows:
struct descriptor_t {
uint16_t {
uint own_content : 1; // mask: 0x8000
uint lo_isize : 5; // mask: 0x7c00
uint has_id : 1; // mask: 0x0200
uint type : 9; // mask: 0x01ff
};
/* present if own_content == 1 */
uint16_t {
uint large : 1; // mask: 0x8000
uint lo_csize : 15; // mask: 0x7fff
};
/* present if own_content == 1 and large == 1 */
uint16_t hi_csize;
/* present if own_content == 1 */
struct segment_t segm;
};
It is obvious that if I wanted to replace the single segm
with an array, I needed to store on disk how many segments the array has, how many “parts” the content has: a content_part_cnt
counter.
The problem is that xoz
has almost 500 tests and most of them check, directly or indirectly, exactly how a descriptor is written to disk.
Adding a new field would break a lot of them, if not all.
Even if some tests don’t check what is written, they compute a checksum of the structure. Adding content_part_cnt
will make the checksum differ and, again, more tests would break.
By that time xoz
was not released to the public, so changing the file format in such a non-backward-compatible way was fine… except for me!
Lesson 1: minimize the changes
The own_content
bit signals whether the descriptor has a segment or not. With multiple content parts (multiple segments), that bit would signal that there is at least one segment.
This means that one segment is always implicit when own_content
is set. A content_part_cnt
does not have to count how many parts there are, but how many there are beyond the implicit one. A content_part_cnt
of zero would mean that there is exactly one content part.
So what?
xoz
uses the same checksum that the IP protocol uses: it is not very robust but it can be computed easily and, in particular, adding two zero bytes does not change the checksum.
So for all my tests that have a single segment, content_part_cnt
will be zero and for such tests, the checksum will not change.
I cannot avoid fixing the tests that check exactly what is written to disk, but I can prevent breaking the ones that only check the checksum.
This is how the structure looks now:
struct descriptor_t {
uint16_t {
uint own_content : 1; // mask: 0x8000
uint lo_isize : 5; // mask: 0x7c00
uint has_id : 1; // mask: 0x0200
uint type : 9; // mask: 0x01ff
};
/* present if own_content == 1 */
uint16_t content_part_cnt;
/* present if own_content == 1 */
struct content_part_t cdata_entries[1 + content_part_cnt];
};
Another detail: the large
, lo_csize
and hi_csize
fields define the size of the content part, so now they are bundled together with their segment segm
. The content_part_t
has the same layout as before so as not to break more tests.
struct content_part_t {
uint16_t {
uint large : 1; // mask: 0x8000
uint lo_csize : 15; // mask: 0x7fff
};
/* present if large == 1 */
uint16_t hi_csize;
struct segment_t segm;
};
Lesson 2: if you break too much, stop and step back
Once I resolved the disk format, it was time to refactor the in-memory structures and run the tests.
But when I tried to do the much larger change in the code base, hell, too many broken tests! Many more than I anticipated.
I faced this situation several times and the lesson that I learnt was “if you break too much, stop and step back”.
No, it was not a bug; it was a symptom that the refactor was too disruptive to be done in a single changeset.
Starting multiple sub-fixes and sub-refactors, all at once, while the tests were broken was a no-go. Doing so is like driving blind.
Step back and try something smaller…, even if you are going to throw it away later.
Lesson 3: do incomplete throw-away refactors, keep all tests passing
Step back and try something smaller: small, incomplete refactors, even if you are going to throw them away later.
For example, I only modified the read-from-disk and write-to-disk functions and kept the rest of the code using only the first segment loaded.
That allowed me to test the read/write of the new content_part_cnt
and fix the tests, being 99% sure that if a test broke it was because I added two zero bytes (content_part_cnt
) and nothing else.
I knew that part of the code I wrote was temporary and not part of the real refactor, but doing these intermediate changes allowed me to keep the tests passing all the time (after a reasonable amount of fixing).
But eventually I got stuck.
Lesson 4: if decoupling is hard, isolate first (with spies)
In xoz
, the Descriptor
objects are managed by DescriptorSet
sets.
These sets are unordered collections of descriptors, not just in memory but also on disk. There is an intimate relationship between the two in order to operate efficiently. In fact, DescriptorSet
is a friend
class of Descriptor
.
And this is a problem.
Changing the internals of Descriptor
made the logic in DescriptorSet
change in unexpected ways.
But why? Because of life.
And because DescriptorSet
was developed much earlier. Back then the Descriptor
class didn’t offer any public API to manage the content segment, and besides, by using the internals, DescriptorSet
could do things that were hard to get done otherwise.
Time passed, and Descriptor
acquired a simplified, high-level API for managing its content. Great for any subclass of it but too limited for DescriptorSet
.
And to make it worse, DescriptorSet
used its own self-managed content segment (to store the descriptors in the set) bypassing the public interface of Descriptor
to interact with the content.
The following is an oversimplification of the code:
class Descriptor {
private:
friend class DescriptorSet;
// old, single-part content logic (but still the one working by now)
uint32_t csize;
Segment segm;
// new, multi-part content logic (loaded from disk, but not in use)
std::vector<content_part_t> cparts;
};
class DescriptorSet : public Descriptor {
private:
uint32_t own_csize;
Segment own_managed_segm;
void update() {
// attention! DescriptorSet is overriding Descriptor's (old) internal state!
this->segm = own_managed_segm;
this->csize = own_csize;
}
};
Before trying to do any change, I needed to isolate what DescriptorSet
knew of Descriptor
. You may think, “ok, just add a bunch of public set()
and get()
methods to Descriptor
and make DescriptorSet
use them instead of accessing the internals”.
But no, adding just set()
and get()
would lead to a bad design. And worse, those public methods would be callable by anybody, not just DescriptorSet
.
My approach was to use spies: classes that are friend
of Descriptor
and implement a small subset of the functionality needed by DescriptorSet
making DescriptorSet
no longer a friend.
class Descriptor {
private:
friend class DSpy; // DescriptorSet is no longer a friend
// old, single-part content logic (but still the one working by now)
uint32_t csize;
Segment segm;
// new, multi-part content logic (loaded from disk, but not in use)
std::vector<content_part_t> cparts;
};
class DSpy {
private:
Descriptor& dsc;
public:
explicit DSpy(Descriptor& dsc) : dsc(dsc) {}
void update_segm(const Segment& segm, const uint32_t csize) const {
// internals access, to the old logic
dsc.segm = segm;
dsc.csize = csize;
}
};
class DescriptorSet : public Descriptor {
private:
uint32_t own_csize;
Segment own_managed_segm;
void update() {
// No longer accessing anything private
DSpy(*this).update_segm(own_managed_segm, own_csize);
}
};
It may not seem like much, I’m still exposing the internals, but:
- I have mapped exactly which internals I’m leaking: they are in the DSpy methods, a much smaller place to look. For comparison, DSpy has 3 methods of 1 line each and DescriptorSet has ~40 methods with a total of more than 1000 lines of code.
- The use of spies allows me to be selective about which internals I want to leak: I have a small DSpy class for DescriptorSet and another, larger DSpyForTesting for testing, but nobody is a direct friend of Descriptor so nobody has access to the entire internals.
Lesson 5: chop-chop, one small bite at a time
With the spies in place, I took each of their methods, one by one, and started figuring out how to remove it, making the necessary changes in DescriptorSet
and fixing the tests, if required.
I started with a DSpy
with ~8 non-trivial methods and today I have reduced it to just 2 one-liners. Each method removed, each simplification, led to a smaller, much easier to understand interaction between Descriptor
and DescriptorSet
.
Only then did I take the leap to replace every use of the old single-part implementation with the new one. The real refactor.
The result?
- the branch ended with 69 commits, squashed in a single commit
- it was 1 month of work
- 41 files were changed
- with 3014 insertions and 1479 deletions in total
Summary
Try to break as little as possible.
If you break too much, stop and step back.
Incomplete throw-away code is OK; keep your tests passing, don’t go in blind.
Decouple, and if that is hard, isolate first.
Split your refactor into the smallest chunks possible and go one by one.