vigoo's software development blog

desert part 1 - features

Posted on February 19, 2024

Introduction

This is the first part of a series of blog posts about my serialization library, desert. I also gave an overview of this library at Functional Scala 2022 - you can check out the talk on YouTube if interested.

In this post I'm going to give an overview of the features this serialization library provides, and then dive into the details of how it supports evolving data types.

Where is it coming from?

The idea of creating desert came after some serious disappointment in our previously chosen serialization library. We used it to serialize both persistent Akka actors and distributed actor messages, and it turned out that simply updating the Scala version from 2.12 to 2.13 completely broke our serialization format.

None of the alternatives looked good enough to me - I wanted something that is code first and fits well with our functional Scala style. Support for multiple platforms or programming languages was not a requirement.

So I started thinking about what a perfect serialization library would look like, at least for our use cases. It would be something with first-class support for ADTs and for Scala's collection libraries (I don't want to see Scala lists serialized via Java reflection ever again!), with a focus on supporting evolution of the serialized data types. We knew that our persisted data and actor messages would change over time, and we had to be able to survive these changes without any downtime.

Features

Let's just go through all the features provided by the library before we talk about how exactly it supports these kinds of changes in the serialized data structures.

desert is a Scala library. As probably expected, it captures the core concept of binary serialization through a simple trait called BinaryCodec[T]:

trait BinarySerializer[T] {
  def serialize(value: T)(implicit context: SerializationContext): Unit
  def contramap[U](f: U => T): BinarySerializer[U] = // ...
  def contramapOrFail[U](f: U => Either[DesertFailure, T]): BinarySerializer[U] = // ...
}

trait BinaryDeserializer[T] {
  def deserialize()(implicit ctx: DeserializationContext): T
  def map[U](f: T => U): BinaryDeserializer[U] = // ...
  def mapOrFail[U](f: T => Either[DesertFailure, U]): BinaryDeserializer[U] = // ...
}

trait BinaryCodec[T] extends BinarySerializer[T] with BinaryDeserializer[T]

These BinaryCodec instances should be made implicitly available for each type we need to serialize. There are multiple ways to create an instance of a binary codec: the library provides codecs for primitives and collections out of the box, new codecs can be built from existing ones with map and contramap, they can be derived for case classes and sealed trait hierarchies (and for newtype-style wrappers with deriveForWrapper), or they can be written completely by hand.

Under the hood there is a simple, extensible BinaryInput / BinaryOutput abstraction, implemented by default for Java's InputStream and OutputStream.
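On top of all this, using a codec at the call site looks roughly like the following sketch - the top-level helper names (serializeToArray, deserializeFromArray) are written from memory and may differ between versions:

import io.github.vigoo.desert._

// Serialize a value using its implicit BinaryCodec, then read it back;
// both steps report problems as DesertFailure values instead of throwing.
val bytes: Either[DesertFailure, Array[Byte]] = serializeToArray(42)
val back: Either[DesertFailure, Int] = bytes.flatMap(deserializeFromArray[Int])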

On the lowest level, in addition to having an interface for serializing primitive types we also have support for variable length integer encoding and for gzip compression. Custom codecs can also use the built-in string deduplication feature, and encode cyclic graphs using support for storing references.
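To illustrate the variable-length integer encoding, here is a minimal sketch of a zigzag plus 7-bit-per-byte scheme. This is my own sketch, not desert's actual implementation, but it is consistent with the byte values shown later in this post (3 becomes 6, 4 becomes 8, 8 becomes 16, -1 becomes 1):

// Zigzag + continuation-bit variable-length integer encoding (illustrative only)
def writeVarInt(value: Int, out: java.io.OutputStream): Unit = {
  var v = (value << 1) ^ (value >> 31) // zigzag: small negative numbers stay small
  while ((v & ~0x7f) != 0) {
    out.write((v & 0x7f) | 0x80)       // lower 7 bits with the continuation bit set
    v = v >>> 7
  }
  out.write(v)                         // last byte, continuation bit clear
}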

Sometimes you want to serialize only part of your data structure - a real-world example we had was a set of typed actor messages where only a subset of the cases was designed to be sent between different nodes. Some cases were only used locally, and in those we would store things that are not serializable at all - for example, open websocket connection handles. desert supports this through the concepts of transient fields and transient constructors.
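A sketch of what such a message type could look like - note that the annotation name (@transientConstructor) is my assumption here, and there is a similar annotation for marking individual fields transient; check the documentation for the exact API:

// A stand-in for something inherently non-serializable, like an open connection
final class WebSocketHandle

sealed trait ClientMessage
object ClientMessage {
  // Sent between nodes, must be serializable
  final case class Broadcast(text: String) extends ClientMessage

  // Only used locally; the annotation tells the codec derivation to skip this case.
  // The annotation name is an assumption on my part.
  @transientConstructor
  final case class LocalOnly(connection: WebSocketHandle) extends ClientMessage
}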

What if a field is not an ADT but contains a reference to an arbitrary type with a given interface? Or if we don't know the root type of a message, only a set of possible types which are otherwise unrelated? The library provides a type registry for this purpose. Every type registered into this will have an associated identifier, and in places where we don't know the exact type, we can use these to get the codec by its unique ID from the type registry.
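As an illustration, building such a registry for a couple of hypothetical, otherwise unrelated message types could look roughly like this - the builder names (DefaultTypeRegistry, register, freeze) are written from memory and may not match the current API exactly:

final case class UserCreated(name: String)
final case class UserRenamed(from: String, to: String)

object UserCreated {
  implicit val schema: Schema[UserCreated] = DeriveSchema.gen
  implicit val codec: BinaryCodec[UserCreated] = DerivedBinaryCodec.derive
}
object UserRenamed {
  implicit val schema: Schema[UserRenamed] = DeriveSchema.gen
  implicit val codec: BinaryCodec[UserRenamed] = DerivedBinaryCodec.derive
}

// Each registered type gets the next sequential identifier
val typeRegistry = DefaultTypeRegistry()
  .register[UserCreated]   // gets the first identifier
  .register[UserRenamed]   // gets the next one
  .freeze()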

On the top level desert also comes with a set of integration modules. The following modules are available at the time of writing:

There are two more modules which implement the same core functionality, codec derivation, with different tradeoffs: desert-shapeless and desert-zio-schema.

I wrote a detailed post about typeclass derivation a few months ago.

Data evolution

Let's see in detail what it means that desert supports evolving data structures.

Primitives vs newtype wrappers

Let's start with a simple example: we are serializing a single Int. The default codec just uses the fixed width 32-bit representation of the integer:

val x: Int = 100

results in:

0 0 0 100

Imagine that later we decide that Int is just too generic, and what we have here is in fact a Coordinate. We can define a newtype wrapper like the following:

final case class Coordinate(value: Int) extends AnyVal

and then define the binary codec either by using map and contramap on the integer codec, or by using the deriveForWrapper macro:

object Coordinate {
  implicit val codec: BinaryCodec[Coordinate] = DeriveBinaryCodec.deriveForWrapper
}

The binary representation of a Coordinate will be exactly the same as for an Int, so we are still fully backward and forward compatible regarding our serialization format:

val x: Coordinate = Coordinate(100)

results in:

0 0 0 100

Collections

First let's see what happens if we try to serialize a pair of coordinates:

val xy = (Coordinate(1), Coordinate(2))

results in:

0 0 0 0 1 0 0 0 2

The binary representation starts with a 0, which is an ADT header - we will talk about it later. The rest of the data is just a flat representation of the two coordinates, taking 9 bytes in total.

Now we start storing arrays of these coordinates:

val coordinates: Array[(Coordinate, Coordinate)] = 
  Array(
    (Coordinate(1), Coordinate(2)),
    (Coordinate(3), Coordinate(4)),
    (Coordinate(5), Coordinate(6))
  )

Arrays are serialized simply by writing the length of the array as a variable-length integer and then serializing all elements.

6 0 0 0 0 1 0 0 0 2 0 0 0 0 3 0 0 0 4 0 0 0 0 5 0 0 0 6

The variable-length integer encoding of 3 is 6 (lengths are stored with a zigzag-style encoding, which is also what keeps the -1 marker described below down to a single byte), and that is simply followed by the three 9-byte long serialized representations of the coordinate pairs.

What if we decide we don't want to use Array but ZIO's Chunk instead? Or if we realize our data model is more precise if we talk about a set of coordinate pairs? Nothing changes! desert uses the same encoding for all collection types, allowing us to always choose the best data type without worrying about breaking the serialization format.

For some collections, such as linked lists, there is no way to know the number of elements without iterating through the whole data set. desert supports these collection types by writing -1 as the number of elements, and then prefixing each element with a single byte: 1 means there is a next element, 0 means there isn't. This is exactly the same binary format as a series of Option[T] values where the first (and only) None represents the end of the sequence.
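My sketch of this "unknown length" framing (not desert's actual code) - write -1 as the length, prefix every element with 1, and close the sequence with a 0, structurally the same as a run of Some(_) values terminated by a None:

def writeUnknownSized[T](items: Iterable[T])(
    writeVarInt: Int => Unit,
    writeByte: Byte => Unit,
    writeElement: T => Unit
): Unit = {
  writeVarInt(-1)      // the size is not known without traversing the collection
  items.foreach { item =>
    writeByte(1)       // "there is a next element"
    writeElement(item)
  }
  writeByte(0)         // end marker, the same as serializing None
}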

Records

Maybe using tuples of coordinates was a good idea in the beginning, but as our data model evolves we want to introduce a named record type instead:

final case class Point(x: Coordinate, y: Coordinate)

We can use desert's codec derivation feature to get a binary codec for this type:

object Point {
  implicit val schema: Schema[Point] = DeriveSchema.gen
  implicit val codec: BinaryCodec[Point] = DerivedBinaryCodec.derive
}

When using desert-zio-schema we also need to derive a Schema instance - this is not required when using the desert-shapeless version of the codec derivation.

Let's see how desert serializes an instance of this Point type:

val pt = Point(Coordinate(1), Coordinate(2))

results in:

0 0 0 0 1 0 0 0 2

This is exactly the same as the tuple's binary representation, which probably isn't a big surprise as they are structurally equivalent. Still, this is an important property as it allows us to replace any tuple with an equivalent record type while keeping the binary format exactly the same!

If we have to change a record, we can only change a field's type if the new type's binary representation is compatible with the old one - all the cases described in this post are valid data evolution steps. Besides those, there are a few special types of changes desert supports for records. Let's see them!

Adding a field

As a next step let's imagine our data type requires a new field. Let's add a z coordinate to our point:

final case class Point(x: Coordinate, y: Coordinate, z: Coordinate)
object Point {
  implicit val codec: BinaryCodec[Point] = DerivedBinaryCodec.derive 
}

val pt = Point(Coordinate(1), Coordinate(2), Coordinate(3))

Serializing this pt value results in:

0 0 0 0 1 0 0 0 2 0 0 0 3

If we try to read this value with the deserializer of our original Point type, it will read Point(Coordinate(1), Coordinate(2)), but the next deserialized value will be corrupt as the input stream will point to the beginning of the 0, 0, 0, 3 value. Similarly, if we tried to read a binary serialized with the old Point serializer, it would read the next four bytes from the data stream which, if they exist at all, belong to some other serialized element.

The solution for this in desert is to explicitly document data evolution. This is done by listing each modification in an annotation called evolutionSteps:

@evolutionSteps(FieldAdded[Coordinate]("z", Coordinate(0)))
final case class Point(x: Coordinate, y: Coordinate, z: Coordinate)
object Point {
  implicit val codec: BinaryCodec[Point] = DerivedBinaryCodec.derive 
}

val pt = Point(Coordinate(1), Coordinate(2), Coordinate(3))

With this annotation, we mark z as a newly added field and provide a default value for it, which is used when reading an old version of the serialized data that did not have this field yet. Every time we change the data type, we record the change as a new element in this annotation. There are other supported evolution step types, as we will see soon.

But first let's see what changes in the binary representation of Point now that we added this annotation!

1 16 8 0 0 0 1 0 0 0 2 0 0 0 3

Now that we have an evolution step, the first byte, which was always 0 before, becomes 1. Every evolution step increases this value, which is interpreted as the type's version. For each ADT with a version other than 0, this version byte is followed by a list of the binary encoding of the evolution steps. Here 16 is the variable-length encoding of the value 8, which is the length of the "version 0" part of the data type. It is followed by 8, the variable-length encoding of the value 4, which represents the field added evolution step by encoding the newly added field's size.
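Broken down byte by byte:

1 - version: one evolution step
16 - variable-length encoding of 8, the size of the original "version 0" part (x and y)
8 - variable-length encoding of 4, the field added step with the new field's size
0 0 0 1 0 0 0 2 - x and y
0 0 0 3 - the new z field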

With this format when the old deserializer reads the point, it knows it needs to skip additional 4 bytes after reading the x and y coordinates. Also when the new deserializer encounters an old point, that binary data will begin with 0, so the deserializer is aware that it's an older version and can set the deserialized value's z coordinate to the provided default.

By documenting the data type change we get full forward and backward compatibility in this case. The cost is that instead of 13 bytes, now each Point takes 15 bytes.

Making a field optional

Another special data type change is making an existing field optional. Staying with the previous example we could change our Point type like this:

@evolutionSteps(
  FieldAdded[Coordinate]("z", Coordinate(0)),
  FieldMadeOptional("z")
)
final case class Point(x: Coordinate, y: Coordinate, z: Option[Coordinate])
object Point {
  implicit val codec: BinaryCodec[Point] = DerivedBinaryCodec.derive 
}

val pt = Point(Coordinate(1), Coordinate(2), None)

This of course can no longer guarantee full forward and backward compatibility - but it can be useful as an intermediate step in getting rid of some unused parts of the data model, while still being able to access them when they are available in older serialized data.

This evolution step is represented by a variable-length integer -1 in the ADT header. All positive values represent the field added case, with the actual value containing the size of the added field; -1 is a special marker for a field made optional, and it is followed by another variable-length integer encoding the position of the field that has been made optional. When serializing the Option field itself, the value is prefixed by a 1 if it is Some, or the whole option is serialized as a single 0 byte if it is None.

The total serialized record of the above example would look like this:

2 16 2 1 1 0 0 0 1 0 0 0 2 0

The first byte is now 2 as we have two evolution steps. The next one still defines that the original part of the data is 8 bytes long, and the third byte shows that this time the new z field takes only 1 byte (as it was set to None). The header now contains two more bytes, as described above: the first 1 marks that a field has been made optional, and the second identifies which field it was.
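Spelled out byte by byte:

2 - version: two evolution steps
16 - variable-length encoding of 8, the size of the original "version 0" part (x and y)
2 - variable-length encoding of 1, the field added step (z now takes a single byte)
1 - variable-length encoding of -1, marker for a field made optional
1 - identifies the field that was made optional (z)
0 0 0 1 0 0 0 2 - x and y
0 - z serialized as None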

This can still be loaded by the very first Point deserializer (or even as the coordinate pair tuple), as everything after the first two coordinates is skipped. It can also be loaded as a Point with a non-optional z coordinate, but only if the serialized data is a Some - so in the above example it would lead to a deserialization error. The change is fully backward compatible, so our latest deserializer can still load all the variants we have seen before.

Removing a field

The final special data evolution step supported by the library is removing a field completely. This is more limited than the previous ones though - backward compatibility is easy, as newer versions of the deserializer simply skip the removed fields. But forward compatibility is only possible if the removed field was an optional field - that's the only type for which desert can automatically provide a default value (None).

The binary header for removing a field needs to store the actual field name, because there is no other way to identify a field that is no longer present in the rest of the data. To make this more space-efficient, desert uses string deduplication and only serializes the field name once.
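The general idea behind string deduplication, as a sketch of my own (not desert's actual wire format): the first occurrence of a string is written in full and remembered under an ID, and later occurrences only need a small back-reference to that ID.

final class StringDeduplicationState {
  private val ids = scala.collection.mutable.HashMap.empty[String, Int]

  /** Left(id): already seen, write only a back-reference.
    * Right(id): first occurrence, write the full string and remember it. */
  def idOf(value: String): Either[Int, Int] =
    ids.get(value) match {
      case Some(existing) => Left(existing)
      case None =>
        val id = ids.size
        ids.update(value, id)
        Right(id)
    }
}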

Sum types

Scala 2 sealed trait hierarchies and Scala 3 enums are serialized with the same techniques mentioned above, but with a constructor ID prefixed to the binary. Constructor identifiers are assigned in order, as the constructors appear in the source code. This means that adding new constructors is backward and forward compatible, as long as they are added after the existing ones - otherwise the identifiers get rearranged and binary compatibility breaks.
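As an illustration (the types here are just an example), a sealed trait hierarchy gets its codec the same way as the record types above:

sealed trait Shape
object Shape {
  final case class Circle(radius: Coordinate) extends Shape               // first constructor
  final case class Rectangle(w: Coordinate, h: Coordinate) extends Shape  // second constructor
  // A new case is safe to add, but only after the existing ones,
  // so that the already assigned constructor IDs stay stable:
  final case class Triangle(a: Point, b: Point, c: Point) extends Shape

  implicit val schema: Schema[Shape] = DeriveSchema.gen   // needed when using desert-zio-schema
  implicit val codec: BinaryCodec[Shape] = DerivedBinaryCodec.derive
}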

Transients

It is possible to make a previously non-transient field transient and maintain binary compatibility. The rules are the same as for removing a field.

Type registry

As mentioned earlier, a type registry can be used to associate identifiers with types and then serialize arbitrary values using these identifiers. Maintaining the stability of this mapping is also very important when evolving data types. What if we want to delete a type that was added to the type registry because we never want to use it again - we have already migrated our serialized data and we are sure we will never encounter that ID during deserialization?

We still cannot simply remove the entry from the type registry, because that would shift all the following identifiers, as they are assigned sequentially. The library has a solution for this: it is possible to register empty placeholders where we previously had an actual type. This maintains the identifier order, but leads to a runtime error if that identifier is ever encountered during deserialization.
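Continuing the hypothetical registry example from the Features section, and assuming a method name along the lines of registerPlaceholder (the exact API may differ), this looks roughly like:

val typeRegistry = DefaultTypeRegistry()
  .register[UserCreated]    // keeps its original identifier
  .registerPlaceholder()    // this slot used to be a now-deleted type; keeps the order stable
  .register[UserRenamed]    // still gets the same identifier as before the deletion
  .freeze()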

Summary

In this post I summarized the key features of the desert serialization library, and explained in detail how it supports changes to the data model while trying to keep maximal backward and forward compatibility.

In the next post I will show how the same library can be implemented in Rust, how the Scala solution maps to different concepts in the other language, and what difficulties I encountered during the migration process.