An issue that came up in the past was that we serialized a huge amount of information in an event. The event contained a structure that had a very innocent-looking property called TimeZoneInfo.
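A minimal sketch of what such a structure might have looked like; every member other than the TimeZoneInfo property is illustrative:

```csharp
using System;

// Illustrative sketch only: the real Cycle had more members, but the
// problematic part is the TimeZoneInfo property.
public struct Cycle
{
    public DateTime StartsOn { get; set; }
    public DateTime EndsOn { get; set; }

    // A full TimeZoneInfo instance, including all of its adjustment rules,
    // ends up in every serialized event that carries a Cycle.
    public TimeZoneInfo TimeZoneInfo { get; set; }
}
```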
After releasing the software, we noticed that the project was taking up an unusually large amount of space. After inspecting a couple of persisted events, we found that each time we used the struct Cycle, we persisted some 6200 lines of serialized JSON, of which about 6000 lines were attributable to the TimeZoneInfo. This severely impacted event serialization and deserialization. The issue came up after we had done the following assignment:
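Something along these lines would trigger it; a minimal sketch, assuming the time zone was resolved through the standard .NET API and reusing the illustrative members from above:

```csharp
// Illustrative example of the kind of assignment that caused the bloat:
// storing a full TimeZoneInfo on the Cycle means the whole object, adjustment
// rules included, is serialized into every event that contains the Cycle.
var cycle = new Cycle
{
    StartsOn = DateTime.UtcNow,
    EndsOn = DateTime.UtcNow.AddMonths(1),
    TimeZoneInfo = TimeZoneInfo.Local
};
```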
We decided that, in order to lower the amount of data, we needed to migrate the event store while keeping the old one live, to avoid downtime.
To that end, we decided to create a single deployable service (let's call it Migrator) that subscribed to the same events as the original application service, but would write the events directly to the new event store. Furthermore, once it booted, the Migrator would be responsible for copying data over from the old event store while applying the needed changes. In our case, we needed to modify all events that had the Cycle in them and replace the TimeZoneInfo with just a TimeZoneId, which is a simple string.
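A rough sketch of that boot-time copy, assuming hypothetical IOldEventStore and INewEventStore abstractions; the actual service, event types, and transformation are not spelled out here:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstractions over the two event stores.
public interface IOldEventStore
{
    IAsyncEnumerable<object> ReadAllAsync(CancellationToken ct);
}

public interface INewEventStore
{
    Task AppendAsync(object @event, CancellationToken ct);
}

public sealed class Migrator
{
    private readonly IOldEventStore _oldStore;
    private readonly INewEventStore _newStore;

    public Migrator(IOldEventStore oldStore, INewEventStore newStore)
    {
        _oldStore = oldStore;
        _newStore = newStore;
    }

    // On boot: copy the history into the new store, transforming affected
    // events on the way. New live events arrive through the regular
    // subscription, which is not shown in this sketch.
    public async Task CopyHistoryAsync(CancellationToken ct)
    {
        await foreach (var @event in _oldStore.ReadAllAsync(ct))
        {
            await _newStore.AppendAsync(Transform(@event), ct);
        }
    }

    // Placeholder for the transformation described above (replacing
    // TimeZoneInfo with TimeZoneId on events carrying a Cycle);
    // shown here as a simple pass-through.
    private static object Transform(object @event) => @event;
}
```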
We changed the structure of the Cycle to this:
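A minimal sketch of the reworked struct, keeping the same illustrative members as above; the change the text describes is the TimeZoneId string property:

```csharp
using System;

// Reworked Cycle: only the time zone identifier is stored.
public struct Cycle
{
    public DateTime StartsOn { get; set; }
    public DateTime EndsOn { get; set; }

    // A short string such as "FLE Standard Time" or "UTC".
    public string TimeZoneId { get; set; }
}
```

Since TimeZoneInfo.FindSystemTimeZoneById can rebuild the full object from its Id whenever it is needed, nothing is lost by persisting only the string.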
Data migration is the process of moving data from one system to another, and there are many reasons why a system may require such a move. To name the most common ones:
Natural system evolution, which requires the data to be optimized for performance or maintainability.
Legal issues, where some parts of the data have to be deleted or encrypted.
Bad data created by a bug in the system.
Business reasons, such as when businesses merge or split.
It is important that the business value of the data is not changed during the process.
There are many different strategies for when and how to do data migration. You must plan and execute carefully, because the damage could be significant.
Depending on the data volume, the migration process could take hours, even days. During that time there are many things that could fail and corrupt the data in an irreversible way. To avoid such scenarios, you should always migrate the data into a new storage repository.
Always migrate the data into a new storage repository.
Make sure the migration process does not overwhelm the live system. You should be in control of when the data is being migrated, so you can pause the migration during peak times of the live system. To achieve this, use a separate process to run the data migration, as sketched below. Always keep in mind that migrating data takes resources from your system, and you must account for that.
Use a separate process to run data migration.
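A minimal sketch of such a separate, pausable migration process; the batch size, the pause signal, and the abstract read/write hooks are assumptions (for example, a flag toggled from an operations dashboard and batched reads/writes against the two stores):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public abstract class MigrationWorker
{
    // Hypothetical hooks: how batches are read and written, and how pausing
    // is signalled, depends on the concrete stores and operations tooling.
    protected abstract bool IsPaused();
    protected abstract Task<IReadOnlyList<object>> ReadNextBatchFromOldStoreAsync(int batchSize, CancellationToken ct);
    protected abstract Task WriteBatchToNewStoreAsync(IReadOnlyList<object> batch, CancellationToken ct);

    // Copies the old data in small batches so the process can be paused or
    // stopped at any point without overwhelming the live system.
    public async Task MigrateAsync(CancellationToken ct)
    {
        const int batchSize = 500;

        while (!ct.IsCancellationRequested)
        {
            if (IsPaused())
            {
                // Back off while the live system is under peak load.
                await Task.Delay(TimeSpan.FromMinutes(1), ct);
                continue;
            }

            var batch = await ReadNextBatchFromOldStoreAsync(batchSize, ct);
            if (batch.Count == 0)
            {
                break; // the whole history has been copied
            }

            await WriteBatchToNewStoreAsync(batch, ct);
        }
    }
}
```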
When you are migrating a live system:
Create a separate process which migrates the existing data into the new data repository.
The live system must push any new data to the migration service. This can easily be achieved by sending it to a message broker, as sketched below.
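A broker-agnostic sketch of that hand-off; IMessagePublisher, EventForwarder, and the "event-migration" topic are assumptions standing in for whatever broker and wiring are actually in use (RabbitMQ, Kafka, Azure Service Bus, ...):

```csharp
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over the concrete message broker client.
public interface IMessagePublisher
{
    Task PublishAsync(string topic, byte[] payload, CancellationToken ct);
}

public sealed class EventForwarder
{
    private readonly IMessagePublisher _publisher;

    public EventForwarder(IMessagePublisher publisher) => _publisher = publisher;

    // Called by the live system right after it persists a new event locally,
    // so the migration service receives everything written while the
    // historical data is still being copied.
    public Task ForwardAsync(object @event, CancellationToken ct)
    {
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(@event, @event.GetType());
        return _publisher.PublishAsync("event-migration", payload, ct);
    }
}
```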