Events Patterns: Message Relay with Change Data Capture
In the previous article about the Outbox pattern I covered a problem of how to create events and store events safely. But the next step of the events exchange process is to transfer these events to other services. It is usually served by the message broker, but still there is a question how to pass events from the outbox table to the broker. That’s why an additional node, called Message Relay is needed.
Message Relay Pattern
So, we need to detect all the changes in one storage and copy them to another one. Of course this task is not new in software engineering, so the Message Relay is just a specific case of a more general concept called Change Data Capture (CDC). CDC is not a single pattern but rather a set of patterns and principles applied to this complicated task.
Change Data Capture is widely used in distributed systems world, because it can be applied to a big range of tasks, such as:
- Extract Transform Load (ETL) pipelines
- replication
- cache invalidation
- search index update
- third-party services integration
- notifications and alerting
The case with passing events to the message broker is a specific problem CDC may be applied for and it has its specific requirements, which make the solution more complex in some aspects and more simple in others. What are these specific features we need?
- “At least once” guarantee
We can’t lose any events, because it’s the main goal we started diving in this all. The mentioned before Outbox pattern is used exactly for this purpose and it would be pointless to lose events right on the next step.
2. Order is important
OrderCancelled event shouldn’t be pushed to the message broker before the OrderPaid event, otherwise it wouldn’t be handled by consumers properly. It will require additional efforts to handle such violations of ordering and sometimes it’s not even possible to do.
3. Immutable data
Fortunately events are immutable and can’t be changed or deleted which significantly simplifies CDC logic. Message relay should only detect insert operations and pass the same rows to the broker.
Types of CDC
Approaches to Change Data Captures are usually split by the way the change is detected and in this article I’ll discuss 3 possible options.
- Push-based
This way is based on the ability of the database to detect changes and perform some action after they are committed. Usually such actions are called triggers. Database trigger is a user-defined procedure automatically executed after certain events on a particular table, view or database. Triggers are supported by all the modern RDBMS’s (MySQL, Postgres, MS SQL Server, Oracle, etc.) . As for NoSQL databases, some of them, like MongoDB[2] or Amazon Dynamo DB[3], also support triggers or streams that allow the creation of the same behavior.
This approach has more drawbacks than perks.
Pros:
- as it may be seen from the picture, implementation is quite simple, at least no additional services are required
- probably it’s the most performant way, as changes are propagated right after they are committed
Cons:
- code in the database is usually not a good idea even if there is no business, but integration logic
- it may affect database performance
- database now should know about message broker existence, how to connect it, how to format events for it, etc.
- it should handle message broker outages and other errors
- if the system stores data in different databases, it should be implemented for each of them
2. Poll-based
Another approach is to have a separate service dedicated to event transfer from the database to the message broker. This service constantly polls an events table, looking for new records, pushes them into the broker and marks published events. This will probably require creation of additional columns in the events table to save information which events were already synchronized and which weren’t.
As the previous option (and probably everything in software engineering) this one has its own pluses and minuses.
Pros:
- doesn’t require any additional technologies support from the source database
- should be quite easy to implement
- database and message broker don’t know about each other
- since it’s a separate service, transfer logic may be changed without touching source and destination nodes
Cons:
- polling always means some delay, so it’s not so performant
- probably will require changes in the source table
3. Log-based
The third approach is based on the fact that most of the modern databases have their own log of operations. This log is intended for data consistency and may be replayed if the database was shut down incorrectly or on the replica node to keep it synchronized. Each database has its own term for this log: Write-Ahead Log for Postgres, Binary Log in MySQL, OpLog in Mongo, but the purpose is the same. As this log works for data replication, it also fits well for Change Data Capture, because the task is almost the same.
As always there are some pros and cons here.
Pros:
- near-real time performance
- no need to do additional requests to the database, since it works on the lower level
- no need to change table schema
- no dependency between the database and the message broker
Cons:
- log-tailing should be implemented for each database you have
- as it works on the lower level comparing to the poll-based approach, it’s more complicated to implement
Tools
Fortunately, it’s not necessary to always build a CDC-system from scratch. There are several solutions available, which you can choose and adapt for your needs.
Debezium is the most popular product on this market. It’s free, open-sourced, distributed and fast. Debezium is built on top of Apache Kafka and supports monitoring Postres, MySQL, MongoDB and SQL Server. Also there are preview versions of Oracle and Cassandra connectors.
Precisely is a company that provides a lot of tools and services for data integration. And one of its services is Connect, a tool that allows both batch and real-time data ingestion. The list of supported storage is big and includes not only all the popular relational databases, but also some legacy storages and even semi-structured data sources.
Meroxa is a data-orchestration platform with a built-in data-application framework Turbine, which promises you to take all the data streaming and integration responsibilities and allow developers to focus on their domain. It supports PostgreSQL, MySQL, MongoDB, Microsoft SQL Server, ElasticSearch and Azure Cosmos DB as sources and several options for destinations, like Snowflake, Redshift or another database.
At most once
If we try to imagine the simplest implementation of Message Relay, its work will include several steps:
- read new records from the events table
- push them to the message broker
- mark them as synchronized to ignore them next time
But nobody can guarantee that nothing will happen between the second and the third steps. The service may crash and after restart it will push the same events one more time. That’s why most of the tool and custom implementations provide only “at least once” guarantee. Even though it’s possible to implement “at most once” there also, but it will be non-trivial and may affect the performance, so if you want it, it’s better to leave it for the consumer (and this is a separate topic :) )
Conclusion
Which way to choose? It always should depend on your tasks requirements. If you want to have full control over the implementation, avoid vendor-lock and a small delay in event processing is not crucial, then probably a poll-based option is for you. The log-based option will give you better performance and reduce load on the database by avoiding repeating requests, but you will likely rely on some third-party solution. In any case having a reliable CDC-system in your architecture may be very important and help you building your distributed system more effectively.