Talks Tech #32: Leading Technical Transformations at Projects
Guest: Anastasiya Meishutovich, Senior Software Developer / Technical Lead at Duco
Anastasiya Meishutovich, Senior Engineer and Technical Lead of Reconciliation at Duco, shares her talk, “Leading Technical Transformation at Projects While Releasing New Features Regularly.” She discusses how her team used traffic mirroring between their existing and new service, keeping their transition seamless to the user, and the importance of transparency between teams.
Duco is a fast-growing company. We help our clients manage data at scale and solve consolidation, standardization, and reconciliation problems. The increasing number of clients, the volumes of data users submit daily, and the complexity of requests raised the need to improve performance by making fundamental changes in the architecture of our SaaS platform. At some point, our users started noticing some performance degradation, for example, when they worked with substantial reconciliation results. After some immediate performance enhancements, we realized that it was not enough. The growing volume and diversity of data, the complexity of filter conditions, and the growing number of active users were pushing us to a completely different architecture. The main challenge here was the access and filtering of data at scale. Our clients wanted to work with the reconciliation results with almost zero latency.
This is how the idea of creating a custom cache layer on the basis of the ClickHouse database was born. The primary source of truth was still MySQL, our main database. Only when we need it, we copy data from MySQL, our primary source, into ClickHouse and build on top of that a temporary reconciliation result view in ClickHouse. A ClickHouse view, by default, is optimized for heavy reads and diverse filtering conditions. The nature of ClickHouse, a column-based database, impressed us with the performance of read queries specifically, but also with how simple it is to work with and integrate. We found it similar to how we work with MySQL. We had multiple spikes and proofs of concept demonstrating dramatic improvements and potential.
The engineering team decided to move processing of reconciliation result requests to the new service on top of ClickHouse. This was a really large technical transformation project in the core of our application, aimed at improving performance and scalability and at yielding benefits from adopting new technology. Despite the scope and risks, it was crucial. Our engineering team had to guarantee zero regression and no disruption for users. Users should see only the result of this project: better performance and no functional changes.
We were keeping in mind that there are other teams. They're changing code actively, and they're releasing new features actively as well. We didn't want to block them. We started with detailed technical documentation, and because the scope was enormous, we split it into deliverable chunks to avoid a waterfall approach. The very first milestone represented the minimum valuable product (MVP). In our case, it was showing static reconciliation results for just one type of process.
At that stage, we had to build a data model for the ClickHouse database, synchronize with MySQL to get data from one database to the other, implement API controllers, and allow those APIs to serve the front end. Every following delivered chunk of logic enabled a new function for the user: support for new process types, adding and editing data, filtering or sorting data, recovery from issues, cleanup, et cetera. The scope of each milestone was minimal.
Apart from functional aspects, we defined non-functional requirements. We agreed on testing. We included performance, availability, and resilience suites. They were all separate suites because they had separate targets. We discussed the rollout plan, and we had to define metrics. We had to answer the questions: how do we know something is working? How do we know that something is performing well? In the beginning, it was just the success-failure ratio, the read-write ratio, the round-trip time of a request, and the cache rebuild time, plus some ClickHouse internal metrics.
Apart from planning and estimation purposes, we used the technical document as a reference for other teams, for example, the infrastructure team. We needed a testing environment with preloaded data and support for going live. We had to communicate with them and explain the project, what we needed, and when. We also had to share it with other teams because their services and pipeline plans might be affected. All our decisions aimed to minimize risk and to develop in a way that makes us confident in our releases. We set a goal of getting feedback and metrics as early as possible. Small, direct iterations that integrate changes early and frequently became essential.
The very first thing was introducing a feature switch. It allowed us to merge code, together with unit and integration tests, into the main branch but hide it from users until we were ready. The code was not being used because the feature switches were off. The new code wasn't yet ready to substitute all functionality, even in the QA and production environments. It was less than ideal for us because no interaction with the logic happened, not even in a test environment. We got zero feedback, zero validation, and zero bugs found, and other teams could also break our code without even noticing it. It was dangerous to leave it like that, and it was all because the code wasn't actively used.
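As an illustration of the feature switch idea, here is a minimal sketch in Python. It is not Duco's implementation: the switch name `clickhouse_cache`, the in-memory `FeatureSwitches` registry, and both handler functions are hypothetical stand-ins. It only shows how merged code can stay dormant until the switch is turned on.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSwitches:
    """Hypothetical in-memory registry; a real system would load it from config."""
    enabled: set[str] = field(default_factory=set)

    def is_on(self, name: str) -> bool:
        return name in self.enabled


def legacy_results(process_id: str) -> dict:
    # Stand-in for the existing code path backed by the primary database.
    return {"process": process_id, "source": "mysql"}


def cached_results(process_id: str) -> dict:
    # Stand-in for the new code path backed by the cache layer.
    return {"process": process_id, "source": "clickhouse"}


def get_results(process_id: str, switches: FeatureSwitches) -> dict:
    # The new path is merged but only reachable when the switch is on.
    if switches.is_on("clickhouse_cache"):
        return cached_results(process_id)
    return legacy_results(process_id)
```

With the switch off, every caller keeps getting the legacy path, which is exactly why the new code receives no feedback until something exercises it.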
The immediate solution was writing some end-to-end tests to prevent our code from being broken by other teams. Then we integrated the end-to-end tests to be run as a part of the pipeline in two modes: with the feature switch on and with the feature switch off. We were testing two flows: regression testing and feature testing. The main challenge was proving that the changes we deliver (a) don't bring regression in functionality and (b) improve performance. Proving is the keyword here because the testing tools we used at that point had quite some disadvantages. For example, unit and integration tests cover only specific parts of the functionality. End-to-end tests and performance tests heavily depend on data characteristics and volumes in the database, the types of requests, and their frequency.
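Running the same end-to-end scenario in both modes could look roughly like the sketch below, which uses Python's standard `unittest` module. The `fetch_results` function and its `use_new_service` flag are hypothetical placeholders for a call through the application's public API with the feature switch off or on.

```python
import unittest


def fetch_results(process_id: str, use_new_service: bool) -> dict:
    # Hypothetical stand-in for an API call; the flag plays the role
    # of the feature switch in the environment under test.
    rows = [{"id": 1, "status": "matched"}, {"id": 2, "status": "unmatched"}]
    served_by = "new" if use_new_service else "main"
    return {"process": process_id, "rows": rows, "served_by": served_by}


class ReconciliationResultsE2E(unittest.TestCase):
    def test_results_in_both_modes(self):
        # Switch off checks for regression; switch on checks the new feature.
        for switch_on in (False, True):
            with self.subTest(feature_switch=switch_on):
                response = fetch_results("p1", use_new_service=switch_on)
                self.assertEqual(len(response["rows"]), 2)
                self.assertEqual(response["rows"][0]["status"], "matched")
```

Using `subTest` keeps a single scenario definition, so the two modes cannot drift apart, and a failure report names the mode that broke.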
There were so many elements in the equation that you could never identify them all correctly. Such tests are imperfect because the scenarios reflect the developers' understanding of how clients use the platform. With a growing number of diverse clients, it's impossible to guarantee that this understanding is complete. No QA database could be as extensive as a production database. It's costly to support such a vast database because different clients have different configurations. To summarize: to prove we achieved the goal, we needed production data, production volumes, and real-life scenarios simultaneously.
Traffic mirroring solved this problem for us and helped us make this project successful. This is a concept for the production environment. We deployed a new service, and we set up a ClickHouse database. We exposed new API endpoints alongside the existing service and its API. Nothing changes for users: they still interact with the main API, and it delegates requests to the main service. Every request to the main service is now mirrored to the new service API, which means the same user request is processed twice, by the main service and by the new service. The user still gets a response from the main service, from the primary database. It's zero risk because nothing changes from the user's perspective. Our new service works with real data and real requests. We log metrics and check each response's correctness by verifying that both responses are the same. We are gathering statistics in the background.
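The mirroring flow described here can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual service code: `main_service`, `new_service`, the in-memory `log` list, and the generated request ID are all hypothetical. The point is that the user's response comes only from the main service, while the mirrored call runs in the background.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

# Background workers so the mirrored call never delays the user's response.
_mirror_pool = ThreadPoolExecutor(max_workers=2)


def main_service(request: dict) -> dict:
    # Stand-in for the existing service backed by the primary database.
    return {"rows": [1, 2, 3]}


def new_service(request: dict) -> dict:
    # Stand-in for the new service backed by the cache layer.
    return {"rows": [1, 2, 3]}


def handle(request: dict, log: list) -> tuple[dict, Future]:
    request_id = str(uuid.uuid4())  # shared ID to pair both log entries later

    response = main_service(request)
    log.append({"request_id": request_id, "service": "main", "response": response})

    def mirror() -> None:
        mirrored = new_service(request)
        log.append({"request_id": request_id, "service": "new", "response": mirrored})

    # The future is returned only so a caller (or a test) can wait for it.
    future = _mirror_pool.submit(mirror)
    return response, future
```

The response handed back to the user is always the one from `main_service`; the mirrored result is only logged for later comparison.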
We had to ensure traffic mirroring wouldn't affect our system's functionality. We had to prevent errors and slow responses. Traffic mirroring had its own feature switch to separate some necessary code adjustments from the feature implementation. In traffic mirroring mode, there were no modifications to the main database. We didn't want any changes to happen because of our testing. In case of any errors in the traffic mirroring logic, we caught them, logged them, and prevented any further propagation. We decided not to implement any real-time comparison tool because we were cautious of the possible latency it might bring.
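One possible shape for these safeguards is sketched below in Python. It is an assumption-laden illustration: the `READ_ONLY_ACTIONS` set, the `action` and `id` request fields, and the `mirror_safely` helper are hypothetical, and skipping non-read requests is just one way to keep mirroring from modifying data.

```python
import logging

logger = logging.getLogger("traffic_mirroring")

# Hypothetical list of request types that never modify data.
READ_ONLY_ACTIONS = {"list", "filter", "sort"}


def mirror_safely(request: dict, new_service, mirroring_enabled: bool) -> bool:
    """Return True if the mirrored call completed, False if skipped or failed."""
    if not mirroring_enabled:
        # Mirroring has its own switch, independent of the feature itself.
        return False
    if request.get("action") not in READ_ONLY_ACTIONS:
        # Never mirror anything that could modify data.
        return False
    try:
        new_service(request)
        return True
    except Exception:
        # Catch, log, and stop: a mirroring error must not reach the user.
        logger.exception("mirrored request failed: %s", request.get("id"))
        return False
```

Catching a broad `Exception` is deliberate in this sketch, because any failure in the mirrored path is less important than the user's real request.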
Instead, we logged the metrics of both requests independently, and we could compare responses and their metrics by finding the same request ID in the log messages. Two responses with the same request ID are comparable. We parse log messages with Elasticsearch, so those logs were available in Kibana and Grafana. We could visualize them, build dashboards, and use those metrics for monitoring, for automatic comparison, and for observing trends. We established an upper limit on request processing time for the new service and created an alert based on that in Grafana. Before enabling traffic mirroring on production, just for one process, we tested the Grafana dashboards for monitoring and alerting on QA environments. We didn't enable traffic mirroring for all processes and all clients simultaneously. We were enabling one process at a time. It was a very incremental change.
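The offline comparison by request ID can be illustrated with a small Python sketch. The log entry fields (`request_id`, `service`, `duration_ms`, `response_hash`) and the `max_new_ms` threshold are assumptions made for this example; the talk does not describe the actual log format or the Grafana alert configuration.

```python
from collections import defaultdict


def compare_by_request_id(entries: list[dict], max_new_ms: float) -> list[dict]:
    """Pair 'main' and 'new' log entries that share a request ID."""
    by_id: dict[str, dict[str, dict]] = defaultdict(dict)
    for entry in entries:
        by_id[entry["request_id"]][entry["service"]] = entry

    report = []
    for request_id, pair in by_id.items():
        if "main" not in pair or "new" not in pair:
            continue  # not mirrored (or not logged yet): nothing to compare
        main, new = pair["main"], pair["new"]
        report.append({
            "request_id": request_id,
            # Correctness: both services returned the same response.
            "same_response": main["response_hash"] == new["response_hash"],
            # Performance: time saved by the new service for this request.
            "saved_ms": main["duration_ms"] - new["duration_ms"],
            # Alert condition: the new service exceeded the agreed upper limit.
            "too_slow": new["duration_ms"] > max_new_ms,
        })
    return report
```

Because the comparison runs over logs after the fact, it adds no latency to the request path, which matches the decision to avoid a real-time comparison tool.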
Not everything went smoothly. There were so many small moves and steps that required a lot of communication and approvals. We were touching the core logic of our application, of our system. We had to be transparent with other teams and keep them aware of what was happening. We underestimated the organizational overhead, and it slowed down the process a little bit. We learned to plan ahead, and we now communicate even earlier than we used to. Sometimes we had to apply many adjustments in the code to prevent database modification or to be careful with the effects of traffic mirroring. We worried about performance, null pointer errors, et cetera. Because of the specifics of our code base, we couldn't mirror every single request: we have some specific filtering requests and exceptional types. It would be costly to support everything. We decided to skip those because we don't need a perfect tool that covers all possible requests.
Don't let perfection be the enemy of your team. Every time, while developing the next stage, we gathered statistics from the traffic mirroring feature. Then we would repeat the cycle. We build, we release, we monitor in a low-risk way, and then we make the next milestone, we release, and then we see how things are performing. Eventually, we gathered personalized statistics from the production environment. We were able to prove the readiness of the service to be enabled fully. We had this opportunity to have statistics for each client before enabling the feature entirely. We could enable traffic mirroring first, ensure that the client's case is ready for the custom cache, and then enable the custom cache for them. It was like a very personalized proving tool to lower the risk. Many of our clients already benefit from using the new cache service and the ClickHouse database. We plan to expand the functionality even further now that we have a reliable process to achieve that.