Talks Tech #38: Scaling Federated Tech Stack: Managing Overly Diverse Technology

Written by Lulu Cheng

Lulu Cheng, Software Engineer, Fintech, Data/AI/ML @ Block, shares her talk, “Scaling Federated Tech Stack: Managing Overly Diverse Technology.” She discusses the consequences of having too many or too few tools configured into your stack and how to find the balance for optimal performance.

Many software engineers have had this experience: you join a company and start with a group of people who are very good with Java. Then the company grows, and we add Python to the stack. The company grows again, adding Scala, Kotlin, Ruby, Go, et cetera. Now we have five, six, seven, or even eight languages in our tech stack. In a way, it’s reasonable, and it enables everyone to work with what they are most familiar with. Over time you’ll see languages start to drop.

While building, you’re not just working with the tool itself. It needs to integrate with other things. An application needs a testing framework. It needs a CI/CD pipeline. It needs a vulnerability scan. There might be internal credential management tools, and you need a client library for those. There’s only one platform team, and we don’t scale as technology gets added to the stack. We can only provide so much support. We are in a world where technology is constantly changing. We always want to add the newest and greatest technology to our tech stack. We also want to provide good enough support. We want to ensure our developers are comfortable developing secure and safe software.

If you work in the industry, you have probably gone through different migration projects. Not every tech stack change has to be a migration. I’ll talk a lot about data and machine learning and perhaps less about security or governance. That’s not to say this will not work for CI/CD pipelines, observability pipelines, or other platforms. It’s because my background is very data and machine-learning intensive.

Platform engineering is a newer term. To my understanding, it is only in the past four or five years that I started hearing people refer to themselves as platform engineers. In a way, it’s not a new role. We have always needed people running apps or keeping the logging, monitoring, and data pipeline platforms up and running. That hasn’t changed. What’s changed is the expertise required to run these platforms. Now, it’s easy to find a vendor providing a hosted version: Confluent has hosted Kafka, and Databricks has hosted Spark. I forgot what’s the company behind Cassandra, but there’s also a hosted Cassandra, hosted Accessor, and hosted Airflow.

A lot of things that DevOps used to need to do, they no longer need to do. In platform engineering, we can focus more on integrating with our company’s internal business needs. For example, I work in FinTech. FinTech is a heavily regulated industry. A lot of times, our data is under a lot more scrutiny than in some other industries. We need to make sure there are proper permission processes. We need to make sure there’s encryption. We must ensure that data sharing from platform to platform is up to specific standards.

A product team might ask, “How do I get data from Kafka to Spark or from Kafka to Snowflake?” As platform engineers, we figure it out for them and provide the best pathway. There are challenges in doing so. We are creating integrations between platforms. What if we want two streaming platforms that do similar but slightly different things? How do we keep both evolving so that they can meet our security or governance needs? I think about these challenges as two extremes. On one extreme, we can add too many platform tools to the stack. On the other, we can end up with too little technology in the stack. The platform engineer can say, “No, we disapprove of any new tech stack because the functionality already exists. We only allow Kafka in our ecosystem. Everything else, you have to figure out how you can use what we have.”

There are two extremes. Neither extreme is good. As a platform team, we want to create a pathway that makes it easier to add tools to our existing tech stack, or to consolidate, without incurring a ton of work. How do we optimize our tech stack for a good enough development speed for both the product and platform engineering teams? If we have too few tools, the product team might build their own, and it might slow them down. Conversely, if we give them many tools, it gets tough to manage over time. Eventually, everything will slow down because the platform team only has so many people.

Engineers want to bring new tools to the stack for a reason. There’s new functionality, so we can save them time from rewriting that functionality. There might be a new security feature or a new governance feature. If you don’t bring new tools to your tech stack, you must reinvent the wheel repeatedly.

Security and governance are areas where we need experts in the group to help us navigate what kind of encryption will give us enough confidence in terms of security and what type of vulnerability scan, frequency, or exposure is appropriate. Security and governance are very important. They’re often overlooked in day-to-day life as a software engineer. Different tech stacks require different vulnerability scans, permission models, and isolation models. When you have few tools, you can reach that consistency quickly because you only have so many. You must ensure you can apply similar standards to all these different tools.

When you have a lot of tools, it gets hard. The more platforms you have, the more pathways you have to consider. When you have a ton of tools, it won’t be straightforward. Every tool can potentially interact with every other tool. You have to think about those scenarios. Even though that’s not a day-to-day concern, it’s extremely important. When platform teams evaluate a tech stack, we want to ensure it’s up to the standard. How do we have the right number of tools in our tech stack, where it’s fast enough, maybe not the fastest, but also not the slowest? You want to be able to add or remove tools in a relatively easy way. I’ve worked on different platform engineering teams. I’ve experimented repeatedly and started to see that if we provide the right level of abstraction on our platforms, we can build together with product teams or other external teams. We can do that so that the stack is still somewhat consistent but allows them to experiment on the correct layer where they need to add different features.
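To make the point about pathways concrete, here is a minimal Python sketch that counts the directed source-to-destination pairs a platform team might have to consider. It assumes the worst case, where every platform can send data to every other platform; the function name and the platform lists are illustrative, not something from the talk.

```python
from itertools import permutations


def integration_pathways(platforms):
    """Return every ordered (source, destination) pair of distinct platforms."""
    return list(permutations(platforms, 2))


# Illustrative stacks; the platform names are examples only.
small_stack = ["Kafka", "Spark", "Snowflake"]
large_stack = small_stack + ["Cassandra", "Airflow", "S3"]

print(len(integration_pathways(small_stack)))  # 3 * 2 = 6
print(len(integration_pathways(large_stack)))  # 6 * 5 = 30
```

Doubling the number of platforms from three to six raises the number of potential pathways from 6 to 30, which is why each added tool costs more to secure and govern than the one before it.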

Let’s take Kafka, for example. To run Kafka, you need some basic computing resources. That’s the lowest layer. That’s the infrastructure deployment. This lowest layer is for compute provisioning and maybe some basic networking. You’d set it up so that your Kafka is running on a few EC2 instances within its own private network, which is good enough at that level. Then you need to allow your cluster to interact with other systems. You need to integrate with the company’s networking. There might be different security configurations here than at your base layer. At the base layer, you’re only saying that the machine role and the CI/CD role are able to scale the instances up or down. To talk to different platforms, you need additional IAM that specifically says this Kafka cluster can speak to X, Y, and Z under certain conditions.

On top of that, there’s another abstraction layer. Most product teams probably don’t worry much about what instances Kafka runs on, or what networking and security requirements it needs to comply with. All they care about is whether their application can publish to this Kafka cluster. They don’t need to know everything about how Kafka works. They want to get their data from their application to somewhere else. That’s another layer. Also, there’s automation and integration. Your data will not magically appear on S3 just because you published it to Kafka. As a platform team, we often build all these integrations with different technologies. Let’s say you published to Kafka and want the data to appear in Snowflake. We need to develop some automation and integration there. That’s another layer on top that most people probably won’t even notice.
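As a rough illustration of the four layers described so far (infrastructure, networking and security, user configuration, and automation and integration), here is a minimal Python sketch. Every class, field, and default value is hypothetical and exists only to show how the layers can be kept separate; it is not how any particular team models its Kafka platform.

```python
from dataclasses import dataclass, field


@dataclass
class InfrastructureLayer:
    """Layer 1: compute provisioning and basic networking."""
    instance_type: str = "m5.large"  # illustrative value
    instance_count: int = 3


@dataclass
class NetworkSecurityLayer:
    """Layer 2: which other systems the cluster is allowed to talk to."""
    allowed_peers: set = field(default_factory=lambda: {"s3", "snowflake"})


@dataclass
class UserConfigLayer:
    """Layer 3: what product teams see, namely topics they can publish to."""
    topics: dict = field(default_factory=dict)  # topic name -> owning team

    def register_topic(self, topic: str, team: str) -> None:
        self.topics[topic] = team


@dataclass
class IntegrationLayer:
    """Layer 4: automation that moves published data to other platforms."""
    sinks: dict = field(default_factory=dict)  # topic name -> destinations


@dataclass
class KafkaPlatform:
    infra: InfrastructureLayer = field(default_factory=InfrastructureLayer)
    network: NetworkSecurityLayer = field(default_factory=NetworkSecurityLayer)
    user_config: UserConfigLayer = field(default_factory=UserConfigLayer)
    integrations: IntegrationLayer = field(default_factory=IntegrationLayer)

    def add_sink(self, topic: str, destination: str) -> None:
        """Wire a topic to a destination, checked against the lower layers."""
        if topic not in self.user_config.topics:
            raise KeyError(f"unknown topic: {topic}")
        if destination not in self.network.allowed_peers:
            raise PermissionError(f"{destination} is not an approved peer")
        self.integrations.sinks.setdefault(topic, []).append(destination)


platform = KafkaPlatform()
platform.user_config.register_topic("payments", "checkout-team")
platform.add_sink("payments", "snowflake")
print(platform.integrations.sinks)  # {'payments': ['snowflake']}
```

In this sketch, adding the Snowflake sink touches only the integration layer, while the check against `allowed_peers` keeps the change inside the existing security rules.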

It is important to abstract your platform architecture this way because, when people want to bring in new features or consolidate some existing configuration, the work can often be assigned to one of these four layers. Let’s say the user wants to add another configuration option. You don’t need to create a new Kafka cluster. You need to modify the user configuration layer. Let’s say you want to bring a new tool into your existing ecosystem. You don’t need to redo the whole bottom three layers. You need to rethink how you will adapt the automation and integration layer to that new system. Here is a good example. Often the feature request will not come from the platform engineering team; the product team asks for it. They might sometimes need to touch the bottom two layers. Separating these layers from the upper layers allows them to contribute to your existing setup in a way where they still have to conform to the current rules. They’re not going to create a new network just because they want to change the deployment of the Kafka cluster slightly. They’re still going to use the existing networking rules, but they might be able to change an underlying parameter of your infrastructure layer. They no longer need to be networking experts. They no longer need to bring up a new stack. They have a way to contribute to your existing stack and modify it a little bit so they can enjoy that feature. By providing all these abstraction layers, you allow the partnering teams to contribute in a way that cannot deviate too much from your existing policy.
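One way to picture a product team changing an infrastructure parameter without deviating from platform policy is a small validation step in front of the infrastructure layer. The Python sketch below is an assumption about how such a guardrail could look. The names `BASELINE`, `ALLOWED_OVERRIDES`, and `apply_override`, and all of the values, are hypothetical.

```python
# Settings owned by the platform team (hypothetical values).
BASELINE = {
    "instance_count": 3,
    "instance_type": "m5.large",
    "network": "platform-managed",
}

# Parameters that partnering teams may change, and the values they may pick.
ALLOWED_OVERRIDES = {
    "instance_count": range(3, 13),
    "instance_type": {"m5.large", "m5.xlarge"},
}


def apply_override(baseline, requested):
    """Return a new config with the requested changes, or raise if not allowed."""
    result = dict(baseline)
    for key, value in requested.items():
        if key not in ALLOWED_OVERRIDES:
            raise PermissionError(f"{key!r} is managed by the platform team")
        if value not in ALLOWED_OVERRIDES[key]:
            raise ValueError(f"{value!r} is outside the allowed values for {key!r}")
        result[key] = value
    return result


# A product team scales the cluster but keeps the existing networking rules.
updated = apply_override(BASELINE, {"instance_count": 6})
print(updated["instance_count"])  # 6
print(updated["network"])         # platform-managed
```

A request to change `network` would raise `PermissionError`, which mirrors the idea that partnering teams can tune the deployment but cannot create a new network of their own.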