The Evolution of DevOps and the Birth of MLOps

September 6, 2022

Episode

MLOps Weekly Podcast

Chief Strategy Officer, DataStax

Transcript:
[00:00:03.000] - Simba Khadder

Hey everyone, I'm Simba Khadder, and you're listening to the MLOps Weekly Podcast. This week, I'm so excited to be speaking with Sam Ramji. He is the Chief Strategy Officer at DataStax, DataStax being the leading provider for commercial solutions of Apache Cassandra. He's a 25-year veteran of the Silicon Valley and Seattle tech scenes.

[00:00:24.080] - Simba Khadder

He led Kubernetes and DevOps, and product management for Google Cloud. He founded the Cloud Foundry Foundation, and he helped build two multibillion-dollar markets: API management at Apigee, and an enterprise service bus at BEA Systems. He redefined Microsoft's open-source strategy from extinguish to embrace. He's an advisor to multiple companies, including Dell, Accenture, Observable, Fletch, Orbit, OSS Capital, the Linux Foundation, and ourselves here at Featureform.

[00:00:51.710] - Simba Khadder

He received his BS in Cognitive Science from UC San Diego and is still excited about AI and neuroscience and cognitive psychology. He's at the intersection of AI, databases, and more in his current role at DataStax, and I'm really excited to be able to speak to him today. Sam, it's so good to have you on the podcast today.

[00:01:09.820] - Sam Ramji

Simba, it's great to see you, man.

[00:01:11.260] - Simba Khadder

I would love to start... I gave a quick overview of what you've done, but in your own words, what brought you to DataStax? What made you want to work on this problem?

[00:01:20.670] - Sam Ramji

There's a couple of key things, and it was my experiences at Google and then at Autodesk. At Google, I was recruited to be Vice President of Product Management for GCP because of the expansion of Kubernetes.

[00:01:34.220] - Sam Ramji

They obviously needed a product management executive to look after a bunch of the businesses. But those businesses were basically doing fine, like the Compute Engine business; it was virtual machines. There was an amazing business in Firebase that had been acquired, and it was growing gangbusters.

[00:01:48.180] - Sam Ramji

But the exciting challenge was Kubernetes, and that got me into DevOps and DevOps infrastructure, which was awesome. I got to learn how my partner-in-crime, my engineering partner, Melody Meckfessel, who'd been at Google for about 14 years, had built a DevOps infrastructure that kept 44,000 engineers at Google super productive.

[00:02:08.950] - Sam Ramji

But as we tried to figure out, "How do we take this to the world?", we were like, "We're going to have to do this through Kubernetes." So Kubernetes is the first big factor. Kubernetes, of course, is designed for stateless workloads. It's all about moving containers, immutable infrastructure, moving those things back and forth into and out of production, and being able to scale up all these Docker-based workloads.

[00:02:30.090] - Sam Ramji

Then I went to Autodesk, where they needed a cloud platform for the world's largest engineering software company, because it was no longer good enough to have desktop software doing all of your industrial automation.

[00:02:41.690] - Sam Ramji

You needed to connect each desktop software user to another; you needed to be able to ask bigger questions about, "Hey, is my motorcycle project on time?", knowing that you've got a dozen mechanical engineers working on it. How do you get that to tell you the truth? That's all about the data infrastructure.

[00:02:56.610] - Sam Ramji

We brought Kubernetes to Autodesk, and that was awesome. Autodesk had already built out this big infrastructure for data built on Apache Cassandra. I had already seen Apache Cassandra many years before, as the security infrastructure for tokenization and token management at scale at Apigee, where I had the privilege of being Chief Strategy Officer. We took it public in 2015. So I had some experience with Cassandra, but it was way out of date.

[00:03:21.990] - Sam Ramji

Seeing what the Autodesk folks were doing with Cassandra was mind-boggling because we had nearly a thousand nodes, I think, when I left; five petabytes under management of the Cassandra ring. It was all of this live dynamic information connecting all these different applications and people predominately in construction, but also in mechanical engineering with supporting Autodesk Fusion.

[00:03:41.020] - Sam Ramji

It was also way too hard. We had a dozen FTEs in the data team looking after Apache Cassandra, and some of them were making sure it operated properly: instrumenting it with Amazon Web Services; making sure that developers could use it; building SDKs, libraries, reliability kits, all these different things to use it properly.

[00:04:00.550] - Sam Ramji

But doing that yourself means that you're not using those people to solve other, harder problems. So I just felt Cassandra was so excellent, but it should be easier to use. Then, I got an opportunity to go work with a CEO who I'd worked with before at Apigee. He came to DataStax, and we started talking and I said, "Cassandra's awesome, and Cassandra plus Kubernetes could be really amazing, because what if you can start to solve stateless and stateful workloads on the same core infrastructure?" It can all be really smart. It can all scale really well, and you can do this lights-out operation.

[00:04:31.530] - Sam Ramji

So if you're a microservices developer or whatever you happen to be doing, you can trust that the data is going to be there, it's going to be really fast, and it's going to scale properly. That's what took me to DataStax.

[00:04:41.190] - Simba Khadder

That's awesome, and I love the... You have this mission or initiative. I'm sure, over time, that's changed, or it's expanded, or different things have taken priority. How are you thinking about MLOps in particular?

[00:04:55.500] - Sam Ramji

MLOps is a fascinating field. There's so much value being created in machine learning. There's lots of ways to measure it. Obviously, there's lots of investment. There's lots of startups being created. There's lots of people getting jobs in the field, and you're seeing old-line companies, train companies, home and do-it-yourself companies installing a Chief ML Officer or a Chief AI Officer. That's when you know something is really coming into the world.

[00:05:20.200] - Sam Ramji

MLOps itself, I think about in three ways. One is that it's really, really early. When I look at the practices of MLOps, and obviously we're borrowing the term from DevOps, MLOps sounds like it's maybe as mature as DevOps, but it's nowhere near. If you can imagine DevOps without Git, you would go, "Well, what is that?"

[00:05:39.770] - Sam Ramji

We didn't have DevOps when we were all on CVS and Subversion because there was a linearity to the flow of how you used software, and you didn't have the ability to rewind your whole infrastructure. You didn't think, "Oh, this Git-based, merge-pull structure enables me to test out things really quickly in production, take it forward, and if it blows up, rewind it." That rewinding capability is core to DevOps; that's what makes it safe to go fast.

[00:06:06.510] - Sam Ramji

MLOps lacks infrastructural elements that would make it easier, like versioning of the data or a rewind capability. And there's so many different ways people do MLOps. It's really diverse; it's all over the map. So the first thing is that it's early.

[00:06:20.590] - Sam Ramji

The second thing is that MLOps is incredibly valuable because it does create an environment where you can predictably make better decision engines faster. Every business is basically a decision factory.

[00:06:33.480] - Sam Ramji

One hundred years ago, people thought of businesses as production factories: You make a part, you move the part along, you sell a product. But the vast majority of companies today employ most of their people as decision workers, so they're a decision factory.

[00:06:48.720] - Sam Ramji

The really powerful stuff is businesses that have been built like TikTok or Uber or John Deere, which are automating decisions through the models that are being emitted. The quicker that you can update or improve the model... So let's take a look at Spotify's MLOps system, as they come up with the new feature that has maybe a 1% or 2% better conversion or affinity than the prior model that they had.

[00:07:12.260] - Sam Ramji

They want to get that into production as fast as possible, because if they can create a better conversion, better utilization, their whole business gets better because they have hundreds of millions of users. That's the second piece; it's incredibly valuable because of the scale at which most businesses are operating.

[00:07:26.750] - Sam Ramji

Then, the third piece is that it's disconnected currently from the full lifecycle that we see that's needed to make companies really effective. We've got DevOps, and that's running happily at speed with a lot of maturity for microservices, back-ending all these apps that are getting powered.

[00:07:46.450] - Sam Ramji

Then you've got data engineering, which is pulling data unwillingly out of microservices. You basically have to go and poke a hole, pull out the data, ask the microservices team for help, or just take it right out of the infrastructure. So data engineering is a really hard job because a lot of the feeds break.

[00:08:03.430] - Sam Ramji

But once you've got that from data engineering, you're taking it in, ideally, to MLOps and giving it to data scientists in a way that can be really useful. But you're not done yet, because when the model's in production, you need what often is called ModelOps, because models break in production in ways that software doesn't because the two paradigms are so different.

[00:08:24.220] - Sam Ramji

But the final piece is that we haven't really linked ModelOps all the way back to DevOps. So if you think about this at an enterprise level or as an organization level, what's the latency between your ability to create a new insight that gets embodied in an automated decision model and get that into the user experience? In most organizations, that's really long.

[00:08:45.500] - Sam Ramji

So the apotheosis of MLOps will be—let's say, five, six, seven years from now—it all gets really coherent as it's no longer early, as the tools support rewinding. We'll end up checking out our total cycle time across the entire company, from new model to user experience, and we'll try to get down to, like, a day. Currently, that's probably 3-4 months on average. So that's where we're going with the MLOps world.

[00:09:13.800] - Simba Khadder

I had never really thought of the Git-versus-Subversion point. I never really thought of the fact that the way Git is designed is part of what allowed... I'm sure there's a lot of factors, like the cloud became a thing. There's a lot of things that came together around the same time, but it is interesting that it was almost like you needed the right abstractions to be able to build on.

[00:09:33.350] - Simba Khadder

So a lot of DevOps was an abstraction problem. What is the right abstraction? How do we think about code changes? If we think of them as commits, and we think of it as this tree-like thing, like it's merge-in, and we have this whatever, then once we have that, we can build this idea of each commit can have a build associated with it.

[00:09:55.200] - Sam Ramji

Yeah, atomicity.

[00:09:56.240] - Simba Khadder

Yeah, exactly, and I think it's interesting with MLOps... I think of DevOps. You mentioned this; at Google, you can look at Borg, and you can look at all these things that they've been doing forever, and you think, "Wow, even before DevOps was a term, Google knew how to do DevOps." Google was great at it.

[00:10:13.880] - Simba Khadder

In my experience, and I'm curious if you have seen otherwise, but I have yet to meet a company where I saw that they're doing MLOps and I was like, "Wow, this is what everyone should be doing; it's perfect." Whereas with DevOps, I feel like there were a few companies that really had it nailed down, and they were just sharing their learnings with the world.

[00:10:29.590] - Simba Khadder

I think in MLOps, part of why it's been a little bit chaotic in the early days is that so many different companies have had such different approaches, starting from different places, with very little information sharing. So we just have completely different views of the world; there isn't really like, "Hey, this is a gold standard," and, "What's the gold standard, but generic?" There is no gold standard, so we're all just figuring it out as we go along.

[00:10:52.400] - Simba Khadder

Do you see that, too?

[00:10:53.560] - Sam Ramji

I do, and I think it's a much larger and much more complicated space, and I think it'll probably take us a lot longer to sort out. Here's what I mean: DevOps is really about computation, and computation is big, but it's bounded. MLOps is about cognition, and cognition is actually kind of unbounded.

[00:11:11.570] - Sam Ramji

There's no limit to the number of new ideas you can have, new ways that you can analyze data, new ways that you can get the value, new things that you could choose to predict. Think about, "What is a feature?" A complex feature is part of a multidimensional space that you can see patterns in through the power of whatever you're using: whatever Bayesian model, or something that's a generative adversarial network or a DNN or who knows what.

[00:11:37.390] - Sam Ramji

You can suddenly get some kind of predictive power off of previously uninspected or uncorrelated dimensions. You can keep doing that for a long time. So the space of what we're doing with ML is cognition, and it's much bigger than computation.

[00:11:52.380] - Sam Ramji

I'll make one path dependency point that you alluded to about DevOps. Yes, Git, absolutely essential, because it changes the style of play of the development team. Subversion and CVS: They made branching easy. Very different problem to say, "Can I branch?"

[00:12:09.360] - Sam Ramji

Git was all about making merging possible. Merging was really hard, so this ability to merge quickly means that you can have a lot more change, you can have a lot more different people contributing, and then you can bring it in.

[00:12:20.310] - Sam Ramji

You mentioned Borg. In about 2007, 2008, Google contributed the core Linux capabilities that led to containers. Then you've got Solomon Hykes getting excited about that, coming over from France in 2009 and creating what would eventually become Docker.

[00:12:36.200] - Sam Ramji

Then you've got Kubernetes picking up the ability to go and take Docker into deployment. And all along the way, you've also got technologies like Jenkins. So between your ability to contribute code through a Git model, your ability to package the artifact into a container, your ability to have a deployment pipeline through Jenkins, and then using something like Spinnaker or Kubernetes to be able to get those into production environments—there's a long path.

[00:13:00.890] - Sam Ramji

It took a lot of people a lot of time to practice a lot of things in a lot of messy ways, to make what we now have today as something pretty clean that you can almost buy off the shelf, as it were. You can kick up a cloud service; any reasonable cloud service is going to give you a CI/CD environment that won't be terrible, and the practices of the DevOps community are well-documented. There's lots of YouTube videos; there's plenty of documentation. You can get into it really quick.

[00:13:24.910] - Sam Ramji

So I think the path ahead for MLOps is going to be really exciting. I think there's going to be a lot of activity and a lot of money in it, and there's going to be a lot of change over the next few years as this community of data scientists and data engineers and software engineers all come at business-critical problems and try to figure it out really fast. So I'm pretty stoked to be an observer on the way.

[00:13:44.620] - Simba Khadder

What do you see as... The DevOps-versus-MLOps analogy comes up a lot. We've heard it a lot. I think a lot of people remembered early DevOps stages, but you were a key player in it; you were a lot of Kubernetes. So I guess my question is... I have a lot of places I want to go, but firstly, what do you see as key similarities in the goal of DevOps and MLOps? How are they fundamentally similar?

[00:14:09.860] - Sam Ramji

Fundamentally, it's about cycle time reduction and predictability. Both of these things come out of the whole school of the Toyota Production System developed post-World War II, W. Edwards Deming, and that whole concept that became called lean manufacturing. And when you start taking lean manufacturing ideas, and you airlift them into software engineering, you end up with DevOps. And as you take those same ideas, and you bring them into machine learning engineering, you'll end up with MLOps.

[00:14:38.830] - Sam Ramji

So I think that's a core similarity. You're trying to eliminate waste, and we measure waste by saying anything that doesn't actually produce value is waste: Muda, in the TPS terminology. Then, you try to bring that down over time. So you've got concepts like Kaizen and Kaikaku. Kaizen is your daily improvement; you can get 1% improvement per day, you get 1000% in a year.

[00:14:59.250] - Sam Ramji

Kaikaku is your breaking improvements, where you suddenly have an insight. You say, "We need to change the architecture; it's going to be really expensive, but we could get 1000% improvement in one move."

[00:15:07.730] - Sam Ramji

Then you've got the core, which is takt, T-A-K-T: the cycle time. You're always measuring your end-to-end cycle time and trying to bring that down, as long as you can do it with that same very low error rate. That's the core of anything that has "Ops" in it. We should be explicitly saying, "We're bringing the spirit of lean manufacturing into the work we're doing."
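As a rough illustration of what measuring end-to-end cycle time could look like, here is a minimal Python sketch. The stage names and timestamps are hypothetical and not taken from any system discussed here; the only idea shown is "time from the first recorded stage to the last" for a single change.

```python
from datetime import datetime


def cycle_time_hours(stage_times):
    """End-to-end cycle time in hours: latest stage timestamp minus earliest."""
    ordered = sorted(stage_times.values())
    return (ordered[-1] - ordered[0]).total_seconds() / 3600


# Hypothetical timestamps for one model change moving through a pipeline.
change = {
    "model_trained": datetime(2022, 9, 1, 9, 0),
    "validated": datetime(2022, 9, 1, 15, 0),
    "deployed": datetime(2022, 9, 2, 9, 0),
}

print(cycle_time_hours(change))  # hours from training to deployment
```

Tracking this number per change over time, alongside the error rate, is one simple way to see whether the process is actually getting faster.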

[00:15:29.960] - Simba Khadder

I really like that. One thing that came up in a recent chat with someone from LinkedIn was this difference between... He was saying "MLOps" and "ML infrastructure," and he was treating them as different concepts. I don't know if he was even... I think, just in his head, that's how he did it, but that was unique.

[00:15:45.680] - Simba Khadder

I think the same way about it because, like you said, Ops is an iteration problem. Infrastructure can help; you might need the infrastructure to allow you to do such things. The Toyota example: It's like, "we need specific machines to be able to build a car," and if those machines get way better, then you can do it better. But that's not really the Ops problem.

[00:16:07.370] - Simba Khadder

The Ops problem is more, "How do we get all of us to work together? How do we structure this organization and workflow so that we can be as productive as possible?"

[00:16:16.940] - Sam Ramji

Absolutely. It's all about the workflow, and it's about the team being in harmony. One of the beautiful things about the Toyota Production System is that every single worker had a cord that they could pull that would stop the entire line, so any cyclical defect would be analyzed immediately.

[00:16:33.020] - Sam Ramji

The worker's job was to say, "I got a piece that wasn't conformed to standards. Let me stop, because that is a process problem." So the entire factory floor would converge on the station, and everybody would work it out. It could take two seconds, could take two minutes, could take two hours; but they would stop and get that defect out, and then they'd go back to work.

[00:16:52.530] - Sam Ramji

This is that Kaizen process of, "You blame the process, not the people," and that's the core of it. We're all trying to make all of this go faster together. None of us is more valuable than each other. We're all trying to go far together.

[00:17:05.410] - Simba Khadder

I love that. The way you put it, I feel like a lot of that ethos that you're sharing feels like it has been kind of lost when people think of casual. Software is different, obviously, but I like the way you put it because it sounds like a team I'd want to be on; that sounds great. I feel like making things process-oriented, not making them... Every individual has equal power, but it's not about the individuals.

[00:17:33.550] - Simba Khadder

It's not like, "We need the million smartest people." It's more like, "We need people who want to be on this team, who want to work together, who want to trust each other and be able to make the best possible thing via the process."

[00:17:46.820] - Simba Khadder

One thing we say here a lot, or I say to my team a lot, is that some things—let's say, finding product-market fit—you can set as a goal, but it's not something where you're like, "Are we closer or less close?"

[00:17:59.730] - Simba Khadder

It's a binary thing, to an extent, and one conclusion I came to at the last company is that the only thing you can really control is your habits. You just have to set up the right habits that will optimize for getting better. So we just have to trust the process. We can come and change it, but as long as we execute and we come up with a process that we think will get us there, that's the best we can do.

[00:18:23.770] - Sam Ramji

I love it. I think that process orientation is how we've created the civilization we have today. Science is about making things progressively less wrong. We have an idea; it's probably wrong. How can we make it progressively less wrong? I think it's best stated by a quote often attributed to Thomas Edison. I don't know if it's apocryphal or if he actually said it. He said, "I haven't failed; I've simply found 10,000 ways that don't work."

[00:18:50.640] - Simba Khadder

That's awesome. I found a lot of those, as any reasonable startup, I'm sure, has. Let's bring it back to the DevOps things. We talked about how it's about process. It's about giving the workflow to allow it to be, and I think that's the first axiom of what I want to build to of the DevOps-versus-MLOps. The other thing is, why is it different? Why isn't MLOps just a feature of DevOps? Why does it have to be its own category?

[00:19:15.940] - Sam Ramji

This is where I want to take it back to that distinction between computation and cognition. Think about some of the work that was done even in the '80s to figure out how we write better software.

[00:19:26.100] - Sam Ramji

There was this idea that you could have provably correct code, and people came up with techniques like the pi-calculus to be able to do meta-analysis of the code and determine: Are there any loops in it? Is the code going to kill itself? Is it provably right?

[00:19:40.060] - Sam Ramji

That is possible only in something as relatively bounded and tractable as computation. However, think about where we are today as a society. How many bad cognition examples can you think of in the first 10 seconds as I say it?

[00:19:54.950] - Sam Ramji

Just think about what you saw in your social media feed, anything you read in the news. There's a whole bunch of bad cognition. Now, these are not human beings who are really any different from you or me structurally.

[00:20:06.700] - Sam Ramji

If you were to analyze our genome, if you were to analyze the weights of our neural synapses, of how we parse visual information, you would not be able to find any basis to say that this person's got really strong cognitive outcomes and this other person's got really flawed cognitive outcomes.

[00:20:24.750] - Sam Ramji

So much of it is arising from data. What are we consuming? How do we determine that the data is correct? What happens when a feed of data that we trusted suddenly goes bad? These are really freakishly hard problems that we're really not conditioned to even have a good philosophical basis for managing.

[00:20:41.960] - Sam Ramji

So there's so much almost-basic research we have to do to say, what does it mean to have an MLOps defect? Was the model wrong? Maybe the model was totally great, but the data skewed. You got a bad feed in the inbound data, and now you're trusting the model in production to feed your dashboard or to feed your content, or people are seeing it directly.

[00:21:01.310] - Sam Ramji

So you don't have a production-down problem, like when you have a computation flaw and something fails and the system collapses; you have something worse. You have a silent failure where you're now putting garbage on the screen that your whole system is extremely confident is correct.

[00:21:16.940] - Sam Ramji

This, I think, is the piece that really freaks me out about MLOps and I see people working on data observability and data monitoring, and there are all these tools that come into it. But the fundamental nature of it is it's not computation, it's cognition.

[00:21:29.300] - Sam Ramji

The way that we think about data requires us to all be a lot more skilled around math. Projects like Great Expectations are great examples of how we can start to embrace a different part of math than we had to when we were writing traditional computational software programs. When we're dealing with systems like models that are coming out of inferential logic, that are learning the patterns of the data, how do we bound the expectations on the inputs?
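To make "bounding the expectations on the inputs" a bit more concrete, here is a minimal plain-Python sketch. It does not use the Great Expectations API; the function name, column names, and bounds are all hypothetical, chosen only to show the idea of checking a batch of inputs against declared ranges before a model consumes it.

```python
def check_expectations(rows, expectations):
    """Return a list of human-readable violations; an empty list means the batch passed."""
    violations = []
    for i, row in enumerate(rows):
        for column, (low, high) in expectations.items():
            value = row.get(column)
            if value is None:
                violations.append(f"row {i}: {column} is missing")
            elif not (low <= value <= high):
                violations.append(f"row {i}: {column}={value} outside [{low}, {high}]")
    return violations


# Hypothetical bounds for two input features.
expectations = {"age": (0, 120), "session_minutes": (0, 1440)}

batch = [
    {"age": 34, "session_minutes": 12},
    {"age": -1, "session_minutes": 30},    # out of range
    {"age": 51, "session_minutes": None},  # missing value
]

for problem in check_expectations(batch, expectations):
    print(problem)
```

A check like this turns a silent failure into a loud one: the bad feed is flagged before the model confidently scores it.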

[00:21:54.290] - Sam Ramji

Those are all new competencies that we have to develop together. The fact that it's new tells me what's different about MLOps.

[00:22:02.170] - Simba Khadder

I love the way you put it. That difference just clicked for me. The idea that it's coming from the data; a philosophical view of it would be... I don't even know. The story goes, let's say someone's in front of a judge. It's like, "Hey, I couldn't do anything about it." If everything is predetermined to an extent, then should I go to jail? Is it my fault? Or was it all the data, all the things that happened to me that forced this?

[00:22:29.830] - Simba Khadder

There's nothing I could do. I was just a victim of circumstance, of the universe. Obviously it's very different. But for a model, it's similar. It's like, this did exactly what... Given how it was wired and the data that went into it, the order it went in, and all these things, it's almost deterministic, but it's deterministic in a way that's almost impossible to understand.

[00:22:49.750] - Simba Khadder

You can't look at the weights and say, "I'll just add 0.1 to that weight and... boom, we're done." That's that key difference. It also makes regulation and things like that hard. It sounds easy and nice, but how do you... The fact that you can't do it raises the whole question: what is correct in the model?

[00:23:06.570] - Simba Khadder

This thing is a perfect model. We dealt with this... I dealt with this at my last company, where we were doing recommender systems. Part of what drew me to recommender systems specifically was that with computer vision, there's a little bit more of a right answer. You can look at a picture and say that's a bird or not a bird.

[00:23:21.400] - Simba Khadder

But with recommender systems, there's no such thing as the perfect recommender system or the perfect recommendation. We would do things like always making sure there were at least two models in production running an A/B test, because if you... You can almost pigeonhole people if you just have one.
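One common way to keep two models live side by side is to assign each user to a variant deterministically. The Python sketch below is an assumption about how such a split might be done, not a description of the system mentioned here; the function name, model names, and user IDs are hypothetical.

```python
import hashlib


def assign_model(user_id, models=("model_a", "model_b")):
    """Deterministically map a user to one of the models running in production."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return models[int(digest, 16) % len(models)]


# The same user always lands on the same variant, so their experience is stable
# while the two models can be compared across the user population.
for uid in ("user-1", "user-2", "user-3"):
    print(uid, assign_model(uid))
```

Hashing the user ID rather than choosing randomly per request means no assignment table has to be stored.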

[00:23:39.590] - Simba Khadder

You have to maintain... The thing you're really aiming for in a recommender system is serendipity, and serendipity isn't something you can really measure. I think that's like you said, with DevOps and with computing and whatever.

[00:23:51.620] - Simba Khadder

Most of the time you can say, "Hey, you can write a unit test." You can't unit test a model. Maybe the cleanest way to imagine it, as someone listening to this, is to think of how you would unit test your recommender system. You're just going to be like, "Well, let's make sure that it includes this recommendation, which is obvious."

[00:24:07.570] - Simba Khadder

If it doesn't, then we should look at it. That's the best you can do in some cases. I love that way of thinking about it. What do you... I guess, do you have anything to add to that?
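The "make sure the obvious recommendation is there" check can be sketched as an ordinary test. Everything below is a hypothetical stand-in: the toy `recommend` function, the catalog, and the tags exist only to give the test something to run against, and a real recommender would be far more complex.

```python
def recommend(user_history, catalog, k=3):
    """Toy stand-in recommender: rank unseen items by tags shared with the user's history."""
    seen_tags = set()
    for item in user_history:
        seen_tags |= catalog[item]
    candidates = [item for item in catalog if item not in user_history]
    # Most shared tags first; ties broken alphabetically so the order is stable.
    candidates.sort(key=lambda item: (-len(catalog[item] & seen_tags), item))
    return candidates[:k]


# Hypothetical catalog: item -> set of tags.
catalog = {
    "guitar": {"music", "strings"},
    "guitar_strings": {"music", "strings", "accessory"},
    "drum_kit": {"music", "percussion"},
    "blender": {"kitchen"},
    "toaster": {"kitchen"},
}


def test_obvious_recommendation_is_present():
    """Someone who bought a guitar should at least be shown guitar strings."""
    recs = recommend(["guitar"], catalog, k=2)
    assert "guitar_strings" in recs, recs


test_obvious_recommendation_is_present()
```

The test asserts only that one obvious item is present. It says nothing about whether the rest of the list is good, which is exactly the limitation being described.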

[00:24:17.120] - Sam Ramji

I would just say it's another sign of MLOps being pretty early: our best solution to this is really to have a human in the loop. We've gotten to the point in DevOps where we really don't worry about having humans in the loop.

[00:24:28.840] - Sam Ramji

We have fuzz testers. We've even got AI-powered things like GitHub Copilot. We've got security mitigation systems that will read the code, determine that there's probably a security bug, and automatically send you a PR, which you can accept, or you can put it on autopilot and it just automatically accepts the PR.

[00:24:45.540] - Sam Ramji

It smoke-tests, checks against all the prior versions of the software, makes sure there are no regressions, and puts it into production. You can generally trust these kinds of systems in the computational environment.

[00:24:55.660] - Sam Ramji

The human-in-the-loop piece is a lot like how we used to do web testing 20 years ago. When we pushed live code at Ofoto, where I was director of engineering back in 2001, we'd make sure that we were all sitting using the software, that the product managers were all pushing transactions through the workflow to make sure that we hadn't broken anything, because we didn't have that absolutely bulletproof automated system.

[00:25:19.700] - Sam Ramji

So we do have to have humans looking at the sensibility. We're using cognitive experts, people, to look at the cognitive processes, the models, and make sure that they're still coming up with things that aren't stupid, wrong, or illegal.

[00:25:34.210] - Simba Khadder

That makes a lot of sense. It's just like, we've made all this stuff. Actually, you're also making me think: one reason why I love distributed systems, and why, for some reason, it felt so obvious to me to go from distributed systems to machine learning, is that there's the same gray area problem.

[00:25:48.300] - Simba Khadder

In a distributed system, you can get to it's correct. [inaudible 00:25:53] is correct. You can prove it. Most things are very... It's almost like a gray area, because this wire could get cut off, or you could lose this network activity, or it can go and come back.

[00:26:04.860] - Simba Khadder

The problem space is way more... To me, it was way more interesting because it felt more alive, in the same way that a model... Not alive as in aware, but alive in that it's dynamic, very dynamic. It just can't really be perfect. You'll always find some really crazy failure condition that will happen at the worst possible time.

[00:26:24.930] - Simba Khadder

I think US-East-2 just went down yesterday, from when we're having this chat, and all kinds of stuff went down. These are companies that know DevOps and do it well. But there's always something. There's always a big enough wrench that's going to take out the engine.

[00:26:39.160] - Sam Ramji

Sometimes it's the small enough wrench. Sometimes it's a component that is working as specified. There's so much beauty in complexity. I think each of us as a human is attracted to living things. What you're saying about distributed systems is that they start to behave like living things.

[00:26:55.910] - Sam Ramji

They don't have to be intelligent. They don't have to be conscious, but just being alive is fascinating. You can watch the feeds, you can visualize the system, but you still have these unexpected consequences.

[00:27:06.650] - Sam Ramji

One of the biggest outages in Google's history, as I understand it (I wasn't there at the time), happened in 2009. What had happened was, a common component that was pretty useful for a few different things, called Stubby, S-T-U-B-B-Y, operated within its predicted bounds.

[00:27:22.800] - Sam Ramji

It promised an SLO, I think, of like four and a half nines. So 99.995% uptime. But it was actually such an outrageously reliable component that it typically behaved at about six nines. It turned out a whole bunch of engineers had just taken for granted that this thing was never, ever wrong. Nothing would ever go down that was in it.

[00:27:45.170] - Sam Ramji

They'd taken it as an unconsidered dependency. So it was in all these systems that it probably shouldn't have been. They coded it in because it just never caused a problem. It wasn't that anybody made a bad decision, but it was just like, there's air that you can breathe, the sun rises in the morning, and Stubby is up.

[00:28:01.540] - Sam Ramji

Stubby fell to four and a half nines. The cascading failures around Google were apparently spectacular. As part of a new chaos testing system, you could call it, they started inducing faults into Stubby, even though it didn't need to fail, to periodically reduce it to its SLO rather than overperforming.

[00:28:21.090] - Sam Ramji

That ended up helping them in an ongoing way, flush out these kinds of issues. But it's exactly that just bizarre thing. Like, "Oh, this thing was working too well." So the system went down.
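To make the arithmetic in this story concrete, here is a minimal sketch of how an availability target translates into a downtime budget, and how much downtime a team might deliberately inject to pull an over-performing service back down to its SLO. The function names and the 30-day window are illustrative assumptions, not a description of Google's actual tooling.

```python
# Hypothetical helpers for SLO arithmetic; names and the 30-day window
# are assumptions made for illustration only.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct, window_minutes=MINUTES_PER_30_DAYS):
    """Downtime implied by an availability percentage over the window."""
    return window_minutes * (1 - availability_pct / 100)


def downtime_to_inject(slo_pct, observed_pct, window_minutes=MINUTES_PER_30_DAYS):
    """Extra downtime needed to bring observed availability down to the SLO.

    Returns 0.0 when the service is already at or below its SLO.
    """
    budget = allowed_downtime_minutes(slo_pct, window_minutes)
    used = allowed_downtime_minutes(observed_pct, window_minutes)
    return max(0.0, budget - used)


if __name__ == "__main__":
    # Four and a half nines promised, roughly six nines observed.
    print(allowed_downtime_minutes(99.995))
    print(downtime_to_inject(99.995, 99.9999))
```

Under these assumptions, four and a half nines allows roughly 2.16 minutes of downtime in 30 days, while six nines uses only about 2.6 seconds of it, so nearly the whole budget is left over. That gap is what dependents quietly come to rely on.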

[00:28:31.630] - Simba Khadder

I'm sure that if we were to do this again in 10 years and talk about models, we would have the exact same story. This model was always right. It was such an easy machine learning problem. We just never assumed that this thing would be wrong.

[00:28:43.930] - Simba Khadder

Then, yeah, we gave a completely wrong financial projection to our board because this thing was completely wrong and it turns out that if you give it this exact number, it just happens to cause this crazy number to pop out.

[00:28:56.570] - Sam Ramji

So if you think about where ML is being deployed, you could think about that being a very practical problem people are having today. It's not crazy to put ML into loan approvals. That's what's happening right now with microloans and with some of the fintech companies that are breaking out of the old banking industry.

[00:29:14.300] - Sam Ramji

You're like, "Hey, you don't have to wait three days for loan approval, we'll give it to you in 30 seconds." How is that happening? That's almost certainly a model that is doing that. There's not somebody who's clicking, yes, and thinking through this every 30 seconds. That just ain't so.

[00:29:27.810] - Sam Ramji

But what if all of a sudden, over a period of as little as an hour, it just said yes to a bunch of loans that it should have said no to? How would you know? What are the consequences? How would you walk it back? What's the auditability? What's the accountability? What's the provability of where the data went bad?

[00:29:43.350] - Sam Ramji

We will have this all solved in five or 10 years. But right now, I think it's a really exciting cutting edge of real problems that frankly when I studied artificial intelligence and cognitive science back in '89 to '94, we never anticipated that it would become mission critical for an enterprise.
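One hedged way to picture the "how would you know?" question is a sliding-window check on the model's approval rate. The sketch below is a minimal illustration, assuming a known baseline approval rate and a fixed tolerance. The class name, parameters, and thresholds are hypothetical, and a real system would need far more than this (segment-level checks, audit logs, human review).

```python
from collections import deque


class ApprovalRateMonitor:
    """Flags when the recent approval rate drifts far from a baseline.

    All names and defaults here are illustrative assumptions.
    """

    def __init__(self, baseline_rate, window_size=200, tolerance=0.15):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        # Keep only the most recent `window_size` decisions.
        self.window = deque(maxlen=window_size)

    def record(self, approved):
        """Record one decision; return True if the window looks anomalous."""
        self.window.append(1 if approved else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough decisions yet to judge
        recent_rate = sum(self.window) / len(self.window)
        return abs(recent_rate - self.baseline_rate) > self.tolerance


if __name__ == "__main__":
    monitor = ApprovalRateMonitor(baseline_rate=0.4, window_size=10, tolerance=0.2)
    normal = [True, False, True, False, False, True, False, False, True, False]
    for decision in normal:
        monitor.record(decision)
    # A sudden run of approvals eventually trips the alarm.
    alarms = [monitor.record(True) for _ in range(10)]
    print(alarms)
```

A check like this only says that something changed, not why. The auditability and accountability questions raised above still need lineage on the data and the model version behind each decision.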

[00:30:03.380] - Simba Khadder

You remind me of... I had a conversation with someone who was... Actually [inaudible 00:30:08] very playful to recommend their stuff. One thing he said, which I know was numb, but just the wording was really interesting, was, "Building recommender systems, this isn't the struggle with search in general."

[00:30:19.980] - Simba Khadder

We've been almost... It's almost a GAN. Except, rather than the generation being done by a model, the generative model is people. The people are the adversarial network, because no matter what you do on the search engine, someone's going to find a way to game it.

[00:30:34.480] - Simba Khadder

The hardest adversarial model to take on is human ingenuity, because there's a whole subset of society whose entire focus is, how do I get top page on Google? They will think of unique things that even a model would never have thought to try.

[00:30:52.590] - Simba Khadder

So it gives me that same feeling, because I'm like, "Oh, I didn't think of that." It's also that, for a lot of these models, for money, people go out of their way to try to game them, even when they shouldn't. So you also have to deal with that, and how to get that.

[00:31:05.220] - Sam Ramji

You almost need data science for data science. I got to work with Benjamin Treynor at Google, who is the father of SRE as a discipline. As I listened to Melody Meckfessel talk about the tools that we built for the Google SREs, they looked more and more to me like data scientist toolkits that were deployed to production problems.

[00:31:23.500] - Sam Ramji

They're looking for outliers. They create these amazing different graphs. They're applying math to the logs to figure out what might go wrong or what had gone wrong. That's data science for engineering.

[00:31:34.730] - Sam Ramji

But to figure out what went wrong with your model, it takes one data scientist to build the model, but it might take another data scientist to look at the operational characteristics and figure out, how would I know that somebody had gamed my loan model and that the predictions were wrong? So there's a lot of interesting stuff ahead like this.
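As a small illustration of "applying math to the logs" to look for outliers, the sketch below uses a median-absolute-deviation (MAD) score, a common robust alternative to a plain z-score. It is only one possible technique, chosen here for brevity. The function name, the 3.5 threshold, and the sample latencies are assumptions, not anything described in the conversation.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    Uses the median absolute deviation (MAD), which is less distorted by
    the outliers themselves than a mean/standard-deviation z-score.
    """
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    # 0.6745 rescales MAD so scores are comparable to standard z-scores.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds pulled from logs.
    latencies_ms = [100, 102, 98, 101, 99, 103, 97, 500]
    print(mad_outliers(latencies_ms))
```

The same shape of check could, in principle, be pointed at a model's operational metrics (score distributions, approval counts per hour) rather than at service latencies.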

[00:31:50.560] - Simba Khadder

I think so as well. I want to take the conversation somewhere a little different. One thing I think a lot about... well, I'll give you a little story from back when I was at an MLOps happy hour thing.

[00:32:00.340] - Simba Khadder

Something that came up a lot was that there are so many startups in the space, and they were talking about that cloud of companies, which is true, there are a lot of companies in the space. But I also remember looking at the martech map and the sales tech map, and I'm like, "Honestly, it's really not that bad."

[00:32:15.210] - Simba Khadder

I'm like, those are very much worse. But I remember with DevOps... DevOps nowadays has a handful of companies that are, like, the DevOps companies. They're all amazing. Well, all of them are public now and they're amazing businesses.

[00:32:27.960] - Simba Khadder

In the early days, it was very similar, I think. It was like a wild, wild west. Kubernetes is an example we can point to as a winner. HashiCorp, as an example, is a winner in that space, like GitLab. There are a lot of examples of projects and companies that won and, obviously, some that lost, but DevOps has played out.

[00:32:47.790] - Simba Khadder

So we can now look back and backtest and see what worked and what didn't and why things happened. Why do you think... it could be about Kubernetes, it can be about any of these companies, you can speak about all the companies... but why did the winners win in DevOps?

[00:32:59.870] - Simba Khadder

Why did it play out the way it did? Why is Kubernetes now, like, the tool? Because there were many other projects in this space.

[00:33:05.910] - Sam Ramji

I think there are two key components. One is what we talked about before on technological path dependency. You could just as well ask, why didn't Docker Swarm win? Docker Swarm was out before Kubernetes.

[00:33:18.140] - Sam Ramji

Why was it Kubernetes? So there's path dependencies that led to Kubernetes being successful, which is that they weren't trying to solve the container problem at the same time. So they had the extra bandwidth to take all of their hard-won internal production practices from Borg, which they'd learned at scale, and do nothing but Kube.

[00:33:35.320] - Sam Ramji

So between a company that's trying to do two things and one that's trying to do one, you can always bet on the company that's going to just do one. So Kube was going to end up being that pure play.

[00:33:44.220] - Sam Ramji

But the deeper answer is it's about the people. So when Ben Treynor explained SRE, he said SRE is what you get when you deploy a software engineer into production operations. Because we don't want to sit at the command line and the ticketing writing desk and keep doing the same thing over and over again.

[00:34:02.930] - Sam Ramji

You do that to a software engineer after their first eight hour day, they'll turn in their resignation. But as software engineers, we are people who would rather spend 40 hours writing a calculator program than doing like four hours of homework.

[00:34:15.280] - Sam Ramji

So it's all about automation. What Ben and the SRE team learned was that you should really make sure that you're creating the programming affordances. That you're creating the clean clear interfaces, that somebody who's a software engineer by mentality and an automation professional by job function is going to love.

[00:34:33.310] - Sam Ramji

That is the thing that is also consistent about other things that people love, like HashiCorp, and is the inverse of things that people hate, like Jenkins. Jenkins is everywhere, but you'll never find anybody who loves it.

[00:34:45.660] - Sam Ramji

The problem there is the affordances in Jenkins are configurations. Configurations are confusing as anything. They don't submit themselves to debuggers. There's no compilation errors. There's just all these accidental problems of complexity that you just have to do by beating your head into a wall.

[00:35:02.690] - Sam Ramji

So it's just like you accelerated the operations problem but made it hurt more. But you could glue everything together and it works. But with Kube and HashiCorp (and I want to really acknowledge how amazing what Mitchell, Armon and Dave have built there is), each individual piece was a lot like a Unix program.

[00:35:21.080] - Sam Ramji

It was highly opinionated. It was a small piece and it was loosely joined. Its affordances were pointed at somebody who had an engineering mindset, solving an operational problem. They didn't force you to adopt eleventy-seven things. They're like, you could just take this one thing. You're probably using eleventy-seven other things, but just take this one thing.

[00:35:39.030] - Sam Ramji

If it's working, great. Then they would build another thing that was not integrated in its technology, but it was integrated in its philosophy. You have another thing that's pure and works really well and is open source and it can go really fast.

[00:35:53.510] - Sam Ramji

I think that sense of clustering around a particular user, an operations problem that's being replaced by automation, which is being written by engineers, that was a really clear market thesis of where this stuff would go. Then figuring out how do I just delight that engineer and give them more and more little pieces that they can bring into their tool belt.

[00:36:16.630] - Sam Ramji

A lot of people in business talk about I want to get share of wallet. But I would say when you're trying to solve these problems, you want share of belt. Is the tool on your belt? Do you pull it out 10 times a day? Do you really like it?

[00:36:28.630] - Sam Ramji

That's what absolutely nailed it. That was what the Kubernetes team got right. It's what HashiCorp got right about enabling those human beings in this moment of great stress and great technological transition.

[00:36:39.870] - Simba Khadder

That's such an amazing way to... A lot of people talk about, find your problem and solve it. What's interesting about HashiCorp is, they don't solve one problem. They solve a set of problems, but they solve a set of problems that the same person would have.

[00:36:52.950] - Sam Ramji

They only tried to ever solve one problem at a time with one tool.

[00:36:56.050] - Simba Khadder

It's super interesting when you talk about HashiCorp and the toolkit idea, because a lot of startup wisdom is, find the problem and solve it really well. With HashiCorp, if you didn't know HashiCorp was an amazing business, and I was like, "Yeah, it's this company, they have nine open source tools and they're loosely tied together, do you want to give them $100 million for their next round?" You'd be like, "No."

[00:37:18.320] - Simba Khadder

That sounds like not a business that's going to be amazing. Obviously, it is. It's because we are not thinking about one problem, we're thinking about one person and the set of problems you would have.

[00:37:28.310] - Sam Ramji

There's so much power in focusing in on a community, one member at a time, where that community is. No community is homogeneous, but there's a solid core. So the point that I would like to drive all of our listeners towards is, product market fit is not abstract.

[00:37:44.170] - Sam Ramji

You might think about it smaller as tool user fit. Does the user love the tool? If so, you create an amazing window for your company. Because let's pair HashiCorp with their logical inverse. So what is HashiCorp? It's a range of tools. They're all open source and they're targeted openly at anybody who's trying to do operational automation in this massive transition to cloud-based distributed systems.

[00:38:11.900] - Sam Ramji

Snowflake has no open source, massive company, super cloud, and does a whole range of data warehousing, business analysis, business intelligence at scale. It's 100% proprietary. It's marketed, it's sold. It's led in a very classic enterprise fashion. But what they have going for them is the system actually works.

[00:38:32.510] - Sam Ramji

The product works. It scales, it's cost effective, and it really satisfies those business analysts that were let down by the prior era's technology. Both companies, the proprietary one and the open source one, were valued at IPO at 52X revenue, which, if you study startups, is mind boggling.

[00:38:51.090] - Sam Ramji

It represents so much confidence on the part of the investment market that those companies have not just good, not just great, not just outstanding, but spectacular growth opportunities. I think we can take a lot of heart in the fact that this open source, community oriented, technical practitioner centric company had one of the best valuations of all time and is performing outstandingly.

[00:39:14.680] - Sam Ramji

So the more we satisfy people, and the more openly we do it, the better longitudinal business expectations we can have. That's why we have the opportunity to build these great companies today.

[00:39:24.910] - Simba Khadder

Snowflake versus HashiCorp. It's interesting that... You're right. Snowflake is the opposite of HashiCorp, but in many ways the IPO stuff is very different. I think for MLOps, a lot of companies are building in the Snowflake style: they're proprietary, they're big, they're platforms. And there are a lot of companies around the HashiCorp style. I think we're doing that; we have that virtual feature store idea... I mean, Featureform and Terraform.

[00:39:50.520] - Simba Khadder

Even there, there's a parallel. I think what's interesting is to think about what those companies' IP is. What really makes that company unique? Snowflake isn't... It's a drop-in replacement. It's like, we have better tech than everyone else. We just work out of the box better. It's still using SQL. You're pretty much doing everything the same way, but this is just a better engine for you.

[00:40:10.710] - Simba Khadder

For HashiCorp, their IP is not... What they build is extremely hard to build, don't get me wrong. But I think the IP is more the abstractions they build and the interfaces they build. Kubernetes too. Obviously Kubernetes is not an easy product to build. But what you mentioned before, the interfaces, that's what won out.

[00:40:29.160] - Simba Khadder

The interfaces come from, sometimes, individuals. It's individuals who solve that problem. You can't throw 500 engineers at it and build a better Kubernetes. You need those few people who just get it and they build the right interface. You can't really scale the interface-building problem.

[00:40:44.090] - Simba Khadder

It's being opinionated where you need to be opinionated and not where you don't. Giving the configurability where it makes sense to and being very opinionated where it doesn't. Also, in DevOps, everyone does it so differently that you don't want a platform. You want the ability to fit it to yourself.

[00:41:00.010] - Simba Khadder

With machine learning, what we see is, if you're doing computer vision, or you're a bank doing fraud detection, or you're a small company also doing fraud detection but you're a 100-person startup, there isn't a platform that can satisfy all three of those people without being Jenkins-style: it's awful to use. Is that a fair takeaway on what you said?

[00:41:17.640] - Sam Ramji

I really think it is. I think, to summarize, I'd say there's something almost magical about the ability to engage a large community with a piece of technology that you can iterate on very rapidly. So if we were to reconceptualize HashiCorp, I think it's not a corporation, it's a community.

[00:41:34.580] - Sam Ramji

So if you said, what is HashiCorp? Who is HashiCorp? It's every HashiCorp employee. It's every piece of their software. It's every user of their software. If you conceptualize it that way, you could strongly say, what's their IP?

[00:41:45.910] - Sam Ramji

No organization on the planet knows cloud operations better than HashiCorp. That's super powerful because the connectedness of all the humans, the cycle time reduction of getting that software out in the open, used repeatedly, it makes the interface better and better, faster and then you end up with this beautiful merge where the user and the tool boundary disappears.

[00:42:09.560] - Simba Khadder

Sam, this has been such an amazing conversation. I feel like we could go into so much more. We might have to pull you back on and have another one of these, but thanks so much for hopping on and having this conversation.

[00:42:18.620] - Sam Ramji

It's such a privilege to talk with you, Simba. I really appreciate it.

‍
