Hi, I'm Simba Khadder and you're listening to the MLOps Weekly podcast. Today I'm going to be speaking with Chris White, the CTO of Prefect. Chris began his journey into data tooling while he was getting his PhD in mathematics at the University of Texas at Austin.
He then became a data science manager at Capital One and joined Prefect as the first employee in 2018. Chris has been a teacher, a researcher, a data scientist, a software engineer, a sales engineer, and a security officer, just to name a few things.
These experiences feed into his leadership at Prefect. He's a fellow surfer, so it was good to run into each other in the data space. Chris, it's so great to have you on the show today.
Yeah, thanks for having me. Happy to be here.
I know I just gave a little brief bio of yourself, but I'd love to hear in your words, what was your journey to get to Prefect today?
So a little long and circuitous, but it starts basically when I was in grad school. I got my PhD in math at UT Austin under Rachel Ward, and the work that we were doing was definitely theoretical. It was pure math, but it was related to signal processing, compressed sensing, and optimization.
It was right around the time when a lot of people were leaving academia to go into finance and machine learning. Of course, I got a little interested, and our work was very related to a lot of the research going into that. Started coding a little bit more, getting in the weeds there, and then doing some consulting for various...
Did some algorithms consulting, did some actual just managing and munging data. Just got interested in that. One of the things that I struggled with a lot when I was finishing up my PhD was whether to go industry or academia. That's the big question.
I do really love math. I really love math research. I loved it all, but I did find myself gravitating towards just fire hoses of problems more than big picture research vision. Industry, especially at that time, was producing problems daily.
I made the switch and I went to Capital One as a data scientist. It was right when Capital One was doing their big cloud migration from on-prem servers to AWS. Also got a little lucky to be a part of all that. I got to see both worlds, the old world and the new world, and what were the pain points of doing that migration.
On day one, I was writing white papers, building predictive models, justifying those models to regulators, all of that fun, pure data science. But because of this transition that was happening, I ended up helping my team out a lot by just writing lots of tooling for these things. They were on new platforms they didn't fully understand, so I helped there.
I was teaching courses about some of this stuff, which is just a really cool thing that Capital One lets people do. Anyway, long story short, I got deeper and deeper into actually just writing software that helped unlock efficiencies in data teams, whether that's data science or business analysts or data engineers.
By the time I left Capital One, I was on a team that was building a platform that connected data scientists and the finished models that they would build to business analysts, who would then interact with those models with a Python SDK that we built on a platform that we maintained.
You can see the early seeds of the user experience that we care about at Prefect: making sure that highly technical users can get what they need out of it, but also making sure that people who aren't classically considered super technical can interact with it.
Anyways, during that time, I got really involved in open source in the Python world, Dask in particular. That is how I ended up meeting Jeremiah, who's the founder and CEO of Prefect. He hadn't started the company yet; he was just playing around with this idea. He had been on the Airflow PMC, and at the time he was experimenting with this.
He had built this tool for himself because Airflow just wasn't meeting his needs. It wasn't scalable enough. The scheduler just fell over all the time. He needed a lot more ad hoc parametrized interactions with his workflows. That's a hallmark, I think, of data science: it's not really a batch thing.
It's just something happens. It could just be me wanting to run an experiment, and I want to see the output, throwing some parameters in. He had come across my Dask work. He wanted to make sure Dask was a first class citizen.
People could write highly scalable workflows. So we started pairing together a little bit. Next thing you know, he's like, Hey, I'm starting a company and I'd love for you to join as the founding engineer, and I made the switch. It was right when I was moving to California, too. Just a ton of things were changing.
That's always a little scary, but definitely never looked back. It was an awesome decision. It's been a fun, wild ride ever since.
That's awesome. It's so interesting to me to see. There's a handful of companies, Capital One being one of them, that are not ones people think of as really tech forward. But I know, and you obviously know, some of these companies are really good at this stuff. They've been doing it forever.
They have a lot of resources invested in doing data science and machine learning well. Arguably, their problem space is even harder. You said regulators; I know what goes into that. We have users where we have to deal with that stuff, and it's not easy.
I have a theory about Capital One, actually. One of those things that surprises people: Capital One is still founder led. Rich Fairbank was the founder of the company and is still the CEO. I think that really translates to the culture that they have, where he still just cares a lot about what he's doing. It's not just a CEO job to him.
I didn't know that either. I know they're relatively new compared to the banks, obviously. JP Morgan is a strong one. But yeah, that makes sense. I think this is true of a lot of companies. I learned recently about Intuit's founder, the CEO.
Interesting. Yeah, and I hear Argo came out of Intuit, right?
There's a lot of these companies that... I guess in that case, the CEO is also technical. It's not what you'd expect, but it does dramatically shift culture when leadership remembers those days and remembers building it up. That's crazy. That's cool.
What do you think the biggest shift has been from being in the belly of the machine to, in a couple of months, being the first employee at a small startup, which obviously is not much bigger?
Yeah. You learn a lot quicker just because you have to. There's just a lot of stuff happening and you have to move quickly. I think probably the biggest thing, though, is the ambiguity that you have to face day to day in a startup. Even if you're an engineer and not necessarily in leadership, there's an infinite number of things you could probably do, and all of them are not very well specified.
They all require a little bit of intuition and just navigating that, whether it's prioritizing something, picking the right design for something, whether or not to introduce a feature that a user requests, because you don't know if that shifts what your product focus is.
All of that stuff is not stuff you really have to deal with when you're in a larger company. All of the initiatives get handed down to you. Maybe your team, you get obviously some creativity, but with this, it's like the world is your oyster. You can technically do anything.
It's a really good point. I remember because I've been a founder... This is my second company. I've been a founder for longer than I haven't been a founder in terms of my career. I'm still used to ambiguity, but it almost took a while for me to understand why you wouldn't be, why that's hard.
I had to actually learn how to think from the perspective of someone who doesn't crave that. When I see it, I'm like, cool, that's opportunity. I'm going to run that way.
But it's definitely a...
It stresses some people out.
Yeah, it does. I think it stresses everyone out until you get used to it and you start to think differently. I'm very North Star oriented. What are we trying to do here? I frame every problem from that position. If we don't know what to do, well, what seems like it's moving us toward our North Star? Let's just do that, and then we can figure it out later if we're wrong.
I think that part too is really key. I always like to talk about this as being comfortable just being wrong and knowing that when you're making a decision, it doesn't have to be completely right. It just has to be directionally correct.
That's usually enough so that you can make some progress, iterate on it later, figure out if it was wrong, retrace your steps. You just have to get comfortable with that. One of the things I always tell our engineers, and it's in our tech standards: just saying "I don't know" is important. Say it all the time. It's okay.
I think it's funny, even you saying that, because I'm so used to, as a founder, being wrong most of the time. It's something you're completely used to. It's almost like when you're right too many times, I've had this happen where I'm like, everything's going well, and that's freaking me out.
Right. This can't be true.
Yeah. It's like the house is too quiet. Where are the kids? Where's the dog?
Right. There's no birds outside. What's going on?
Yeah, exactly. Let's really get back into orchestration. You mentioned Jeremiah was already working on Airflow. Orchestration is not a new problem. Arguably, the most used orchestrator in the world is cron.
Maybe first, could you describe the problem space and also maybe define why it is hard? Why are there so many orchestrators? Obviously, the problem is hard to solve, but I don't think it's obvious to everyone why.
Yeah. It's tricky. Orchestration, just as a word, I don't think actually defines the space at all. Easy example there: Kubernetes is a container orchestrator. But Kubernetes is not competitive with Prefect. We run everything on Kubernetes. Our users do, too. It's great.
Orchestration is just about managing the life cycle and dependencies of some unit of compute. Kubernetes cares a lot about compute resources and container runtimes: scheduling those from a resource perspective, restarting them if something is going on, checking in on their health in various ways, and giving you ways to configure all of that.
I think that is one of the things that makes the space challenging: defining your world and how you're going to approach it is really hard because you could theoretically do anything.
I think another part of orchestration, and one of the reasons I don't always think of cron exactly as an orchestrator, although I know people do, is that especially nowadays, it's a lot about gluing together dependencies between lots of systems and getting observability into all of those interactions.
Cron only knows about, basically, a single CLI entry point, and it doesn't know anything about dependencies or anything else; that's all baked into your code. It doesn't help you remove any code related to that.
I think an orchestrator does somehow help you remove trigger logic for when this thing should run, makes that really easy, helps you manage the fact that this should always run after this, lets you see that dependency enforced in a dashboard somewhere, and then just glues together all of these different runtime environments or tools or APIs or webhooks, whatever it is that you happen to be gluing together.
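[Editor's note] To make that concrete, here is a minimal, hypothetical sketch (plain Python, not Prefect's actual API; all names invented) of the kind of dependency plumbing an orchestrator takes off your hands: each task declares what it runs after, and a runner executes them in dependency order.

```python
# Sketch of dependency-ordered execution: the glue code an orchestrator
# lets you stop writing yourself. Illustrative only.
from graphlib import TopologicalSorter

def run_workflow(tasks, upstream):
    """tasks: name -> callable; upstream: name -> list of names it runs after."""
    results = {}
    # static_order() yields each task only after all of its upstream tasks.
    for name in TopologicalSorter(upstream).static_order():
        results[name] = tasks[name]()
        print(f"{name} finished")
    return results

tasks = {
    "extract": lambda: [1, 2, 3],
    "transform": lambda: "transformed",
    "load": lambda: "loaded",
}
upstream = {"extract": [], "transform": ["extract"], "load": ["transform"]}
results = run_workflow(tasks, upstream)
```

A real orchestrator layers scheduling, retries, and a dashboard on top of exactly this ordering guarantee.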
That's where orchestrators live. But like I said, it's just such a huge space, and that's what makes it hard. I think the thing that people get wrong a lot of the time in orchestration, and one of the reasons I think there's so many tools in this space right now, is they misidentify the output of a workflow as the thing being orchestrated.
For example, I see people call themselves a data orchestrator. If you look at it, the only thing they're governing is some piece of Python logic. That's not orchestrating data, because you don't have full control over that data. It's just an output of the workflow. That can be really misleading to users, because they probably get more confidence than they should in what the tool is actually doing.
I think that's one of the things that we just really lean into. If you go read anything that we produce, we always talk about how open ended we try to keep the system. It's really about orchestrating Python runtimes, functions, scripts, containers, even. It just so happens that Python is one of the most popular tools for everything involved with data.
Obviously, a lot of our hooks and everything are geared towards data practitioners, data engineers and data scientists, et cetera. But we still, at the end of the day, think of ourselves as a Python orchestrator, and that's our universe.
I really like that you mentioned lifecycle. I think that's a key point here. We could be considered an orchestrator. Some people are like, Oh, you're not an orchestrator. I mean, if it helps your mental model, we're an orchestrator plus metadata, specifically focused on the feature engineering lifecycle. That's the crucial part.
It's more that you choose your lifecycle, and like you said, if you pick your lifecycle, there are things that you know to be true. In our world, we know that features are used for training and for inference. We can shortcut and abstract away some of the generic logic of making that happen. That's where we become more valuable. I think that's what you're getting at. Is that fair? Is that what you're saying?
Yeah, it's 100% fair. Then when you do that, when you're in, for example, the feature engineering space, it's important to connect it to the fact that it gets used in these models, or whatever the case may be. But you wouldn't call yourself a model orchestrator, because that's where your governance crosses some boundary.
All you can do now is observe what happens afterwards. Another way that I like to think about it is: what code does that tool let you avoid writing? That's another reason I think data orchestration, when I see some people calling themselves that, doesn't quite make sense, because you're still writing just as much code about your data.
It's your Python runtime that might get cleaned up a little bit. You're not really orchestrating data.
Just to be super clear in the Prefect case, what is the code that they wouldn't be writing as a user of Prefect?
Yeah. It's anything at all related to operationalizing and tracking that code over time. Starting with where does the code even live? It's in this GitHub repository on this branch. That's the thing I care about. I want to schedule it to run at these times.
With each of those runs, I want these parameter values to get passed. I want to track those values and that configuration that got passed. I want it to sometimes run maybe in this remote cluster, sometimes on my laptop, and I want to know which runs ran in those different places.
I want to make sure that it's more resilient than maybe you could have written. For example, just restarting it if the process dies or if Kubernetes [inaudible 00:15:12] a pod or something like that. For those things, you have to write something bigger than Kubernetes to tap into.
Our agent will manage that for you. Then I think a big piece of it is all of the observability that comes with this. You have code, you want to put it in the world, and you want to confirm that it's doing the things that you expect it to do within the time frames or SLAs that you care about.
That can get tricky really quickly. All of the logs you have to ingest, you have to go search them, correlate them with errors, look at tracebacks, just get alerts if schedules are delayed. All of that stuff is something I always talk about as emergent complexity.
Each individual slice of that feels easy, but after four months of building that system, all of a sudden you get all these feature requests and things, and all you're doing now is building an orchestration system. You're not doing data engineering anymore.
Yeah. You all decided to go open source, but it also solves the problem of, well, if you have something we don't do, you can always contribute it. But chances are we do it.
Like you said, you're not a data orchestrator, you're a Python runtime orchestrator. You mentioned one of the key differences in thinking of those two things is observability. What are the key differences? What does thinking of yourself as a Python runtime orchestrator versus a data orchestrator allow you to do?
I don't think there is a true... I think maybe dbt is the closest thing to a data orchestrator, where it actually helps you track these dependent SQL statements that you're writing and how all of the pieces fit together. But data...
This gets, I guess, into a lot of different things. I don't even think data orchestration is really a thing that anyone is doing at all right now. I think it's a really hard problem. In order to really do it, you have to gain some amount of control over data sources, sinks, and all of the different databases that people are using.
That's just a hard platform problem to solve and to get people to buy into. What focusing on being just a Python function orchestrator allows us to do is first and foremost satisfy a million different use cases.
We have people who use our product for everything from just replacing Airflow, standard ETL, batch jobs, just scheduling that stuff, to data scientists doing MLOps types of work, where they're doing some basic experiment tracking and lots of really ad hoc job runs.
Totally different paradigm: running in these really large, oftentimes, Dask clusters and increasingly Ray clusters and managing that. That's just a scale that a lot of other data orchestrators don't really satisfy, because data doesn't typically break down into 200,000 units the way a data science job might.
Then we also have people who use us for lots of random other things. We have some people who have used us for CI/CD. Why not? We manage dependencies between jobs and alert you on failure. That's a lot of what CI/CD is. We have other people who...
We have one customer who set us up as this automated onboarding workflow system for employees. I don't totally understand it. I haven't seen the code, but it's just a business use case. It's not anything else, and that's okay.
Bread and butter is definitely data engineering and data science, but focusing on just a generic Python structure gives us the opportunity to satisfy all of these other things, which is really cool.
That's really interesting. I think that really highlights it. I think it's mostly the last one, the onboarding tool. That's not in any way a data problem. It's truly just an orchestration problem. It's just defining the workflow.
On that point, I think to separate us a little bit more: a data orchestrator, most likely, is going to make an assumption that all of your tasks, or whatever they call them, have outputs. That already puts you in a different world, whereas this onboarding tool, I'm not totally sure it has outputs.
I think it's just a check mark that the run is associated with this person, and you go through a sequence of tasks, but they don't necessarily have outputs. They just succeed or they don't.
Interesting. That makes a lot of sense. I know talking to you, [inaudible 00:19:32], I know it's more of a hack-together way to do it that people have been doing for a while. Airflow is... I know it's very big. It's not super old, but it's not new.
It's pretty old. It's almost a decade old at this point.
I guess my question is, why now with Prefect? Did something change, or did it just take this long until someone actually built a better system? Or did something happen recently that you think drove the need for something like Prefect?
That's a really good question. I don't think that there's a really specific event, necessarily. I think one of the big challenges that we observed, at least when we started Prefect, was that two different worlds were happening.
There's a data engineering world using Airflow, and then there's the data scientist world that had just tons of custom stuff around all of their model builds and how they tracked all of that. Bridging that gap so that both teams, if they were on separate teams, maybe could speak the same language and talk about the tool in a common way.
Then actually share code with each other that connects those different pipelines in a meaningful way. I would just say that need was one of the motivating needs that we started with when we were building these things. Then also letting people manage their compute environment more than Airflow does.
I'm trying to think of an example. Just where your tasks run and who has the authority to manage those dependencies is something that Airflow takes a very firm ownership approach over, which means the scheduler becomes the bottleneck for anything that you want to do.
Whereas in our world, we will let you configure things called task runners that will say, okay, that now can manage the dependency. Our scheduler doesn't need to get involved, and then that's how you can start to scale up.
You say, Oh, I have a Dask cluster. I want to run 100,000 things on it, and I want to see each one of those things in the UI. Easy to do. We hand over control to Dask and just make sure everything's coming home. That scale, and bridging those worlds, is I guess my short answer.
I wonder if that bridge... The bridge is interesting because there is also a new rise, obviously, in MLOps, and that's driven by ML going outside of just a research group in the corner and -
It's maturing into a real thing.
Yeah. Now every team has a data scientist. It's not a weird thing to have a data scientist on a product team. I guess that gets me to another question. I've seen Prefect show up on some of those MLOps clouds. I've seen you on DataOps clouds. Maybe I'll see you on the HR cloud one of these days, based on that use case.
I know you don't necessarily fit directly into one, but I'd love to hear how you think about the categorization of Prefect.
I genuinely think that we fit into all of those categories, in the sense that if a user is using Prefect in that context to solve a problem, then in a very real sense, we do become an MLOps solution.
There are ways that you can use us in that way, and we support it. I think there's a future state in which we do create a little bit more specialization, so that the dashboards become a little bit more meaningful and you don't have to set up so much metadata.
Just going back to the Python thing, I joke that we're a PythonOps tool. Because people write machine learning in Python a lot, we become an MLOps tool; because people manage data pipelines, we become the DataOps tool, etc.
I think on those two categories, there's something that, especially when we set up this conversation, I was thinking a lot about: MLOps versus DataOps. What do those two categories even mean?
MLOps, I think, is relatively clear. It is all about managing the lifecycle of a model build and deployment. From experimentation, ingestion of data, tracking all of your training, to deploying it somewhere that it can actually be used, and then making sure that your inputs aren't drifting over time.
It's a very well defined world. It's a lot of stuff, of course, and it's a closed world in the sense that you own the inputs and can manage everything about it. Whereas DataOps is such a huge category, and it's a completely open system. What I mean by that is, the data that comes into your universe, where it starts out, you can't control that at all. It's completely out of your control.
Then how people interact: when you're in a machine learning world, you have a finite list of tools and places you're going to run stuff. You might need GPUs or something like that. You're always running over here in that place where people can track costs.
With DataOps, there's a million and one different ways to, for example, get access to a database, run a query, get the output of that query, and do stuff with it. You might not go through your platform and instead just connect directly to your database, download the results as a CSV, and now they're outside of the scope of anything.
It's such an open system that I think DataOps becomes a really hard category to both define and ever have a tool that owns it in any way. Whereas MLOps obviously has an explosion of tools in it.
It's funny. I have a very similar view. I actually think that the accepted view, the more common view... Well, there's two common views I see. One is that they're two separate things. The other is that MLOps is actually a subset of DataOps. But I had Stefan on the show a little while back now, and he was like, DataOps is a subset of MLOps.
I think it was a little bit of just exaggerating.
Just for a take, yeah.
But I think what you're saying is probably closer to true, and I think what he was getting at, going back to the lifecycle comment you made, is that there is an ML lifecycle versus a data lifecycle. But it's really dependent on use case.
It almost feels like it should be AnalyticsOps and MLOps and that kind of thing. DataOps is a more generic layer, but it also is so generic that it's almost not really a defined problem. It's just generic plumbing that maybe MLOps tools and AnalyticsOps tools should use.
Yes, exactly. One of the things that we're doing to try to... I don't want to say we're going into DataOps, but people already think that we're in DataOps. To try to address this open endedness of that system, we have this concept of a spectrum; the coordination spectrum is what we call it.
It goes from orchestration, where you actually own critical state of a workflow (maybe it's its schedule, maybe the outputs of it, whatever the case, the code itself), to the observability side, where you're very much a passive consumer of information and you just organize that information in intuitive, interesting ways.
Then you can connect these two things. You can say, whenever I observe or don't observe these things, then do this stuff, which then puts you on the orchestration side. The reason I'm talking about that is one of our goals here is for people to have basically endpoints where they can send information, like time series of things that are happening in their data stack, and be able to see it side by side with a lot of the running managed processes in the orchestration layer.
We've designed the schema and everything to be very open ended. You can just, for example, say, this query is being run by this person on this database in this table. You could just fire hose that stuff at us, and we will then put together a world view of, Oh, so you have this database, and it has these tables, and these users interact with those tables.
Here it all is for you. Maybe one of your workflows starts failing, and then you can see, Oh, this user ran a really long running query right before that workflow failed. Let me reach out to them and see if they did something weird, or let me go look at the query, or whatever. The Observability API is what we're calling it. We're trying to just help people organize a lot of this open endedness in one centralized place.
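[Editor's note] As an illustration only (not Prefect's actual schema or API; every field name here is invented), here is a tiny sketch of what folding that fire hose of query events into a world view could look like: each event names a user, database, and table, and we aggregate them into a nested view you could inspect next to your workflow runs.

```python
# Hypothetical sketch: aggregating a stream of query events into a
# database -> table -> users "world view". Illustrative, not Prefect's schema.
from collections import defaultdict

def build_world_view(events):
    """events: iterable of dicts with 'user', 'database', 'table' keys."""
    view = defaultdict(lambda: defaultdict(set))
    for e in events:
        view[e["database"]][e["table"]].add(e["user"])
    # Freeze into plain dicts with sorted user lists for stable display.
    return {db: {tbl: sorted(users) for tbl, users in tables.items()}
            for db, tables in view.items()}

events = [
    {"user": "alice", "database": "warehouse", "table": "orders"},
    {"user": "bob", "database": "warehouse", "table": "orders"},
    {"user": "alice", "database": "warehouse", "table": "users"},
]
world = build_world_view(events)
# world["warehouse"]["orders"] == ["alice", "bob"]
```

The point of the sketch is the shape of the problem: passive events come in open ended, and the system organizes them into something you can correlate with failures.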
Would you say that in general, you orient towards keeping the API simple or adding more functionality?
Yeah, we try really hard, sometimes to a fault, to keep the API as just building blocks, and then clients have to implement logic on top of them. Our UI, for example, will patch together four different endpoints just to create a restart button or something like that. It's just a design bias that we have.
That makes sense. I almost predicted that you were going to say that just based on what you said, because, I know it's a very different company, but I get Airtable vibes. It's almost like a spreadsheet is this generic abstraction. If you did it really well, a lot of use cases could just use it. You don't need a specialized tool for everything.
In your case, it's like, hey, this idea of an orchestrator, I know an orchestrator is not a well defined thing, but a Python orchestrator and a framework in Python to define things and have things run.
That, if done well, is general enough that it solves such a wide variety of problems that, instead of finding some very overly customized tool for the job, you can just use Prefect and make it work for you. Is that a fair takeaway?
Hundred percent fair. That's the way that we always try to pitch it: we can grow with your use cases and with your team's focus. You don't have to worry too much about that lock in.
This is super fascinating. I personally learned a lot from this. I think I would have definitely been guilty of calling Prefect a DataOps tool before this conversation. I'd love to end with, let's say, someone listening to this like me, thinking, this is awesome, I need to go tell my team about this. They need a two sentence tweet pitch, whatever. What do you think the tweet-level takeaway of this conversation should be?
Oh, wow. This is the most challenging question that you've asked, because I'm terrible at Twitter, and I talk a lot, if you haven't noticed. I think anyone who finds themselves wanting to operationalize some Python code should check us out.
One of the things, and this is getting away from tweet length, but one of the things that we have really focused on in our 2.0 push is this idea of incremental adoption. What I mean by that is how much Prefect code you have to write and care about, and the concepts you need to know, should be proportional to how complicated your use case is, or even sub-linear in how it scales.
One thing that you can use Prefect for is, if you're already running a script on cron, you can drop in a one-line code change that just decorates your main entry point function, and now you have a dashboard that you can use to see what your failures are, when your thing is running, when it's succeeding, etc.
You already now get a little bit of metadata with a fun dashboard on top of it all, just using the open source. You don't have to stand up any infrastructure to do this. It's all happening for you.
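[Editor's note] A plain-Python sketch of the pattern being described here, decorating an entry point to capture run metadata without touching the script's logic. This is not Prefect's implementation; it's an invented, dependency-free analogue, where `RUN_HISTORY` stands in for the dashboard.

```python
# Illustrative only: a homegrown decorator recording the kind of run
# metadata an orchestrator's entry-point decorator captures for you.
import functools
import time

RUN_HISTORY = []  # stand-in for the dashboard/API an orchestrator provides

def tracked(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"name": fn.__name__, "start": time.time()}
        try:
            result = fn(*args, **kwargs)
            record["state"] = "Completed"
            return result
        except Exception as exc:
            record["state"] = "Failed"
            record["error"] = repr(exc)
            raise
        finally:
            record["duration"] = time.time() - record["start"]
            RUN_HISTORY.append(record)
    return wrapper

@tracked  # the "one-line change" to an existing cron script
def nightly_etl():
    return "ok"

nightly_etl()
```

The incremental-adoption idea is that the script body never changes; only the decorator line is added, and failure/success/timing metadata shows up somewhere you can see it.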
Layering observability, essentially, on top of an existing orchestrator.
Exactly. Then next thing you know, you're going to want to layer in some retries, and now you're starting to move into the Prefect world. But it doesn't have to be data. Any type of Python code you're trying to operationalize.
That's great. You have an open source repo, and I'll include the link to it so people can check it out.
This has been great. Chris, thanks so much for hopping on with me today.
Yeah, thanks for having me. This was really fun.