[00:00:06.070] - Simba Khadder
Hey, this is Simba Khadder, and you're listening to the MLOps Weekly Podcast. This week I'm joined by Mark Freeman. Mark is a community health advocate turned data engineer. He received his master's from Stanford School of Medicine. He later founded On The Mark Data, where he connects brands with data professionals through his content.
[00:00:24.120] - Simba Khadder
On the data and ML front, Mark has worked with numerous startups where he's put machine learning models in production, integrated data analytics into their products, and led migrations to improve data infrastructure.
[00:00:35.710] - Simba Khadder
Mark, so great to have you on the podcast today.
[00:00:37.860] - Mark
Excited to be here.
[00:00:39.610] - Simba Khadder
Well, I've given the quick intro on you. I'd love to hear in your words how your career has progressed over time.
[00:00:48.470] - Mark
Where I'm currently at is not where I thought my career was going. My story starts off where I thought I was going to be a doctor. I studied undergrad, preparing to take all the science classes. Then when I got into grad school, I did my Masters in community health and prevention research over at Stanford Med.
[00:01:05.100] - Mark
That program was potentially to get me more prepared for medical school. Then being in there I realised, oh, my gosh, I hate this actually. I do not enjoy being in grad school. I love health care, I love learning, but grad school is just not the right environment for me. Imagine four more years of that through medical school, that just wasn't going to work.
[00:01:23.920] - Mark
Being at Stanford, I had this unique experience where I was exposed to tech because it's in Silicon Valley, exposed to business, exposed to health care. Then because my program was very quant focused, I was learning how to code in R and do the statistics. Data science ended up being one of the top choices from that. I was like, wow, I can still pursue health care. I can still focus on trying to solve social problems, but I can scale it up via business and data. I just became obsessed.
[00:01:52.990] - Mark
It ended up being a blessing in disguise because data ended up being my actual true passion where I just can't stop thinking about it. The roles I've been in: I've been a data analyst, I've been a data scientist, I moved on to be a data engineer, and now my next career role that I'm starting soon is going to be DevRel, specifically for the data engineering space: how can I create content and community around data engineers and talk about the problems I experience?
[00:02:17.100] - Mark
My career has been really varied. I can go really into the specific parts if you're interested. But what I really like about it is I've touched a lot of parts of the data lifecycle, which has been really fun.
[00:02:27.520] - Simba Khadder
I feel like a lot of data engineers come from either software background or true research backgrounds, or maybe they were always data engineers. There's a lot of ways, but it seems from your path, your first code was in R. You're definitely coming from that path.
[00:02:44.250] - Simba Khadder
Tell me about that. I guess how different is it to be writing R code as someone who's going to med school or whatever versus being a true data scientist? Is it similar, is it the same?
[00:02:57.760] - Mark
It's completely different in the sense that when I was writing R code, especially in grad school, academic code is not the best code. My goal is just to get the output for a statistical analysis and make sure the research design is really good, but I'm not thinking about how to write great code.
[00:03:15.110] - Mark
I looked back to my code when I was a data analyst at Stanford doing clinical research for them, and my code is just a super long script, no functions really that much. It's just 400 lines of actions happening to get an output. That ends up getting published, not the code, but the output of it. That is really bad code.
[00:03:37.090] - Mark
If I went into a job and did that, the software engineers would just say, "Get out. I'm not merging this into our trunk." For me, a big shift was actually going from, alright, this is how I write code in academia to do analysis, to, this is how I write effective code to help build a product.
[00:03:55.720] - Mark
That actually took a few years. The first iteration was actually learning Python. One of my first jobs out of grad school wasn't even in data. It was in operations because I just didn't have the confidence to get the data job I want. But I knew I still wanted to be a data scientist.
[00:04:10.380] - Mark
In operations, there's a lot of Excel workflows. I started automating all the workflows in Python. I taught myself Python over a course of a couple of months, started implementing those. The next stage was actually becoming a data scientist.
[00:04:23.960] - Mark
Now, other people have to read my code. It's not just other people who aren't technical. Other people have to read my code, so now I'm being very intentional about how do I make my code easy to understand?
[00:04:33.460] - Mark
For data science, it's more so exploratory and making insights relevant and so following a certain pattern. I was still in health care when I did data science. Many times because in health care, it has to follow research best practices, so making that very clear for that.
[00:04:49.500] - Mark
Then for my next data science role, it was for an HR tech company, and that was very product focused. That's when I really started to learn how to write really effective code because that was one of my goals. I said, I want to learn what ML looks like in production. I want to learn what data science looks like in production. I want to work with the production code base.
[00:05:06.880] - Mark
My manager was really amazing and made sure I partnered with software engineering a lot. That's when I actually learned how to code properly because I couldn't merge my code until it passed code review. Code review is probably one of the most effective ways to learn how to write really great code, both doing code review and having your code reviewed by people that are much more talented than you.
[00:05:27.740] - Mark
That's when I learned how to do unit tests, classes, logging, really strong docstrings, how to abstract things, where I should put things within the code base. That really came from shifting my thinking from data science as insights to data science as a product itself: how are my data processes going to be used in the product?
[00:05:47.440] - Simba Khadder
It's interesting that you had this progression where at first, the code didn't matter, it was just output. It sounds like even in an analyst role, that was true. You have a Jira ticket or something, it's like, hey, I have this question, answer it. You're like, I'll go answer it.
[00:06:03.040] - Simba Khadder
Then you said as a data scientist, you started sharing code and collaborating, and then you're writing code for other people as well. You mentioned one thing that really helped was actually code reviews. You also talked about production code, how production code is very different from experimentation code. I actually want to break that down.
[00:06:21.680] - Simba Khadder
Let's first talk about maybe the collaboration and experimentation amongst data scientists. What does that look like? I know everyone does it differently, but I'm just curious from your experience, if I'm a data scientist, or I want to be a data scientist, or I'm a data engineer working with data scientists, how do they collaborate? Are we messaging on Slack? Are we doing meetings? Why do we even collaborate?
[00:06:44.660] - Mark
Actually, I want to clarify the question. Are you talking about how data scientists work with stakeholders in general or with other data scientists? 'Cause they're different.
[00:06:52.050] - Simba Khadder
I asked about data scientists with other data scientists. I'm also curious about the stakeholder piece. If you can answer both, that's even better.
[00:06:58.120] - Mark
Yeah, I can answer both. I think a framework that I created... Long story short, for my first data science roles, I just wasn't effective and I made a big mistake. From that, I created a framework to make sure I never made that big mistake again. I call it the TRIBE framework, which stands for talk, requirements, iterate, build, and evangelise.
[00:07:17.540] - Mark
This process is really going through how can I create effective data projects that drive value? A big part of this has very little to do with coding. It's actually the stakeholder management, getting requirements, and all those things.
[00:07:31.160] - Mark
The talk stage is, say for instance, I have a request from a stakeholder saying like, hey, we want XYZ insight. Well, the analogy I give is that most of these non-data people don't know what to ask for. It's so abstract to them. The way I describe that is if I asked you to tell me about one row in an Excel sheet, that's really easy. If I asked you to tell me about a million rows in a sheet, all of a sudden your mind just blows and you're just like, how do I even manage that?
[00:07:57.650] - Mark
That's what our stakeholders are experiencing through that when they ask us questions. Our job as a data professional is to help guide our stakeholders to better questions. I'll get a request and I'll push back. I'm like, why are you answering this question? What business value is this driving? Is this a new feature coming out? Why is this metric more important?
[00:08:17.560] - Mark
Then from there, I can try to assess what's the scoping of trying to solve such a problem? To answer that, or to create a data product along those lines, what would this require?
[00:08:27.800] - Mark
More importantly is what data assets exist and with those data assets, what's the level of quality for those data assets? Do I trust this data? Do I need to validate and work with software engineering to understand, hey, this is how you generate this data in these logs? This is how it goes through our system. This is how it hits our data warehouse. Can I trust this?
[00:08:45.520] - Mark
In addition, I'm going to the business stakeholders to understand what's the business logic you believe for that. There's a lot of upfront work before you even code and start with the data.
[00:08:53.630] - Mark
From there, this is where the data scientists start coming in. When I do the scoping, it's like, alright, if I am doing some statistical analysis, what is the best approach to do this? I might do some exploratory analysis to understand the distribution of the data.
[00:09:06.680] - Mark
Then from there, I'll go talk to my team like, hey, this is the approach. I'm thinking about doing this type of model. I'm thinking about doing this type of analysis. I want to think about this type of population, so I'll filter these things out and create that process and initial code for that after I prove this is the direction we want.
[00:09:24.580] - Mark
This is when other data scientists become really, really helpful in that workflow: I need to talk to them and see what I'm missing and what other approaches I could take. 'Cause you can get to an answer in multiple ways and there's pros and cons of different ones.
[00:09:39.930] - Mark
You may do a different statistical analysis. You may have a different assumption. When I'm working with data scientists, my goal is, these are my assumptions. Would you make different assumptions? If so, how would you approach this? That's really key.
[00:09:54.080] - Mark
I think you can't have an ego going into this because... That's on your managers as well, to create a space where you have that psychological safety to be like, I'm okay to be wrong, because as a team we want to get it right. What's the best approach?
[00:10:06.520] - Mark
Then once you get that analysis, you get that done, you're pretty happy with it. Then you go back to your stakeholders and show the results. One of the biggest mistakes I think a lot of technical people make is saying, here are the results, here's the data, and thinking it's self-apparent.
[00:10:20.260] - Mark
It's not to them. You have to really massage and think through how do you communicate these results. For the TRIBE framework I talked about, the last step after build, the evangelise step, is where I try to answer three things.
[00:10:31.650] - Mark
What was your pain point? How did I solve your pain point with this? Then how can you use it today, whether it's a product feature or insight? If I answer those three questions to the key stakeholders, it's an instant win.
[00:10:45.380] - Simba Khadder
It's such a good point. I noticed a big difference between junior people, especially junior data scientists and more senior data scientists. Junior data scientists will give you the facts and they're like, okay, it's your job to form an opinion.
[00:11:02.920] - Simba Khadder
More senior people understand that, hey, here are the facts. Here's how I read this. Here's my opinion. There's a balance because you don't want to be like, it's this. We need to do this thing 'cause you might... The data is never... It's very rare that you will find real causation. It's almost always correlation.
[00:11:24.390] - Mark
The statistical phrase is all models are wrong, but some are really useful. That's what I always fall back to.
[00:11:31.420] - Simba Khadder
I think it's such a... I feel this even with designers and lawyers, more senior lawyers, they'll be like, hey, this is what you're going to want to do. More junior lawyers, they'll ask me like, should we include this language or not? I'm like, I don't know. It's your job.
[00:11:46.800] - Simba Khadder
It's true of designers, too, where they ask, should it look like this or this? I'm like, I don't know. I want it to look good and get this across. It's your job to like... I can see it from the opposite ends 'cause there I'm pretty much just a dumb stakeholder. I don't know anything about design. It should look good.
[00:12:05.640] - Simba Khadder
I think as data scientists, it's really important to remember, one, how much ownership you actually have. People just want an answer and chances are you're going to be able to come up with a stronger opinion, as someone who is a trained data scientist, than your stakeholder who is very good at their respective function, but isn't living data like you do.
[00:12:32.100] - Mark
Definitely. One key thing I want to highlight is if I'm asking those clarifying questions at the end, that means I failed because I'm not meeting your needs. I'm not aware of your needs. That's why I spend so much time upfront doing that scoping and discovery and talking to my stakeholders, because there's two things.
[00:12:47.820] - Mark
One, I deeply understand the problem. I can deliver on that. But two, it builds trust where they feel as if, Mark got me. I know he's going to look out after our problem, make sure he delivers 'cause he's been so thorough upfront.
[00:12:59.560] - Mark
That trust component is really big, so when you do give your numbers, they're more likely to trust it. When you do give results, they're more likely to trust it. Because you've asked those upfront questions, you can actually say, hey, this is another consideration. As our project evolved, I remember you saying these are these key metrics. You didn't ask for this, but here's this additional perspective. You can start foreseeing other questions they'll have as well with that.
[00:13:22.360] - Mark
That's why I spend so much time on the collaborative communication piece upfront. Spend more time than you think you need to, and you'll be set up for success when you communicate it down the line.
[00:13:33.400] - Simba Khadder
I know we've talked deeply through your mental framework on it. How about tooling? Does this even fit in DataOps? Is this a whole other thing? How do we think about this tooling in relation to all these workflow problems?
[00:13:49.700] - Mark
Definitely. I think what I was describing is very tactical and on the data science side. On the DataOps side, for context, my introduction to DataOps was through data engineering. I want to highlight, I'm not a DataOps expert. I see the value of it, and so I started creating content around it to interview these experts in this space.
[00:14:09.980] - Mark
But from what I've seen, for data engineering, to be successful and to trust the numbers, the question is: as a data scientist, how can I have the data to do what I do? That's actually how I got into data engineering. My last role, the data was not up to par and I couldn't do my job.
[00:14:25.030] - Mark
As a data scientist, I just said, hey, I know how to code in Python. I'm one of the best coders on our team for that. I'm just going to dive in and fix this. Go upstream and actually resolve this. Fix our data warehouse, fix our ETL pipelines. That's how I got into data engineering, in a way.
[00:14:42.240] - Mark
Then from there, I just realised how reactive data engineering is 'cause when data goes down, I can't do my job. If I can't do my job as a data scientist, I can't work for my stakeholders. If my data is bad quality or there's a hidden error, either they don't trust the data or a fire happens from my results 'cause they were wrong.
[00:15:03.080] - Mark
Where I think tooling comes in for the DataOps side is helping data teams become less reactive, being aware of the risk and opportunities of their data assets, and being able to move accordingly, whether it's their data roadmap, their product roadmap, or just their individual projects.
[00:15:21.840] - Mark
I think the tooling really for that process I was talking about, that collaboration piece, I don't think there's really much tooling for that. That's just more personal business experience that I put into a framework for myself. Maybe Jira might be something, but I don't think it really solves that.
[00:15:37.390] - Mark
But the reason why I'm going through all that is to account for the fact that data is hard to understand, data is really abstract, and data has an affinity for entropy. It's going to go towards chaos at all times. DataOps comes in a way to create frameworks and guardrails, to protect and understand your data and help you be less reactive.
[00:15:59.740] - Simba Khadder
We've talked about data scientist stakeholder collaboration. We've talked about data scientist collaboration. We're starting to touch on the data engineer part. How do data scientists and data engineers work together?
[00:16:11.940] - Mark
That's the interesting component 'cause for a while I was a data scientist masquerading as a data engineer before I got the actual data engineer title. I was like, how do I work with myself? But within that time, I shifted from I'm a data scientist on the data science team to I'm a data engineer on the data science team.
[00:16:30.510] - Mark
My stakeholders shifted from business stakeholders of like, what insights do you want? To my stakeholders were data scientists and saying, what are your stakeholders asking you? What is possible for you? What is impossible for you right now? More importantly, what's extremely hard for you? You have to jump through a whole bunch of hoops to handle.
[00:16:50.080] - Mark
Now, my collaboration went into like, how can I empower the data science team to be more impactful? One of the hidden challenges of that is that as a data scientist, I'm very front and forward in the company. It's easy for me to show success in the company 'cause it's very visible.
[00:17:05.280] - Mark
When you move to data engineering, you're in the background supporting the other people who are shining. Showing your impact in the company on a broader scale is much harder on the data engineering side.
[00:17:15.760] - Mark
That's where the collaboration with data scientists happens: you really talk about how do we empower these downstream data consumers. Whether it's bringing new data assets through a pipeline, or restructuring the data warehouse to make it easier for them to understand things, and getting those use cases from them, where they say, hey, this is what I did after you implemented this new data feature, or implemented this new pipeline. I was able to answer these questions for the stakeholders.
[00:17:42.520] - Mark
That's where that collaboration with other data scientists comes into it. It moves from creating the insights to empowering people to create insights.
[00:17:49.600] - Simba Khadder
Got it. It's almost like the data engineers' stakeholders are the data scientists and data analysts. Their stakeholders, it depends. It could be marketing, it could be product, it could be finance. It could be a lot of people.
[00:18:00.690] - Mark
Definitely. Something I did early on, too, was to foresee what the data scientists would do. I would go throughout the organisation and talk to business stakeholders. Be like, hey, what questions are you trying to answer? What's really hard for you to figure out your next steps on? What data are you currently referencing to make those decisions?
[00:18:20.410] - Mark
They'll either say they don't have any data or they're like, I use this from these three different sources and it's really hard. That was like, bingo. I know I can take these three different data silos and bring them together within the data warehouse, and that will provide ultimate value to my data science stakeholders to answer those questions for the leaders.
[00:18:38.300] - Mark
I try to bridge as many of those gaps as possible. It's like identifying where the data silos are and bringing them into one place, where data scientists and data analysts can analyse.
[00:18:48.620] - Simba Khadder
Some teams I talk to, data scientists do very little... They don't write Spark code. They're really truly working off of the nice, clean final data sets. In other places, it's like, I'm writing like, Flink jobs in Java. There's everything in between. The right way depends on the company. It depends on a lot of things. But if you had to prescribe a default, should data scientists be building these tables? Should data engineers always be doing it? Where's the line?
[00:19:24.680] - Mark
I think that's a good question. The reason being that it's a really hard question that I think everyone, even myself, is confused on. The reason I say that is that it's not really dependent on the title, it's really dependent on the company and more specifically, the data maturity of the company. If you're a really data-mature company, say, for instance, a company that uses Featureform and has a feature store, you're more on the mature side of things.
[00:19:51.080] - Mark
That probably makes sense that they have data engineers and stuff like that. But if you're at an early stage startup like I was at, where the data scientist became the data engineer, the data scientists are going to be doing a lot more because you have fewer people and fewer resources and your focus is proof of value.
[00:20:06.220] - Mark
Proof that these data-intensive applications are worth the business's investment, as compared to a more mature company where they see the ROI already and they can say, if we invest in this, we can see this revenue in return.
[00:20:18.600] - Mark
I think it really comes down to, one, what's the data maturity? Data maturity doesn't mean size of the company. You can have a really large company that's data-immature. But the data maturity of the company, how bought in is the company into data and how it drives revenue, especially now in this different market?
[00:20:36.620] - Mark
Then from there, you'll get a really good sense of what's the expectations. Will you see this more specialized where data scientists are able to focus on generating insights, creating models, and doing this more R&D aspect of it? Or if they're very early, being very experienced and generalizable where they can do the software engineering, they can do the data science stuff.
[00:20:56.590] - Mark
I've worked in companies where I was using Spark. They were a small startup. I've worked in other companies where it was super small data and we were just using SQL. It's so company-dependent.
[00:21:07.680] - Simba Khadder
I guess similarly on the... Maybe moving in a bit to the infrastructure side, a few questions. One, in some companies I talked to, data engineers, they own the cluster. They own the Spark cluster, they own whatever, Snowflake. It's like theirs.
[00:21:22.730] - Simba Khadder
In other companies, it's IT and then they use it. In some companies I found, there's a lot of different infrastructure. It's like we have 20 Spark clusters. In other companies, it's like Snowflake, everything. Everything is in Snowflake. Literally, our models are running in Snowflake.
[00:21:39.000] - Mark
That sounds very expensive.
[00:21:40.760] - Simba Khadder
Yeah. There is value in simplicity, too. It's always a trade-off. Where do you think the world is going? Is everyone just going to be on Snowflake? Is there going to be more of a diversification of data infrastructure? I'll have a vector database and a graph database. Are you going to have 20 Spark clusters? Are you going to have this one giant one? Where are things moving in your opinion around the infrastructure?
[00:22:04.290] - Mark
Two people I really look up to in the data engineering, DataOps and data infrastructure spaces are Joe Reis, who wrote the Fundamentals of Data Engineering book with Matt Housley, and, hopefully I say his name right, Juan Sequeda. The main theme they've been talking about for 2023, so I don't want to say it's my theme, it's their theme that I was just like, you're right, is, show me the money. They're saying the future is whatever drives revenue for the business.
[00:22:31.760] - Mark
I think the past 5, 10 years, we've seen these loose data workflows where we're like, we're just going to throw money at this problem. That's how you end up with these very expensive ELT pipelines going to a data lake that we think we can use data for, and we have all these complicated things on top of it and just throwing money and cloud resources at it. People are going to be more critical of it because capital is much more expensive now.
[00:22:57.540] - Mark
Now, I think where the direction is going is, it comes back to build versus buy. Your infrastructure is going to reflect what drives value for the business, not just the best practices that everyone's doing, as in, all these companies have this modern data stack infrastructure, so we're going to do it too.
[00:23:16.500] - Mark
I think we're going to be more critical of how we're using our money, how we're optimizing for it, what's our competitive advantage, so therefore we build that. Then what is something that is necessary but not our competitive advantage, so we're going to buy that. I think that's really going to be determining the direction for that infrastructure.
[00:23:35.380] - Mark
Another person I talked to is Ethan Aaron, who is CEO of Portable. He's really cool to talk to because his background is in mergers and acquisitions. I was actually talking with him about exactly this. What he was saying is, it's less that these platforms are going to come together into a conglomerate. Instead, he was arguing that there's essentially going to be less R&D and innovation.
[00:24:02.120] - Mark
People are going to be making the more boring choices because it's safer to drive revenue. That's something to be aware of when you're bringing these things together. I think there's going to be more data tools, but people are going to be way more selective in choosing which data tools to use. I think the big thing to look out for is which data companies are moving from point solutions to actually being a platform to integrate all these tools together. That's what I'd be paying attention to.
[00:24:26.990] - Simba Khadder
What do you think is going to happen there? Do you think that there's going to be more platformification, or do you think there's going to be more unbundling in the DataOps space?
[00:24:37.220] - Mark
Specifically in the DataOps space? That one's a really hard question because I think DataOps has been around for much longer. They've been talking about it for a while, but it's just now getting attention. I feel like the past five years were very focused on MLOps, ML. I think people are really starting to pay attention now. They're like, "Oh, wow, we spent all this money on ML, and actually our data is bad, and that's the reason why ML is just not working." Now there's a huge attention.
[00:25:04.550] - Mark
I think there's going to be a lot of DataOps and data tools proliferating right now. I always go back to Matt Turck's MAD diagram, which is the state of data and ML, and every year it gets larger and larger, and I think DataOps is going to contribute to that. I don't see this bundling really happening. It's still way too early to say.
[00:25:28.080] - Mark
But from what I've seen is that there's so much pain in the market in dealing with this. When there's a lot of pain, people think there's a lot of opportunity to build things around that. That's why you see all these data observability, data lineage, orchestration, all these tools popping up, open source, closed source. I feel like there's a new data tool every single day. I think it's still going to ramp up.
[00:25:52.660] - Simba Khadder
A quote I heard recently, which has really stuck with me, is: at a sufficient scale, everything is a logistics problem. I think that really applies to data. In the end, it's less about how do we move bits and how do we run these functions, and it becomes much more about, how do we manage these assets? How do we use our people and how do we make them productive?
[00:26:18.080] - Simba Khadder
It just becomes way more about managing assets and teams than it does data. It's almost like, oh, yeah, that's what we happen to be doing. Is that fair? Do you feel like the DataOps tools are...
[00:26:28.540] - Mark
I think I would make an adjustment to that, which is that I think there's going to be a hyper focus now on the data and the assets you have. It's moving away from data just being some resource you use to power your business, to data itself being a product that you need to treat as such and elevate as such. That's why there are tools coming into place to keep track of your products.
[00:26:53.850] - Mark
Because if it's just a resource, you're not going to care as much. All right, we have these tools. But now this data and this specific data asset is tied to this revenue stream, especially if it's a successful ML model, where it's a recommendation system, something that's like driving sales.
[00:27:07.820] - Mark
That data becomes very important and you're going to want to create tools to monitor that data because there are some companies where they can track for, if this model is down or this data is wrong, it impacts this much revenue every single day. That's a big deal to companies. I think that's where tooling and especially DataOps comes into play.
[00:27:23.700] - Simba Khadder
Yeah, I think that makes a lot of sense. One of my really good friends is a PM for data. He laughs when he has to describe what that means. He's like, "Yeah, I'm a product manager and my product is data." I see and I agree with it, but maybe for our listeners, what does it mean to treat data as a product and not as an asset?
[00:27:40.460] - Mark
That's a great question. I think the biggest thing is that when you have a product versus a resource, there are expectations that the product will be a certain way at a certain time, and there are thresholds that you care about. In addition, there's metrics you care about.
[00:27:58.120] - Mark
If you're just pumping data into a data lake and you're like, this data exists, you're not paying attention to it. But if it's actually driving a product, then it's like, all right, this data quality went down by this much. Upstream, they changed the way they format it, and this impacts this thing here. Our refresh rate, is it meeting the needs of our downstream users?
[00:28:18.100] - Mark
Then you start asking really critical questions of, what does this data need to have for it to meet a threshold of success for this revenue stream, for this product? I think that's where thinking of data as a product changes things, because then all of a sudden you start caring about what's coming in, how fresh is it, how realistic is it to the business, what things happen to it along the way. There's a whole list of things, but I think it really changes the emphasis from, this is something we use, to, this is something that's driving value and you pay attention to.
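[Editor's sketch: the expectations Mark describes (freshness, quality thresholds, metrics you care about) could be written down as explicit checks. This is a minimal illustration, not a tool mentioned in the episode; the class, function, and field names and the threshold values are all hypothetical.]

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataProductExpectations:
    """Hypothetical expectations for one data asset (illustrative names only)."""
    max_staleness: timedelta   # how old the latest refresh is allowed to be
    max_null_fraction: float   # tolerated share of missing values in a field


def check_data_product(rows, last_refreshed, now, expectations, field):
    """Return a list of readable violations; an empty list means healthy."""
    violations = []

    # Freshness: is the refresh rate meeting downstream needs?
    staleness = now - last_refreshed
    if staleness > expectations.max_staleness:
        violations.append(f"stale: last refresh was {staleness} ago")

    # Quality: how much of the field is missing?
    if rows:
        null_fraction = sum(1 for r in rows if r.get(field) is None) / len(rows)
    else:
        null_fraction = 1.0  # assumption: an empty table counts as fully missing
    if null_fraction > expectations.max_null_fraction:
        violations.append(f"quality: {null_fraction:.0%} of '{field}' is null")

    return violations


# Made-up example data: refreshed 30 hours ago, one of four prices missing.
expectations = DataProductExpectations(
    max_staleness=timedelta(hours=24), max_null_fraction=0.05
)
rows = [{"price": 10.0}, {"price": None}, {"price": 12.5}, {"price": 11.0}]
violations = check_data_product(
    rows,
    last_refreshed=datetime(2023, 1, 1, 6, 0),
    now=datetime(2023, 1, 2, 12, 0),
    expectations=expectations,
    field="price",
)
for v in violations:
    print(v)
```

[With the made-up data above, both the freshness and the null-fraction checks fail, which is the kind of signal a team treating the table as a product would alert on.]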
[00:28:50.630] - Simba Khadder
Maybe take me in a different place now. You've talked a lot about your data science tool and your data engineering experience. Nowadays, I know you've been putting out amazing content on LinkedIn. If our listeners aren't already following Mark, you should be. You have a great newsletter called Scaling DataOps. Firstly, what prompted the move? Why did you start writing content? Why did you start creating a newsletter?
[00:29:13.320] - Mark
It's twofold. One, there's an interest and curiosity, and then there's also a selfish reason. I'll start with the selfish reason. I talked to a career coach, and they were saying, "If you want to elevate your career, you need to start talking to leaders more to learn from them." I was like, "How do I talk to leaders? How do I message people on LinkedIn and they want to talk to me?" If I have a newsletter and create a platform to highlight them, they're more willing to talk to me.
[00:29:35.330] - Mark
That's why I created the newsletter for a selfish reason. But then the second question is, I could have chosen any topic, so why did I choose DataOps? The reason is that, from my experience at the last startup I was at, I just saw the pain of having bad data processes and how it slowed down product and how it slowed down insights and made it really hard to trust things.
[00:30:01.140] - Mark
In addition, I was seeing this big move recently with Andrew Ng and his data-centric AI, where he was really talking about how the focus has been on improving the ML models. We have these large language models now, and these large tech companies that have the money to throw in the power for that.
[00:30:16.150] - Mark
There's no competitive advantage in that aspect if you're a small company. But where you can have a really big competitive advantage is having really strong data to put into these models. I think that's where a lot of companies can differentiate: how can you curate very valuable data that other people just don't have? That's what really got me interested in data engineering and made me move from data science to engineering.
[00:30:40.720] - Mark
But more importantly is how do you create amazing curated data sets and ensure that they stay that way, and DataOps just seemed like the thing that kept on popping up. The more I read about it, I'm like, this just makes sense in the data infrastructure space. I want to really devote my time here.
[00:30:58.330] - Simba Khadder
Well, you talked about models. The way I've been putting it when people ask is, the models have continued to become more and more commoditized. They've become way better, but they've also become way more generic. Like you were saying, we're moving from doing things for the sake of doing things, like, "Oh, I'm going to tune this model and get a point something percent performance boost. I can say I did that."
[00:31:24.600] - Simba Khadder
More like, hey, actually, we care about what does that actually mean? It could just be revenue. How much money is this affecting or something else? You have a metric. Turns out, in my experience, most of our actual gains that mattered came from feature engineering and coming up with better signal, like seriously taking data and pulling signal out of it.
[00:31:44.730] - Simba Khadder
I found that as data scientists, a lot of our job has become taking our domain knowledge and learning it from the company, too, and taking the raw data and it's just like crossing those into signal. Our job is just essentially injecting the domain knowledge from the business into the data. The models just work. They just take that and they just perform magic.
[00:32:08.200] - Mark
Definitely. I would love, if you're okay with it, to turn the tables a little bit and ask you a question so I can learn a little bit more. Again, you're really focused on feature stores. How do you feel data engineering best integrates with those feature stores, given that you're talking about how feature engineering is the biggest driver? I agree with you because curated data is doing that. Based on my answer, what do you think I'm missing from the feature store perspective?
[00:32:33.030] - Simba Khadder
I think there's the question that I actually asked you first, which was, where does the data scientist split with the data engineer? Where is that? The answer that we found in practice is that, like you said, it depends entirely on the business. It depends on the company, depends on the team they've built, and literally who they hired first might completely affect that split.
[00:32:55.300] - Simba Khadder
Feature engineering starts as early as picking the data, finding it, cleaning it up, just the stuff that looks more like what data engineers do. Then there's the feature engineering side, which is things like scaling, doing just things that would seem very random to a data engineer. I think data engineers, the way they think, they build metrics. With metrics, there's usually a source of truth.
[00:33:19.460] - Simba Khadder
Typically, metrics, they are used in a format that looks like a spreadsheet or a slide deck or a chart or a BI tool of some kind. That's where it's feeding into. For machine learning, we're feeding into models. We might do these weird transformations where it's like, hey, let's say I'm making an MRR, like a monthly recurring revenue feature. For this model, I want to cut any contracts where the price is less than $1,000 a month, and I only want to do the ones in the US.
[00:33:54.600] - Simba Khadder
I want to remove anyone that's too many deviations away from the mean. You start doing all this stuff where you would just say that, as a data engineer, you would never build that table. If a data scientist sent you, "Hey, can you build this for me?" You'd look at them like they were insane.
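[Editor's sketch: the MRR feature Simba describes (drop contracts under $1,000 a month, keep only US contracts, remove outliers far from the mean) might look roughly like this. The field names, the function name, and the deviation threshold are assumptions for illustration; the episode does not specify how many deviations counts as too many.]

```python
from statistics import mean, pstdev


def mrr_feature(contracts, min_price=1000.0, country="US", max_deviations=2.0):
    """Sum monthly prices after the model-specific filters described above.

    `contracts` is a list of dicts with hypothetical keys
    'monthly_price' and 'country'.
    """
    # Step 1: keep only contracts at or above the price floor, in one country.
    prices = [
        c["monthly_price"]
        for c in contracts
        if c["monthly_price"] >= min_price and c["country"] == country
    ]
    if len(prices) < 2:
        return sum(prices)

    # Step 2: drop prices too many standard deviations from the mean.
    mu, sigma = mean(prices), pstdev(prices)
    if sigma == 0:
        return sum(prices)
    return sum(p for p in prices if abs(p - mu) <= max_deviations * sigma)


# Made-up contracts: one below the floor, one outside the US, one outlier.
contracts = [
    {"monthly_price": 1000.0, "country": "US"},
    {"monthly_price": 1200.0, "country": "US"},
    {"monthly_price": 1100.0, "country": "US"},
    {"monthly_price": 1300.0, "country": "US"},
    {"monthly_price": 50000.0, "country": "US"},  # outlier, removed in step 2
    {"monthly_price": 500.0, "country": "US"},    # under $1,000, removed
    {"monthly_price": 2000.0, "country": "DE"},   # not in the US, removed
]
# A tighter threshold is used here because the sample is tiny.
mrr = mrr_feature(contracts, max_deviations=1.5)
print(mrr)
```

[This is exactly the kind of table a data engineer would be unlikely to build as a shared metric, since every filter is specific to one model.]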
[00:34:09.440] - Simba Khadder
The cleaning part and imputation and some of the more basics, it's just like finding the data, joining it together. It depends on the company, whether the data engineer or the data scientist does it. But the last mile for sure is done by the data scientist, almost always. It would be very strange otherwise. The problems that we see...
[00:34:28.240] - Simba Khadder
At my last company, we had a Google Doc that was like the master list of SQL snippets that were really useful.
[00:34:36.300] - Mark
That sounds painful.
[00:34:38.380] - Simba Khadder
Yeah, a lot of companies do. They have notebooks and they have this giant... I met a guy who was a quant, and he told me that from the day he joined the company to the day he left, all of his work was done on a single notebook. He had this massive notebook and so everything that he's ever done was there. He can copy and paste and reference it.
[00:34:59.760] - Simba Khadder
We just come up with all these tricks. We call our notebooks, present or whatever, experiment, _V6, _final. I think the problem that we see is much more about the versioning, orchestration, and a lot of it is organizational problems. It's people problems. It's just how do we work together well? How do we share things? How do I even keep track of what I did?
[00:35:23.960] - Simba Khadder
In my opinion, it's much less about the Spark problems. You said it with DataOps, too. A lot of the problems come from treating data as a product and not so much of how do I run this unique computation or how do I handle this much data? That's a solved problem, but it's not why you need DataOps. I think in MLOps, it's the same thing.
[00:35:47.190] - Simba Khadder
Now you're treating features as a product, where before, a feature was entirely a mapping in the data scientist's brain. Where in any company that has features do you actually write the word feature? Maybe the table name, but truly it doesn't exist as a concept. It's entirely a made up abstraction in the data scientist's head. Featureform's whole goal is like, hey, let's build these abstractions, make these first class entities in the data science and ML workflow, and essentially build the workflow on top of that. Anyway, I digress.
[00:36:18.900] - Mark
That was really good. One of the points I really like that you brought up, and I think it really goes back to the data engineer, data science component, and also how it ties to DataOps, is the last mile problem, where that's where data scientists excel. When I shifted to data engineering, I wasn't given those questions anymore. I had to go to the data scientists and ask, what are those last mile problems you're working on?
[00:36:40.240] - Mark
My goal wasn't to create the perfect data set for the end of that. My goal is to create a perfect data set to give them the room to explore what that last mile should look like. Again, going back to DataOps tooling, a really big piece is data lineage. Where was this data sourced? How did it get here? What are the various transformations it had along the way?
[00:37:03.140] - Mark
Not knowing how that data came to be in the data warehouse, where most data scientists are working, can be very dangerous, because there may be transformations or assumptions so far upstream that they just weren't aware of them. That just totally messes things up.
[00:37:16.480] - Mark
Tying it all together, data engineers are really preparing those data assets that can enable data scientists to create those last mile components, and DataOps tooling is like the guardrails and framework to understand how that data works from end to end.
[00:37:34.340] - Simba Khadder
I think that's a great way to think of it. I think that's the piece. I think that's the answer. That's the answer and how that actually looks depends on your company. Like you said, as nice as that sounds, sometimes as a data scientist, you do have to go and sometimes for good reason. It's like you have something that's so random, seemingly random of a computation, you have to get way back to the original stream.
[00:37:59.120] - Simba Khadder
It's like we have to go that far back before it's untouched enough, I can actually do the thing I want to do. Let's say you want to figure out, what's the average time between events per user? That might require you to go pretty far back down the chain. It's like all the cleaned up stuff has probably scrubbed that signal.
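[Editor's sketch: the "average time between events per user" feature Simba mentions needs raw, per-event timestamps, which is why aggregated or cleaned tables may no longer carry the signal. A minimal version, assuming events arrive as hypothetical (user_id, timestamp) pairs:]

```python
from collections import defaultdict
from datetime import datetime


def avg_seconds_between_events(events):
    """Map each user to the mean gap, in seconds, between consecutive events.

    `events` is an iterable of (user_id, timestamp) pairs in any order.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    result = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        # Gaps between each event and the next one for the same user.
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if gaps:  # users with a single event have no gap to average
            result[user_id] = sum(gaps) / len(gaps)
    return result


# Made-up raw event stream.
events = [
    ("u1", datetime(2023, 1, 1, 10, 0, 0)),
    ("u1", datetime(2023, 1, 1, 10, 2, 0)),
    ("u1", datetime(2023, 1, 1, 10, 0, 30)),
    ("u2", datetime(2023, 1, 1, 9, 0, 0)),
]
gaps = avg_seconds_between_events(events)
print(gaps)
```

[If an upstream job had already rolled these events up to daily counts, the individual timestamps would be gone and this feature could not be computed.]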
[00:38:18.720] - Simba Khadder
As a data scientist doing ML in particular, you're far more likely to find yourself building, let's call it weird features, from a data engineer's perspective. Like you said, with your newsletter, a lot of the focus has been learning from leaders. You flipped it on me, you got to learn also my take on things.
[00:38:38.630] - Simba Khadder
If you had to pick maybe a couple of points or three points of things that you've learned from these data leaders you've talked to that you have instilled in how you do data science and data engineering, what are the main takeaways you're finding?
[00:38:54.330] - Mark
I think a hidden gem on my newsletter that now, because I can see the numbers of what people view, is my first interview for the newsletter with Christopher Bergh, who is the CEO of DataKitchen, which is a DataOps platform. Very few people have seen it because he was one of the first people on there, so it didn't go out to many people.
[00:39:14.600] - Mark
It's a hidden gem because he talks about his process of why he pursued DataOps, what's the value to data engineers, and why should companies care. In addition, I have a whole bunch of resources as well, like the first articles talking about DataOps and what was the problem that came about. One of my biggest takeaways was Chris Bergh said, "I'm here to help data folks say, 'Ew,' when they see data that is nasty. Rather than it being hidden and causing problems unbeknownst to you, you're able to see it and say, 'That is a problem. We should not be this reactive. We should not be this stressed. How can we go upstream to solve this?'" That's really stuck with me. When I asked him, "What would be your first steps of implementing DataOps within the company?"
[00:40:03.380] - Mark
Because one of the challenges of getting data infrastructure buy-in, especially on the data engineering side, is that it's an investment for our future selves. When you're in a startup, your future self seems so far away, you just want to focus on building features. Product features, we talk about different types of product features. How do you get the buy-in to actually invest in healthier, more stable data when it's not an immediate payout?
[00:40:32.090] - Mark
That goes back to our earlier conversation where sticking thermometers into the different areas of your data lifecycle and actually measuring what is going right, what is going wrong, that will fundamentally change how you think about your data.
[00:40:48.800] - Mark
I think he really made the argument that this is how you start shifting towards actually caring about the metrics of your data, because then you can quantify and understand where the data lifecycle is really struggling. More importantly, what can you do to fix it? That's really stuck with me. It's like asking, are you just measuring what's there? More importantly, how do you get buy-in from leadership to care about measuring?
[00:41:14.440] - Simba Khadder
It's a trade-off I constantly have to think about: paying down the debt and then also continuing to move forward. I think everyone who... I mean, even here I see everyone has this trade-off to make and in our heads, we like to make it seem like it's really obvious, like, oh, you need to pay down the debt, or whatever. But in reality, it's really hard. How do you resource-manage is a really hard problem. I feel like we could keep going for so much longer. I love your chat, Mark. Thanks again for hopping on. I really appreciate it.
[00:41:48.840] - Mark
Thanks so much for having me. I really enjoyed it. Thanks for allowing me to ask you questions as well so I can learn, and hopefully your audience can learn, because you have so much to provide and I always love talking to you.
[00:41:58.590] - Simba Khadder
Thanks so much, Mark.
On The Mark Data: https://www.onthemarkdata.com/