MLOps
Weekly Podcast

Episode 7
Scaling AutoML with Nirman Dave
CEO, Obviously AI


July 26, 2022


Transcript:

Simba Khadder: [00:00:06] Hey, I'm Simba Khadder, and you're listening to the MLOps Weekly Podcast. This week, I'm really excited to be chatting with Nirman Dave. He's a Forbes 30 Under 30 recipient and the co-founder and CEO of Obviously AI, which is a no-code AutoML company. Prior to Obviously AI, Nirman built the AI infrastructure at Streamlabs. Nirman, it's so awesome to have you on the show today.


Nirman Dave: [00:00:26] Likewise. Thanks for having me.


Simba Khadder: [00:00:27] I'd like to start by just learning a bit about your journey to where you are today. What got you into machine learning in particular?


Nirman Dave: [00:00:33] That's pretty interesting. When I was in college, I never wanted to do machine learning. I was very intrigued by data science and algorithms and really going into the nitty-gritty of it. One day we had the option to take a random course that you weren't going to get graded on, so it wasn't going to affect your score. So, just for shits and giggles, I ended up taking a PhD-level course on building neural networks from scratch using simply just NumPy. The entire course was about building neural networks purely from NumPy, and I figured I might as well do it because they weren't going to grade me on it. It wasn't going to matter. During that course, it absolutely blew my mind how neural networks function. I got really deep into things like activation functions, hidden layers, all that fun stuff, and really coded it from pure scratch. When I did that, it truly changed my entire perspective on how I think about and view machine learning. After that, I ended up doing more machine learning courses, really got into programming from that standpoint, and went into a bunch of other machine learning fields of study. That's how I just randomly got into it. It wasn't really planned, but it's been pretty exciting so far.
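
For readers curious what that kind of coursework looks like, here is a minimal sketch (not Nirman's actual class material) of a one-hidden-layer network in pure NumPy: a forward pass, backpropagation, and gradient-descent updates, trained on XOR.

```python
# A minimal one-hidden-layer network in pure NumPy: forward pass,
# backpropagation, and gradient-descent updates, trained on XOR.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights and biases for one hidden layer of 4 units
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

lr = 1.0
for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)    # hidden activations
    out = sigmoid(h @ W2 + b2)  # predictions

    # Backward pass (mean squared error loss)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))  # should approach [[0], [1], [1], [0]]
```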


Simba Khadder: [00:01:38] I feel like, knowing you for as long as I have, I'm not surprised that you went for the PhD-level course, while I'm like, cool man, I'm going to go surfing. Obviously AI: you started the company, you're building all of it out. Of all the, I guess, huge range of things you could have worked on in machine learning, what made you decide to work on AutoML?


Nirman Dave: [00:01:57] Great question. I used to work as a data scientist at a company called Streamlabs, as you mentioned. When I joined them, there were about 70 people, and I was the only data science person at the company. And so everyone that was non-technical, right, like the sales analysts, the marketing analysts, even entry-level data analysts, these guys would come up to me and they would say, hey, can you help me build a predictive model for churn or retention? And we would really go into looking at which customers are active, which are new, which have kind of resurrected, and which are likely to churn, and things of that nature. It was pretty important for Streamlabs at the time because the entire platform was built on influencers. If an influencer leaves Streamlabs' streaming service, they're taking thousands of followers with them. So these things became very, very important for Streamlabs, and I naturally became a part of it. As I started to work, it was very clear that the need for AI and data science projects was there, but the talent wasn't. They had hired me, and who was I? I was just some dude in college doing an internship. And it wasn't really about the budget or something; it was really the fact that finding the right kind of talent, with the right kind of industry background and domain expertise coming together, is really tough. And getting that streamlined is very, very tough.


[00:03:09] That's when it really started: can we build a tool that allows anyone that's a domain expert, an entry-level analyst that's been in the industry for a bit without a technical data science background, to build their own models without writing code? So that's where the idea of Obviously AI really started. We could have taken Obviously AI in multiple different directions. We could have focused on vision AI or natural language processing, a bunch of different things. But it was very clear from the beginning that the majority of data science problems that exist today are mostly in very niche supervised learning on tabular data. So that's where we decided to double down on AutoML as our key expertise, built it out from scratch, and really built something that was breaking industry standards. Today we are the fastest, most precise tool in the market, and that's something that we've always aspired to continue.


Simba Khadder: [00:03:58] AutoML I've always found to be such an interesting space. And your story, what you talked about, was the story of, hey, analysts at a company, some data scientists that can do a little bit of machine learning, but AutoML would kind of give them superpowers, for lack of a better way to put it. Nowadays, obviously, that's still true in a lot of cases. I'm curious, I guess, as people progress and as there are now more machine learning teams, what's the future of AutoML? Does AutoML replace all custom machine learning? Do they live side by side? How does it look?


Nirman Dave: [00:04:29] Great question. We really like to see Obviously AI, and the AutoML industry in general, as more like a calculator for data scientists. Think of it as like a calculator for accounting teams. It wasn't necessarily replacing any of the work that the accounting teams did, but it accelerated it. So we see that being very similar: accelerating data science teams, or even beginner-level data scientists, to do the work really quickly. In terms of where we fit, it's pretty interesting, because I have the privilege to talk to so many data analysts every day. We talk to at least a hundred, whether it be users on free trial or customers. I get to talk to a bunch of those folks. And what we really learned is that there are two things happening that are very interesting to look at. The first one is what we call commoditization of AI, which is probably something you've heard a lot in this space. What that really means is that back in the day, or a couple of years back, companies really cared about things like proprietary AI models. The majority of that thought process has changed a little bit. Most companies today don't really care about proprietary models. They care about something that can really accelerate and give them ROI very quickly in terms of AI models.


[00:05:35] So with AutoML, that's really where it fits in, and it can very much coexist with other proprietary models that might be there, like face recognition or something like that. But the majority of business use cases are really just, how can we expedite them and how can we get to ROI quickly? And that's where we see AutoML fitting very well. The second one, as I talked about, is democratization of AI, which is the fact that companies don't have the resources to bring on board PhD machine learning engineers and find the right kind of engineers that have the right kind of domain expertise. So what a lot of companies are doing today is just getting these entry-level analysts, like Streamlabs got me. And where AutoML fits very well is that these entry-level analysts now have the superpowers of a PhD machine learning engineer to do things very quickly. So that's how we really see AutoML fitting in. It's not necessarily replacing any algorithms; it can really coexist with custom algorithms, or very, very proprietary algorithms, that companies might have.


Simba Khadder: [00:06:30] I love your calculator analogy. I think that's interesting. I don't know if I've really heard it put that way before. It's also that knowing when and where to use it, and what to plug into the calculator, is the hard part. The actual calculation kind of becomes, who cares, just throw it in the calculator. That's super interesting. Like you said, there are so many use cases for AutoML. I'm just curious, and you talked about how many people you've worked with: could you share a crazy story of someone using AutoML for something you just had never considered?


Nirman Dave: [00:07:01] Hell yeah. We have so many use cases that we've come across. I'm going to try to see if I can name a few. The first one is we worked with this cookie company, just like Insomnia Cookies, where they have hundreds of stores across the country. It's a very similar cookie company. They use us to predict how many cookies are wasted per flavor, per day, per store. Previously, they used to waste around 36 items, and now that has decreased to ten. It sounds like a small number, but when you look at it at scale, that is a little over 100,000 cookies in wasted ingredients. That's been pretty exciting. We have an agricultural company that uses us to predict the yield of crops on a farm, which we never expected. Similarly, micro-lending companies are using us to predict credit risk, and then a slew of SaaS companies are using us to predict things like churn, retention, upsell, things like that. So very horizontal, very interesting use cases. But I'd say the one that's always taken me aback has been the agricultural one, where they actually put in details about the fertilizer, the pH of the soil, things like that, and then use that to predict crop yield on the farm. That's one of my personal favorites.


Simba Khadder: [00:08:01] So you know what cookies they have left in every single store?


Nirman Dave: [00:08:05] I can kind of tell that.


Simba Khadder: [00:08:08] That's awesome. That's a cool story. I love that a lot. People talk about democratization of AI or whatever, but a cookie company using it to decide what cookies are out of stock, that's democratization of AI. That's awesome. Maybe you could share more about how AutoML works? Not so much the algorithms, whatever. More like, do you have separate models? If I'm using Obviously AI, do I just throw in my data, or do I say, hey, this is time series data? How do I work with Obviously AI?


Nirman Dave: [00:08:34] Great question. So the way Obviously AI works is anyone can bring their own datasets from multiple different data sources. You can connect those to Obviously AI either through spreadsheets or databases like Snowflake, MySQL, or whatever you use. Once you integrate those data sources, you have the option to merge them if there's a way you would like to merge them. And then from that point onwards, you can decide if you want to build an AutoML model that does classification or regression, where you're predicting either a category or a number, or whether you want to build a time series model, where you're looking at how a value has changed over time. So you can choose which route you want to take. From that point onwards, you can either simply pick the values you want to predict for, go ahead and hit okay, and our system takes care of everything end to end, or you have the ability to go really deep and say, hey, I want to auto-impute some of these values, I want to handle the upsampling and downsampling, I want these kinds of hyperparameters to be in the model. You have that option to really dig deep if you want to.
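
As a rough illustration of the optional preprocessing knobs mentioned above (auto-imputation and up/down sampling), here is what those steps might look like with generic scikit-learn tooling; the column names and data are made up, and this is not Obviously AI's actual implementation.

```python
# Illustrative preprocessing only (not Obviously AI's internals): median
# auto-imputation of missing values, then upsampling the minority class.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan, 45],
    "income": [52_000, 61_000, np.nan, 48_000, 75_000, 58_000],
    "churned": [0, 0, 0, 0, 1, 1],  # imbalanced target
})

# Auto-impute missing numeric values with each column's median
features = ["age", "income"]
df[features] = SimpleImputer(strategy="median").fit_transform(df[features])

# Upsample the minority class until both classes are balanced
majority = df[df["churned"] == 0]
minority = df[df["churned"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["churned"].value_counts())
```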


[00:09:25] But most of our users, when they're getting started for the first time, are just going through a simple process of saying, this is what I want to predict. Now, the funny part is, once that happens, the system automatically builds a model in less than a minute. That's why we are the fastest, most precise tool in the market. And the way we do it is very interesting. It's very different from the traditional AutoML structure. The traditional AutoML structure says that you're going to go through thousands of machine learning models, pick the one that has the highest accuracy, and display it to the customer. It's a brute-forcing method, and 90% of the time, 90% of those algorithms are useless, right? You're just building them for the sake of seeing if there's an optimization for a percentage increase in accuracy, things like that. It's not that super helpful.


[00:10:07] So with Obviously AI, we use something called Edge Sharp AutoML. What that means is that we actually look at the dataset you're bringing in, the size of the dataset, different properties like that. We look at the use case that you're accommodating for, and then based on that, it's actually going to shortlist the top five algorithms to use. So this could be a neural network, a random forest classifier, whatever that is. And then it runs each of those algorithms with thousands of different hyperparameter combinations and picks the one that performs the best. That's why we can give you something that is extremely fast and higher accuracy than any of the tools in the market. The other thing is also that we have a strong focus on supervised learning on tabular data. We don't do any kind of deep learning, audio, or vision kind of thing. That's been a more strategic business decision that we've taken over the years: to really focus on that supervised learning piece that allows our customers to have the best experience when they're using tabular data for their business. So that's how we think about AutoML. That's how it works on the back end.
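
Here is a hedged sketch of the general pattern described: inspect dataset properties, shortlist a handful of candidate algorithms, then run a hyperparameter search over each and keep the best. The heuristics and search spaces below are invented for illustration and are not Obviously AI's internals.

```python
# Sketch only: shortlist candidate algorithms from dataset properties,
# then tune each with a hyperparameter search and keep the best.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Step 1: shortlist candidates based on simple dataset properties
# (a real system would use far richer heuristics than row count).
candidates = [
    (RandomForestClassifier(random_state=0),
     {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}),
    (LogisticRegression(max_iter=2000),
     {"C": [0.01, 0.1, 1.0, 10.0]}),
]
if X.shape[0] >= 1000:  # only consider a neural net with enough rows
    candidates.append((MLPClassifier(max_iter=500, random_state=0),
                       {"hidden_layer_sizes": [(32,), (64,), (128,), (64, 32)]}))

# Step 2: run a randomized hyperparameter search per candidate, keep the best.
best_score, best_model = -1.0, None
for model, space in candidates:
    search = RandomizedSearchCV(model, space, n_iter=4, cv=3, random_state=0)
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, round(best_score, 3))
```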


Simba Khadder: [00:11:00] It must be really interesting for you, from a hiring and building perspective, because you're kind of hiring data scientists to build the model that builds the model. What does your workflow look like internally? Do you test on datasets you have? And also, how different is it from a traditional process? Is it different, or does it look similar?


Nirman Dave: [00:11:18] So it's pretty interesting, because we're not constantly building AI models for internal use, right? We're not building models to predict something, because to do that, we just use our own tool. So the models that the data scientists are really building are models that are designed to be productionized for everyone, designed to be productionized for the users. So what they're really engaged in most of the time is saying, okay, there are a bunch of these model templates, right? A template for a neural network could be a simple neural network with five layers. Let's say that's a template. Then they really spend a lot of time defining what the hyperparameters are that are going to be tuned, and what that search is going to look like. Is it going to be a grid search, or a different type of search on the back end? What does that specific piece look like? And that's really what we're productionizing. So when it goes to the customer, it's essentially an algorithm that is going to get customized for them automatically. So that's where we spend a lot of our time. Our tech stack, however, looks very similar to what you would see in industry, right? We've got Spark, we've got Jupyter Notebooks, TensorFlow, those kinds of things happening on the back end. So that's where some of the data scientists are really engaged.
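
To make the "model template" idea concrete, here is a hypothetical sketch of what such a template might contain: a model constructor, a hyperparameter search space, and a choice of search strategy. Everything here is an assumption for illustration, not Obviously AI's production code.

```python
# Hypothetical "model template": a constructor, a hyperparameter search
# space, and a search strategy. An assumption for illustration only.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

@dataclass
class ModelTemplate:
    build: Callable[[], Any]            # how to construct a fresh model
    search_space: Dict[str, List[Any]]  # which hyperparameters get tuned
    strategy: str = "grid"              # "grid" or "random" search

    def tune(self, X, y):
        searcher = GridSearchCV if self.strategy == "grid" else RandomizedSearchCV
        search = searcher(self.build(), self.search_space, cv=3)
        return search.fit(X, y).best_estimator_

# e.g. a simple neural-network template with its tunable knobs
nn_template = ModelTemplate(
    build=lambda: MLPClassifier(max_iter=500, random_state=0),
    search_space={"hidden_layer_sizes": [(32,), (64,), (64, 32)],
                  "alpha": [1e-4, 1e-3]},
)
# best = nn_template.tune(X, y)  # customize the template for one dataset
```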


Simba Khadder: [00:12:20] Did you have to build any custom tooling, like maybe for dataset versioning? I'm just curious, have you had to build anything custom to allow you to iterate as you've gotten more scale?


Nirman Dave: [00:12:29] So not yet. We have plans of customizing some of the internal workflow that we're doing, but that's something that we haven't been hitting on just yet, so we'll probably be getting there soon.


Simba Khadder: [00:12:38] So a lot of what you're doing, you're taking things more or less off the shelf. Like you said, you're using Spark, using TensorFlow, etc. You've been able to build the systems you need to get that to work in this AutoML use case.


Nirman Dave: [00:12:50] Correct. I mean, it doesn't really make sense to build TensorFlow from scratch or build neural networks from scratch. Even though it was fun in that PhD class I was in, it doesn't make sense on an everyday industry basis. Where the real value comes in is the ability to automatically find the best hyperparameter combinations, do it extremely quickly, and give the best ROI, the best accuracy, for the user. So that's where we focus the most, rather than reinventing the wheel on the other side.


Simba Khadder: [00:13:17] Got it. With the way Obviously AI works, is there ever an idea of like a real-time model? Like, the idea that the model is constantly deployed?


Nirman Dave: [00:13:24] 100%.


Simba Khadder: [00:13:24] So you do deploy models, too?


Nirman Dave: [00:13:26] Yes, that's correct. So, stepping back here: in a typical analyst journey, about 70% of the time is spent prepping the data. That is any kind of transformation work that you're doing on the datasets, all that fun stuff; about 70% of the time is there. 20% of the time is actually building models, right? Trying out multiple different AI models, tuning hyperparameters, manually checking what works and what doesn't. And 10% of the time is arguing with the DevOps engineer about why it's not deployed correctly, because something's going wrong. All in all, it takes about six to eight months to get anything into production, right? So the entire idea behind Obviously AI is not just to do some predictions and get back to you, but really to help you in that end-to-end process. Models are built, they are automatically deployed in a single click, and they keep getting better over time. Which means, if you've integrated it directly with your Snowflake or your database, it's pulling the latest data, retraining the models, continuously getting better, and you're seeing that version control: the version changes, the accuracy changes over the versions, things like that. I think that piece is the most critical one, because just building a model isn't relevant for most of our users today. Really putting it to use in a way that they can seamlessly use it, and that other people on their team can seamlessly use it, is where we really focus.
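
Here is a minimal sketch of the retrain-and-version loop described above: pull the latest data, retrain, score, and record a new version with its accuracy. The data source, column names, and registry are hypothetical stand-ins, not Obviously AI's implementation.

```python
# Sketch of a retrain-and-version loop: pull the latest data, retrain,
# score, and record a new version. Data source and columns are made up.
from datetime import datetime, timezone

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

versions = []  # in practice a model registry, not an in-memory list

def retrain(get_latest_data):
    df = get_latest_data()  # e.g. a fresh pull from Snowflake (hypothetical)
    X, y = df.drop(columns=["churned"]), df["churned"]
    model = RandomForestClassifier(random_state=0).fit(X, y)
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    version = {
        "id": len(versions) + 1,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "accuracy": accuracy,  # compare accuracy across versions
    }
    versions.append(version)
    joblib.dump(model, f"model_v{version['id']}.joblib")  # persist this version
    return model, version
```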


Simba Khadder: [00:14:37] That's awesome. And for that kind of stuff as well, have you had to implement any sort of MLOps tooling, like for serving? What kind of stuff do you use?


Nirman Dave: [00:14:44] Yeah, so we use a bunch of different MLOps workflows on the back end. That's not my best area of expertise; that's something my CTO could talk about most. But as far as I know, we use [] clusters to deploy all our models, and that's kind of what we do.


Simba Khadder: [00:14:59] That's awesome. So, continuing on MLOps a bit: when you were at Streamlabs, you mentioned a lot of your work was building these models and getting them productionized. Nowadays, you've seen this across many, many different companies over the years, too. What's changed over time? What's, I guess, happening in the market from your perspective?


Nirman Dave: [00:15:18] I think one of the major shifts that's been happening is how much clearer AI use cases have become. Previously, a lot of the workflow happening in the market was very exploratory. People were like, okay, let's throw in a few algorithms and really see what they come up with. Now it's more of a structure of, we want to achieve X, and how do we get there? So that's something that we've seen a lot. In terms of tooling, the other important thing that we've seen is that previously, a lot of data analysts, talking about analysts in specific, really cared about upping their background and knowledge in machine learning code. So they'd be like, hey, I want to learn TensorFlow, and I want to learn how to deploy these models. What does that look like? So they would spend a lot of time and energy there. Now that has really shifted to saying, I probably don't need to learn TensorFlow; how can I learn to evaluate the models the best? I think that shift has really happened, where people are saying, I don't want to get into the nitty-gritty details of things, but I need enough knowledge to evaluate whether whatever I'm doing is in the right direction. So those are some of the key changes that I've seen since my time at Streamlabs, where a lot of people would want to engage in the nitty-gritty. Now an average data analyst at any company says, I don't want to engage in the nitty-gritty, I just want to get to this result, and I want all the information in front of me if I need it. So that's one of the key changes that we've seen so far.


Simba Khadder: [00:16:39] Yeah, it's super interesting that you say it that way, too, because a lot of people we've had on the show come from an MLOps background, or more of a deep data science background, not as much the analyst side, but they're saying the same thing from a different perspective. It kind of follows that AI has moved out of the lab, so to speak. People are actually doing things with machine learning. It's almost like it's not boring yet, but it's on its way to becoming boring, like what happened with web dev after the dot-com boom: it went from, oh, this is the coolest thing ever, to, now I have to write []. No offense to any full-stack people listening. Yeah, that's really fascinating when you put it that way. One thing that I've found: I built recommender systems before; when we met, I was working on recommender systems. A lot of the work I did as a data scientist was taking my domain knowledge and trying to inject it into the features or the architecture of the model, and sometimes into the embeddings, kind of both. And I always felt like that was the goal. The one thing you can't abstract away is the domain knowledge.


Nirman Dave: [00:17:47] Absolutely.


Simba Khadder: [00:17:48] But like you said with the calculator analogy, once you have the domain knowledge, that's something you can pick up and learn. As the data scientist, the goal is, can I give you an interface so you can plug in that domain knowledge and it will kind of output a model? Is it fair to think of it that way?


Nirman Dave: [00:18:05] Exactly. That's exactly where the AutoML space is today. Right. Going back to the calculator analogy: an accountant still needs to know how a profit and loss sheet works, or how a balance sheet works. They still need to know what to put in, how to differentiate the line items, and take it from there. So they still need to know those details. They still need to know your business to truly understand the details of all the work that you're doing. Only the calculation part becomes so streamlined that it's just an extension of their own self. And we're seeing something very similar with, as you pointed out, data analysts. Today the data analyst still needs to have that domain expertise. When you come in and you bring in a dataset: we have a company we work with that was doing loan repayment predictions, and it took a lot of domain knowledge to really define what the stages of loan repayment are. Is it repaid? Is it defaulted? And there are multiple steps in the middle that come through, but that can only come through domain expertise. Once you have that, then you can run a classification model really quickly to maybe test what kind of user is coming in and what their predicted default rates are, things like that. But the ability to really thoroughly think through how we're going to put this data together, and how I'm going to evaluate the outcomes, does require domain knowledge. And that's where the accountant analogy and the calculator analogy come in. So, very similar to what you're saying, that's really where the world is going today.
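
As a small illustration of that domain-knowledge step, here is how loan repayment stages might be encoded as labels before a quick classification run; the stage names and columns are hypothetical, not the actual company's schema.

```python
# Illustration only: encoding domain-defined loan stages as labels before
# a quick classification run. Stage names and columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Domain expertise: which raw statuses collapse into which outcome
STAGE_TO_LABEL = {
    "repaid": "repaid",
    "current": "in_progress",
    "30_days_late": "at_risk",
    "90_days_late": "at_risk",
    "charged_off": "defaulted",
}

df = pd.DataFrame({
    "income": [40_000, 85_000, 52_000, 31_000],
    "loan_amount": [5_000, 20_000, 8_000, 12_000],
    "status": ["repaid", "current", "90_days_late", "charged_off"],
})
df["label"] = df["status"].map(STAGE_TO_LABEL)  # apply the domain mapping

model = RandomForestClassifier(random_state=0)
model.fit(df[["income", "loan_amount"]], df["label"])
print(model.predict(pd.DataFrame({"income": [60_000], "loan_amount": [10_000]})))
```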


Simba Khadder: [00:19:22] What about now? What does the day of an analyst look like?


Nirman Dave: [00:19:26] Great question. So a typical analyst, they come into work and they've been assigned a project to kick off. Let's say loan repayment prediction. One of the first things they're going to sit down and do is look at, okay, where is all my data? What data do we collect? What data do we have? What data should we be putting into the model? Do I want to put in demographics like age, gender, location, income? Do we want to put other things into the model? That's one of the first decisions they're making. The second thing they're working on is really, how do I pull this data together, and how do I bring it into a place that's going to be really meaningful for me? So that's one of the key places they're going to be spending time. Then they're actually defining different A/B tests. They're saying, okay, let's build a model that's going to include age. Let's build another model that's not going to include it, but is going to include location, things like that. Defining those A/B tests becomes really important for a day in their life, because that's really going to define the business outcome that you're looking for. And then with AutoML, they run all of these models extremely quickly, in minutes.


[00:20:22] So they're building these models in a minute and seeing the results. Then they spend a lot of time evaluating the results: looking at precision, recall, F1 score, and the AUC curve, seeing if all of these things are aligned; looking at, let's say, the confusion matrix, making sure there are fewer false positives and false negatives; really evaluating the models themselves, trying out different models. Let's say Obviously AI automatically picked XGBoost, but they're like, no, no, no, let me see if there's a different result that a random forest can come up with. So those are the kinds of things they really engage in on a day-to-day basis. And then they say, hey, finally, after doing all these tests, this seems to be the model that works the best. This is the model that, let's say, doesn't have certain features, or these are the tweaks that we've done. This is the test that we have set up that seems to work, and now let's productionize it. So that process, going from here's a problem, to what data to use, to really building out tests, to saying these are the best models: those are the kinds of decisions they're making on a day-to-day basis.
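
Here is a rough sketch of the feature A/B test and evaluation workflow described: train the same model with and without a feature (here "age") and compare AUC, the confusion matrix, and precision/recall/F1. The dataset is synthetic and the column names are made up.

```python
# Sketch of a feature A/B test: train the same model with and without one
# feature ("age", hypothetical), then compare the evaluation metrics.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
df = pd.DataFrame(X, columns=["age", "income", "location", "tenure",
                              "usage", "plan"])  # synthetic stand-in data
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)

for name, cols in [("with_age", list(df.columns)),
                   ("without_age", [c for c in df.columns if c != "age"])]:
    model = RandomForestClassifier(random_state=0).fit(X_train[cols], y_train)
    preds = model.predict(X_test[cols])
    proba = model.predict_proba(X_test[cols])[:, 1]
    print(name, "AUC:", round(roc_auc_score(y_test, proba), 3))
    print(confusion_matrix(y_test, preds))       # false positives/negatives
    print(classification_report(y_test, preds))  # precision, recall, F1
```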


Simba Khadder: [00:21:17] Yeah, it's super interesting. One thing that we've run into with users of ours is they start to see the features and the data as the domain knowledge. That's where you can inject your domain knowledge. And what you're describing is very similar. I mean, there are kind of two parts to it. One part is the domain knowledge: what really makes sense. The other part is this organizational knowledge. So much work goes into just, what data exists? Where is it? How can I use it? How do I even get it out of SAP so I can fit it into whatever? So does that vibe with how you think about the space, too? Do you see the features as being the part that the data analyst owns, that's their ownership, that's what they try to do?


Nirman Dave: [00:22:00] 100%. That's probably why I really love the direction that you guys are working in: that part of really thinking through features becomes critical for an analyst's journey. Again, like an accountant thinking through the line items: what is the business, what are the line items that matter, which are important? Very similarly, for an analyst, it's the features. So, 100% with you.


Simba Khadder: [00:22:21] That's awesome. I feel like I could talk to you all day about ML; I'm sure there's so much more I could continue to learn about it. But for people listening at home, if you had to give a tweet-length takeaway on AutoML, if someone goes back and is like, hey, this is the tweet that encapsulates this podcast, what would it be? What do you think?


Nirman Dave: [00:22:38] If you're a modern data analyst, AutoML is your new calculator.


Simba Khadder: [00:22:43] I love that. Yeah, it wasn't on purpose, but it really struck me. I really like that analogy. It's so awesome to have you on. It's great to catch up. I love what you're working on. I'll have some links so people can go check it out. Thanks again for coming on and sharing with us today not only more about AutoML, but also how it works under the hood.


Nirman Dave: [00:23:00] Absolutely. Thank you so much for having me.
