The LLM Revolution & the Future of Data with Josh Wills
Guest Bio: Josh Wills is an investor and advisor specializing in data and machine learning infrastructure. He was formerly the head of data engineering at Slack, the director of data science at Cloudera, and a software engineer at Google.
Hey, everyone. Simba Khadder here with the MLOps Weekly Podcast. Today I have the pleasure of speaking with Josh Wills. Josh is an investor and advisor who specializes in data and machine learning infrastructure. He was formerly Head of Data Engineering at Slack, the Director of Data Science at Cloudera, and a software engineer at Google. He also is famous for his hot takes on Twitter. Josh, great to have you here today.
Simba, thank you so much for having me.
I just gave a quick introduction on you, but I would love to start by... You've worked across Google, you've worked at Slack, you've worked at Cloudera, you've done a lot of stuff in your career. I would love to learn about maybe some of the hardest data problems you faced or had to solve in your career.
Oh, man. Hardest data problems I've had to solve. That's a-
Or some interesting one you think that people would enjoy.
That's a tough one. I don't know. I've talked about a few of them before. I think the hardest data engineering challenge I ever had was rebuilding Slack's search indexing pipeline, which I've talked about a little bit here and there before. I feel like as I reflect back on my career, that was still the Mount Everest problem for me. Or maybe K2 is a better analogy, an even more technically difficult mountain climb.
Just because of the scale of the problem, hundreds and hundreds of terabytes, it's got to be up on the order of petabytes of data to index right now. When you're dealing with data sets that large, you encounter literally every single thing that can possibly go wrong. One-in-a-trillion things happen to you because you're processing a trillion records. Every single thing that can go wrong goes wrong. In terms of the technical challenge, in terms of the impact, in terms of improving the performance of Slack search and stuff like that, it's still, I think, the most meaningful problem to me. But that's from a technical challenge perspective.
From a people challenge perspective, which I think is... If you talk to most people who've done this stuff for a while, they would say the people problems are far more persistent and far more difficult and stuff like that.
The thing I'm proudest of is introducing at Slack, very early on in my tenure there, right when I joined back in October of 2015... The very first thing I did was introduce the notion of what I think people now call a data contract, which tied our production web application systems to the data they generated and sent to our data warehouse, using Thrift schemas, and locked that stuff down really early. It's one of those things that prevents so many problems and so many challenges down the line. Getting that done is maybe the thing that, organizationally speaking, I'm proudest of, I think. Yeah, I don't know. What else would you like to talk about?
There's so much more. I have a billion questions about the search index. But I actually want to jump into data contracts because it's been a hot topic as of late.
Well, first, how would you define a data contract?
It's a great question. My own personal definition, which is a little different than everybody else's, is that a data contract is an integration test. I find that to be the best analogy for it, because an integration test is something that's obviously well understood in software systems, and we've been doing it for a very long time. I have multiple production components. I test each of them individually, obviously, but I need to make sure that they work together before I push changes to production. That's what continuous integration is.
For me, that's really what data contracts are. Data contracts are a signal, first and foremost, that your data warehouse and your data pipelines are production infrastructure, which is not true a lot of places, and that's fine. There are a lot of places where I just need to do some basic reporting, and if the reports go down for a day, it's not the end of the world. Then on the other hand, you have everyone else in the world who's doing hardcore machine learning stuff, where if the data pipelines go down, then production goes down with it, or starts degrading in really nasty ways.
If you're in that state, if you're in a situation where your data warehouse and your data pipelines are production with a capital P, therefore, you must have integration tests between your upstream production systems and your downstream production systems, just as you would for any other set of components.
The trick, I think, why we need the term data contracts or why this is hard, is that it's been incredibly difficult to really do proper integration testing between the world of production and the world of data. At Slack, we had a web application that was initially in PHP and over time migrated to Hack, which is Facebook's improved PHP implementation.
Then we had a downstream data warehousing system, which was a fairly classic Netflix-style data lake: Parquet files in S3, lots of Spark, lots of Hive, lots of Presto. These are two completely different, absolutely gigantic engineering systems. Moving data between these two systems involves a Kafka broker that's going to process 250,000 events per second. It's got all this massive data processing infrastructure around it.
And so how do you come up with a way to do a relatively fast, relatively lightweight set of tests and verification checks between these two systems so you can ensure that when you make a change upstream, you're not breaking stuff downstream? That's the trick, and I think that's the area where a lot of people are still searching.
When I see people doing data contract stuff, even now in 2023, they're still pretty much inventing their own way of doing it. They're inventing their own interface definition language. They're inventing their own set of tests and stuff like that. That's just where we are. We haven't settled on a standard for this yet, and we simply have not made this easy in any way, shape, or form for people to do without a huge amount of engineering effort, and that sucks. This is not a good place to be, anyway.
So we've had Great Expectations for a while.
We had Great Expectations. Absolutely, that's right.
And we've had, even in Kafka, we've had schemas in Kafka for a while.
What you're describing seems like it's simultaneously both of those and more. I just would love to understand: if someone's like, "Hey, I have a Kafka schema and I have Great Expectations," is that a data contract? Is it not? What's missing from those two?
Okay, it's awesome. Fantastic question, Simba. The key differentiator for me, what differentiates it from your standard Kafka schema, Great Expectations, dbt tests, whatever it is you do, is basically where is that test happening? Is that test happening prior to a change going to production? At Slack, you could not push a change to production unless the data schema test passed. It simply would not go through. It would fail. It would block your deploy. Or is it the case that you don't find out about the change to the schema or the Great Expectations test failure until 24 hours later when the data pipeline is running? That is the key differentiator to me.
If the tests happen before the push to production and can block the push to production, it is a data contract with a capital D, capital C. It has teeth; it enforces a blocking change. Whereas if it happens 24 hours later, then it's an audit, it's a test, it's a check. Again, that's not to say it's not important. It's not to say we don't need to do it. We do. We absolutely do, because stuff is still going to get through. But for me, it really is that prior to changing the system in production, we make the check. That's the key quality for me.
Again, because the data infrastructure has been so massive and so big for so long, it's just been hard to do that. You can't realistically run a Fivetran data extraction and a whole Snowflake pipeline on every single production change when you're doing condition [inaudible 00:08:15]. No one has that much time and money. You could never get anything done. That to me is why we have not done this historically. That to me is the clear differentiator, and to me, if you have those checks ahead of time, then your data warehouse is production, and if you don't have those checks ahead of time, then your data warehouse, while still important, is not production. It just isn't. Period. Yeah, that's me.
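[Editor's note: the blocking pre-deploy check Josh describes could be sketched roughly like the snippet below. The field names, types, and helper function are hypothetical illustrations, not Slack's actual system; the idea is simply that a CI step compares a proposed event schema against what the warehouse consumer expects and fails the build on a breaking change.]

```python
# Hypothetical sketch of a pre-deploy data-contract check: compare the
# schema a producer is about to ship against the schema the downstream
# warehouse loader consumes, and block the deploy on any breaking change.

PRODUCTION_SCHEMA = {  # what the warehouse loader currently expects
    "user_id": "int64",
    "event_type": "string",
    "ts": "timestamp",
}

def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return a list of contract violations introduced by `new`."""
    problems = []
    for field, dtype in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != dtype:
            problems.append(f"type change: {field} {dtype} -> {new[field]}")
    # Added fields are allowed: they're backward-compatible.
    return problems

# A proposed producer change that silently turns a timestamp into a string:
proposed = {"user_id": "int64", "event_type": "string", "ts": "string"}
violations = breaking_changes(PRODUCTION_SCHEMA, proposed)
if violations:
    # Running as a CI step, this is what gives the contract "teeth":
    # the deploy is blocked before production changes, not audited 24h later.
    print("DEPLOY BLOCKED:", violations)
```

The key design point, per the conversation above, is *when* this runs: in the deploy path, where it can block, rather than in the nightly pipeline, where it can only report.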
I think what I'm seeing a lot of recently... I think we're seeing this paradigm shift where there's almost been this dichotomy between production data pipelines and experimentation, where we're just learning about the data, understanding it, analyzing it, playing around with it. Especially in ML, there's a very clear experimentation step before you get to production. I feel like tools have always picked one side of the fence: are you an experimentation tool or are you a production tool? What's been missing, in my opinion, is workflow tools that make sense on both ends. They are what you would be doing in experimentation, but they're inherently thinking about productionizing.
I completely agree with you, I think. I think that's exactly right. I feel that tension. I think a lot of folks do. This is something Hamel Husain and I, who does a lot of notebook stuff and nbdev and stuff like that, have talked about a lot, because he is deeply interested in this divide. I think a large part of it, Simba, is that software engineers, generally speaking, do not grok the experimental, interactive nature of a lot of data work, especially a lot of machine learning work. It just does not make sense to them because it doesn't describe their work. They use an IDE, they don't use a notebook, and it just does not compute. It just doesn't.
Simultaneously, I think folks who do a lot of experimentation and interactive development stuff do not have a great mental model for how to do automation and reproducibility and stuff very well, so we're stuck here: these two worlds just completely talking past each other, and it's deeply, deeply frustrating for folks. I know it's frustrating for everybody. I should probably be doing more here. I feel I've been fortunate to live on both sides of this divide.
I don't know the solution here. I'm open to suggestions, I guess, is what I want to say here. I'm not saying this is an easy problem. If it was, we would have solved it already. It's legit hard. How do you respect and enable and support that experimental, iterative, try-it, almost flow state, in some sense, of working with a data set, working with a model, while also being militant about reproducibility? It's just hard, man. It's just hard. Do you have thoughts here? Have you all thought about this? I'd just be curious. I don't mean to turn the interview back around on you, but I'm very open to ideas here is what I want to say.
Yeah. I mean, it's a big promise of what we call the virtual feature store to solve that piece for ML. The whole concept is that when you're iterating, all that we, let's say, force upon the data scientist is that you use this almost function framework. Rather than just writing your query or your transformation raw in a notebook, you wrap it in a function, and that function name becomes the name of that transformation. You can later add versioning and other things to fully productionize it.
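[Editor's note: the function framework Simba describes might look something like the sketch below. This is a hypothetical illustration of the pattern, not Featureform's actual API; the decorator name, registry, and example function are all invented for the example.]

```python
# Hypothetical sketch: a decorator that registers a notebook function as a
# named, versioned transformation, so the same code a data scientist writes
# while iterating can later be picked up and deployed from the registry.

REGISTRY: dict = {}  # (name, version) -> function

def transformation(version: str = "v1"):
    def wrap(fn):
        # The function's name becomes the transformation's name.
        REGISTRY[(fn.__name__, version)] = fn
        return fn  # still directly callable in the notebook
    return wrap

@transformation(version="v1")
def avg_purchase_amount(rows):
    """Average purchase amount over a batch of transaction rows."""
    amounts = [r["amount"] for r in rows]
    return sum(amounts) / len(amounts)

# In a notebook you just call it while iterating; in production, a runner
# would look the same function up in REGISTRY by (name, version).
print(avg_purchase_amount([{"amount": 10.0}, {"amount": 30.0}]))  # 20.0
```

The design choice is that experimentation and deployment share one artifact: the decorated function costs the data scientist almost nothing during iteration, but gives the platform a stable name and version to productionize later.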
The goal would be over time that the iteration and the deployment... There would be slightly different modes, but I almost liken it to Django or something, where if you follow this framework, it will automatically make it very easy to productionize while making it feel like you're writing this experimentation code. Now, the difference between a Django and what we're doing is that, like you said, Django is a linear process. I need this new REST call, it does XYZ, where with data science and with feature engineering, there's a lot of like...
If someone told me, "Hey, I spent a month going down this project," and realized with the data we have and everything, it'd actually be impossible to create a model that does this, I'd be like, "Great, that's a good use of time. We figured out this thing's impossible." If a software engineer told me, "Hey, I took a month doing this, and we're throwing it all away," I would be like, "Dude, no, you can't." [crosstalk 00:12:33].
That's right. See, that's exactly it. That's the key right there. That's a great point. Yeah, totally. I find it very instructive to think of extreme cases of people tackling this problem. I think a lot about Netflix in this way, and the work they have done to make notebooks production things. Papermill, I think, is the name of their tool, and they have some other stuff like that.
Then the other thing they did, which still boggles my mind, was, just to get the feature data for their training pipelines, they would literally query the production systems from the training environment. It's the thing that just sounds absolutely insane to me, but it's part and parcel of their chaos monkey engineering culture where, "Hey, the machine learning team is going to do 90,000 RPCs to your service in an hour. Hope that's cool." And it's just like, yeah, that's just what they do. Anyway, I get a kick out of stuff like that. I don't recommend anyone do that. It's absolutely fascinating the ways that they think about tackling these problems. Anyway, yeah.
Well, maybe a question there would be, where do notebooks fit in? Do notebooks fit in production? Obviously, I think we both agree that they're an integral part of the experimentation pipeline.
I mean, absolutely. Where do they fit in or should they goin production?
I think you're talking to me, Simba, when I'm mid-conversion. I have been a long-time notebook hater. I've been very anti-notebooks for a really long time, and I am basically slowly coming around, I think. Obviously, a lot of other people have been messing around with large language models a lot, and doing this stuff in a notebook is just pure joy. It's like me and my Jarvis hanging out, hacking together. You know what I mean? I'm just loving it, and so it's changing the way I think about this stuff.
I have not quite become the full-throated convert. There's no evangelist like the convert, or whatever the quote is. I haven't quite gotten there yet, but I do very much feel like I was wrong. Notebooks are actually great, and especially in the large language model-centric future I believe we are headed towards, they are going to play an outsized role in how we work, not just as data people and machine learning people, but just as people. As human beings, I suspect that some notebook-like thing is going to become a bigger way that we work going forward.
In that world, I think for me, it's time to reexamine systems like Papermill. It's time to really start thinking hard about how we make these things... I don't know. How do I make these things work as much like the boring, good old reliable cron jobs that I've written in Python and have been running for years and years and years? But I don't know that I have the answer yet, other than to say that I was wrong and I'm coming around. And I'm, again, super interested in figuring out ways to solve this.
I'm curious to get your feedback. I'm similar. I actually am like-
Kind of the same thing?
Yeah. I was like a very... Part of it, I think, is because I was in ML building recommender systems. We're doing a hundred million MAU. We're doing all kinds of transformer-based user embeddings and training, doing all kinds of fun stuff. I come from Black Hat Google. At Google, I wrote both PHP and x86 at different points in my career.
That's... Wow. Okay, good. Those are two truly terrible programming languages for you to know. I'm kidding.
I've worked on both. I've worked on... It's like the horseshoe.
Yeah, got it. Okay.
So I came from, like, I didn't use notebooks, I used Python files. That's how I worked. And I definitely was converted to, "Hey, notebooks are just such a better way to do this." But then I was in the same boat of, "But these things should be nowhere near production." My take now, which again, I'm curious to hear yours on, is notebooks are where you should be building things, but you should be taking the artifacts that you create and exporting them somewhere else.
Do you buy that?
I broadly do. I think that's where I've ended up with this stuff. And again, just hanging out with folks has introduced me to nbdev and Quarto and all these other mechanisms for taking my exploratory stuff and then crystallizing it into some structure that's designed for reproducibility and runnability without me sitting at the keyboard to guide the cell execution. I think that's right. What I don't feel like I've seen yet is, what's the right gesture, almost, in the way that GitHub introduced the pull request as their fundamental innovation, the social action you can take.
I have to be honest with you, whenever I hear about or read about a framework or something, immediately I'm just seized up with fear, because I feel like it's going to harsh my vibe. It's going to mess up my workflow. It's going to be this constraint that's going to... It's one of those things where, to a certain extent, a framework can give you freedom. The constraint can set you free in other ways and stuff like that.
But it's just my instinctive response of, "Hey, the notebook is this free-for-all exploratory whatever, and you want to come along and impose rules and strictures on me," and it's like, "I'm not going to be able to come up with the next great neural architecture because I'm going to be stuck." You know what I mean? Which isn't rational, but it's just my emotional reaction to that stuff. I'm vocalizing this. I don't feel like I'm the first person to feel this way, though. Does that make sense?
Yeah, I think it's totally true. I think what it is, we've all seen the MAD landscape thing. I always joke, if you take that and you have to narrow it down to products that data scientists love to use, actually truly love to use-
Truly love to use, exactly.
-it would be like 10. It just would cut down dramatically.
I think people who build dev tools are engineers. Engineers, as much as we like to pretend we're rational creatures, we are very, very emotionally driven-
Very much so.
-and we feel like, "Well, I like it this way, or I want it this way, so therefore, I'm going to do it this way. And if you don't agree with me, you're dumb."
I think that lots of the frameworks that get built, in general, tend to be overengineered. Because again, the other thing that we like to do is, if someone's like, "How have they done this?" We're like, "Cool, we can do that." And we never ask ourselves, "Should we do that?" Just because we can doesn't mean we should. And I think that a lot of times... We just did this meetup with Sebastián of FastAPI, and one thing I like about FastAPI is it's very simple. It's very lightweight. It doesn't feel like it's getting in the way.
It's just fantastic. I love it so much. Absolutely. You're talking about tools, 10 tools people love. FastAPI, without a doubt. One hundred percent in my ten. Absolutely.
I think most products wouldn't fit that. I don't think that's necessarily to say that, hey, frameworks are a bad thing. It's just getting a framework right. API design is one of the hardest problems.
People like to jump in and be like, "Well, just add this function call and it's done." It's an art and it's a craft, and people who are really good at it are few and far between. You see the same people who built Go, whether you like it or not, are also the same people who were huge in Unix. That's how rare it is to find good API people. We have to go source people from the olden days of [inaudible 00:20:16].
Yeah, exactly. Ritchie, absolutely. Like Rob Pike, yeah, totally. Yeah, I know what you mean. Sad but true, Simba. Sad but true. Alas, I like to think we can do better here. But okay, I hear you. I hear what you're saying. That's fair.
Yeah, I think that's all it is. I don't think it's to say that frameworks are bad. I just think having really good frameworks is really hard, and it's so hard that there's probably one in every hundred that even are-
One of my weird... Also, like, former Googler. I was there from 2007 to 2011. I had the privilege... I had two weird privileges at Google. One was Rob Pike did my first code review at Google. First code review. I added a couple of libraries to [inaudible 00:21:02] for doing various kinds of non-parametric correlation calculations, like Spearman correlation and stuff like that, for some work I was doing.
That code review, Simba, went on for 30 days. Thirty days, and maybe 50-odd revisions, Rob Pike. I'll never [inaudible 00:21:19] the last one he did [inaudible 00:21:20] change. I had some bounds-checking function I was using, and he had me reverse the arguments because he liked the way it looked better. I was like... It was amazing, Simba, that I did not quit after that code review. Looking back on it, if I had any... The good news is I was young and I had a very fragile ego and stuff. In retrospect, I probably should have quit, but I didn't.
The other thing, I was invited to the very first technical talk from Rob Pike and Russ Cox about Go. I got to go. It's like all the most senior engineers of the company, and also me, I'm also there for some reason, basically. I remember just the decidedly meh reaction from so many people at Google. We're like, "Yeah, Go, it's okay. I guess they may have gotten some things right," and stuff like that.
Over time, obviously, Go has been incredibly successful and has completely found its community and stuff like that. I think that's the other part here: the tool has to fit the hand. A lot of those hardcore C++ developers were just not the people that were... They were just never going to adopt Go. It was just never going to happen. They're all like, "Whatever, not that good." But then it found its community, and as it did, it grew accordingly. It's figuring out that match between the tool and the community to wield it. It's just so critical and so hard to get right.
Yeah. Actually, funny enough, a funny connection, I worked on the same floor as Rob Pike for a while. He was pretty close to me, so I used to play [inaudible 00:22:52] in the microkitchen. I luckily never had to deal with his code reviews, but was not surprised that that [crosstalk 00:22:59] would be the experience.
Absolutely brutal, humbling, humiliating experience without a doubt. Yes, exactly.
Yeah, for sure. But also you need to be that much of a perfectionist to build these sorts of APIs. It doesn't mean that you should be building everything, but if you're building, it's this funny balance.
I go back and forth, and I like to think that Google broke me of my egotistical identification with my source code. It broke me of the notion that the code I wrote was an extension of my own personality, and that anything wrong with it was something that was deeply wrong with me. Google, to its credit, in a basic-training way, broke me of that belief and showed me that it's not. This is a company. The source code is our product. We all build it together. We are all responsible for it, and so on and so forth. Just one of those useful life lessons we take away from these horrible experiences.
Yeah, I wasn't at Google very long either, probably for, it sounds like, similar reasons.
I was there for almost four years. Is that not very long? I don't know. Is that not?
I think that's pretty long.
It's probably pretty long, right? It felt long. I'm not going to lie, it felt long.
I want to jump into another... I want to take the conversation to LLMs because it's a hot topic. Well, first, tell me about them. Is this transformational? Is this a whole new paradigm we're dealing with? What's going on? What's your take?
Yeah, I mean, yes. The answer is yes. The hype around LLMs is so off the charts right now that you have to properly calibrate yourself in terms of where you are in the hype cycle. I guess I divide this into two main framings. The first is, this is as big of a deal as the mobile phone. This is the next dominant paradigm that will reshape society and stuff like that in the same way that the mobile phone did 15 years ago. That's one level of hype, and that's a pretty big level. That's a lot of hype. Mobile phones, really big deal. Changed a lot of stuff.
The next level of hype, though, and I have friends who inhabit this level, is this is electricity. This is akin to the light bulb, or Edison and Tesla and stuff, way back in the day. It's that level of impactful. I have some friends who are on that level. I think I've maybe flirted a bit with the electricity level of transformativeness. I don't feel like I'm there right now. I'm definitely well above the this-is-bigger-than-the-mobile-phone level of hype, but I'm not quite on the this-is-electricity level of hype. That's where I'm hanging out right now, somewhere in the muddy middle between those two extremes. It's a very big deal. I think it's going to reshape all kinds of things in ways that are very difficult to predict.
Just in terms of our own lives, it seems fairly obvious to me that we will all have... I mentioned the Jarvis thing before because obviously I pay for a GPT-4 subscription, and so I hang out a good part of the day talking to my GPT-4 instances. We're hacking away on stuff. I feel like having that Jarvis-level, in the Iron Man sense, relationship for everybody on our phones in the next year or two is just a given. I find it fascinating to think through the implications of that. If you and I want to do a podcast together, is it going to be our Jarvises coordinating the podcast? Is it going to be the Jarvises doing the research ahead of time? Will the Jarvises be here with us on the podcast weighing in on things? Do you and I even need to be here? Can the Jarvises just do it for us? All this stuff. This is the stuff I find myself wondering about these days.
Yeah, I can't wait for my Jarvis to be like, "Simba, that's factually wrong. I just looked [crosstalk 00:26:42]." Yeah, exactly.
Knowing myself as I do, my Jarvis will be absolutely hypercritical and borderline cruel to me. If I could come up with a software manifestation of my imposter syndrome, that's exactly what it would be. That's great.
That would be a good app name for it: Imposter.
I like the analogy electricity versus the mobile phone.
I think what's interesting there, which makes it maybe almost... It actually made the analogy even more interesting in my head. Electricity, like you mentioned, you go back to Edison. That's a long time ago, and it took a while before... I don't think even then there were people like, "Hey, this is as transformative as the oven." [crosstalk 00:27:32].
The wheel, exactly.
Yeah, exactly. I wonder how much... As things accelerate, maybe it's like the mobile phone now, but we'll look back and this was such a primitive form of what is to come. It's how I think of electricity, where... Obviously electricity was a huge foundational change, but I don't know if I've ever, until this conversation, really thought of it as more foundational than the mobile phone, as much as we just continuously build on [crosstalk 00:28:02] the shoulders of science over time.
For me, the book I'm going to recommend here, which I love, is called Empires of Light. There was actually a really bad movie based on Empires of Light. It was called The Current War, about the war between Edison and Westinghouse/Tesla over direct current versus alternating current. Again, I worked at a company called WeaveGrid for a couple of years after I left Slack. I was selling software for utilities for managing electric vehicles. I have a weird nerdy obsession with the utility grid and the history of the grid and stuff like that. That fed into my desire to work on this kind of stuff.
Yeah, Empires of Light, absolutely a fantastic book about the early stages of the industrial use of electricity and how it changed society. The thing, I guess, that was most fascinating to me about it was really... What electricity did first was it allowed us to create time, in the sense that electric light made days longer. That's just one of those things that's hard to wrap your head around at first. It literally created time. Hours of the day that were not available for doing things suddenly were available for doing things, which is amazing.
The other thing which is remarkable about it is how long it took for these innovations to reach everywhere. The start of the book talks about J.P. Morgan having the first electrified house in New York City and literally running a coal plant in his basement to power the thing. He has coal and smoke belching out the back of his house from running this electrical system.
But then it was not until the New Deal, which was 50 or 60 years later, that electrification reached out into rural environments in the United States and basically everybody had electricity. It was not until the 1940s or '50s. It took a really long time for this stuff to get out everywhere. I think about this stuff with LLMs a lot. We're in the early days of, you have an LLM, I have an LLM. It's obviously spreading so quickly, but I'm just curious to think about how long it will take to reach everyone in the world, where literally everybody has a Jarvis. Every human being on the planet has a Jarvis. Is it 60 years? Is it a decade? I'm like, "What is it?" I'm not really sure, anyway.
Yeah, it makes me think of two things. One, with regards to the time, the increase of time. Also, a quick side note, I feel like so much of how the US works today was foundationalized during that industrial revolution.
It's really interesting to read about all that stuff because it's almost like this long-running cascading effect that you can almost trace back to these innovations. Literally from how finance works all the way to, like you said, electricity. All the foundations were set, I guess, over 100 years ago now. The time example is interesting because it relates to, well, even if you have more time, you still go to sleep, and it goes to what we're catching up on. I think it almost applies here about LLMs too, where it's just because we can do all these things... The one thing that we can't easily update yet, and who knows, is the wetware, our cells, essentially, how our brains work and process and do these things.
I think it'll be interesting to see that come into play too, especially if we start to get to this point where we do find LLMs that are, I guess, stronger. Like you mentioned, having them do the podcast for us. I think we're a ways away from them being that powerful. But it's definitely opened the questions up as we hit this new inflection point of... I liken it to the new technology. It's almost like the straw that broke the camel's back, in the sense that, from a research perspective, the techniques that we're using to create these LLMs aren't necessarily...
They're new, but they're more of an extension of something. It's not like this is a whole new, "Oh, we just changed everything from yesterday to today. We just invented this new concept." It's more that we've gotten so good, have fine-tuned it so much, and gotten so good at building these transformer-based architectures that we have now passed this, I guess, valley of disbelief when humans interact with it. It's just gotten so good that it's like, "Cool, I see the light now." It's not you pitching me something and then I look at the product and I'm like, "Yeah, no, this isn't AI." It's finally like, "Okay, it's still imperfect, but it feels unlike what we were seeing before."
It's certainly imperfect. I think on the podcast point, there's now the Joe Rogan AI Experience, which is a purely AI-driven podcast featuring an AI Joe Rogan talking to an AI Sam Altman. This is here. This happens. You and I obviously are not important enough to have an AI talking for us, basically. We're just two idiots, right? But it doesn't seem that far off to me. Before long it's, "I'm happy to just send my AI over to you. Here's your budget for how much compute you're allowed to use to have me generate answers and stuff like that."
I think that actually opens into the next point, which you mentioned, about use. I actually think that a lot of what we're doing now is essentially subsidized by Microsoft and others.
Without a doubt.
There's no way that they're making a profit off of what you're spending for GPT-4. That's only the cost of goods. That's just literally the electricity and the GPU time, not counting all the work that goes into actually building these things. I think that's going to be a big question too: even if this is foundational, once economics come into play, it becomes a lot harder to justify having this constantly running large neural network that answers every one of the ridiculous questions you come up with while you're standing around. I think that's where a lot of the professional use cases are probably going to find more widespread use, because you can make the economic argument there. I feel like the personal Jarvis will be a thing that is left to people who can spend 10, 20-
Yeah, like the J.P. Morgans of our day, I think, at first. But people will just be super incentivized to figure out how to make this stuff cheaper and how to make the hardware better and all that stuff. Is it tomorrow that we all have a Jarvis? No, of course not. Is it 10 years from now? Yeah, I'm pretty sure. Is it five years from now? Maybe. Not quite sure. It gets fuzzy. It's the cliché that you overestimate what can be done in a year and underestimate what can be done in a decade. I suspect we're in that part of the hype cycle right now: we're overestimating what can be done in a year and underestimating what can be done in a decade.
I agree. I think it's almost like CPUs, where we kept going and going on speed, speed, speed. Then all of a sudden we got to this point where it's like, hey, actually, speed isn't what matters anymore. It's energy efficiency. It's this other concept. I think that's where we just got. We're like, "Hey, these things are starting to get plenty accurate." Accuracy isn't really the issue anymore. It could always do better for specialized use cases, but it's probably a way higher ROI to figure out how we can do what we're doing now with significantly fewer parameters. I think it's been a known thing for a while, but we've finally gotten to the point where it's not just an interesting niche area of research. It's actually become, "Hey, if you pull this off, you're going to make a lot of money."
There's a pot of gold waiting for you. Absolutely. Exactly right.
Bringing LLMs back to data: how do you think this is going to change how people do data work? Is data just done for? Where are we at in that sense?
This is a question I've been thinking about a lot. I've been anticipating being asked this question on podcasts and stuff like that. You know what I mean? I've been giving this a lot of thought. I have two thoughts here. Again, this is early, and again, I could be completely wrong about this stuff.
My first thought, at least as it pertains to data and data analysis and so on and so forth: for a long time, data engineering has been focused on data modeling, and how do we model data for human consumption, for BI tools? We do LookML, we do semantic layers, we do dimensional models. We model data to make it interpretable and approachable by humans and so on and so forth.
With LLMs, I think you're starting to see that conversation shift. Basically, we're not going to model data for humans anymore. We're going to model data for the large language models. Large language models will be the target audience for our data models. Some folks have been talking about activity schema. The activity schema pattern is perhaps a better model for how we make data consumable, because the large language model doesn't really care about a bunch of stuff that a person cares about.
A large language model doesn't care that a table has a thousand columns in it. It doesn't care. It doesn't make any difference. It makes a lot of difference to a person. A person sees a thousand columns and thinks, "Oh, my God. What's going on here?" But the large language model doesn't care. Therefore, activity schemas, or these other non-standard data models, data models that aren't as popular with the classic BI tools, will be in ascendance as we build for large language models.
I would take that one step further, though. I think that the engineering problem will no longer be how to engineer the data models. It will be: how do I engineer a data analyst? Using the large language model as a primitive, plus my own understanding of the business, the context, and the tools available to me, how would I synthesize an actual data analyst? The persistent problem in data for a long time has been the shoulder-tap phenomenon: "Hey, can you just pull some data for me on this thing real quick, data analyst person?"
It's super annoying and data analysts hate it. It drives them crazy. They're basically a SQL generator that spits out Excel spreadsheets or whatever with data pulls. But the large language model is not going to care. The large language model is going to be super happy to do that. Literally nothing will make the large language model happier than spitting out some SQL and sending you an Excel file.
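As a rough sketch of the loop Josh describes, a model turning a shoulder-tap question into SQL and running it against the warehouse, here's a minimal example over an activity-style event table. The `generate_sql` function is a hypothetical stand-in for a real LLM call; a real system would send the question plus schema context to a model.

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Hypothetical stand-in for a large language model call.
    # A real system would prompt an LLM with the question and the
    # table schema and get SQL back; here we hard-code one answer.
    return ("SELECT channel, COUNT(*) AS messages FROM events "
            "GROUP BY channel ORDER BY messages DESC")

def answer(question: str, conn: sqlite3.Connection) -> list[dict]:
    # The "analyst" loop: generate SQL, run it, return tidy rows.
    sql = generate_sql(question)
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# Toy activity-schema-style table: one long, narrow event stream
# instead of a wide thousand-column model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, channel TEXT, actor TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("2023-05-01", "general", "a"),
    ("2023-05-01", "random", "b"),
    ("2023-05-02", "general", "c"),
])

rows = answer("Which channels get the most messages?", conn)
print(rows)  # [{'channel': 'general', 'messages': 2}, {'channel': 'random', 'messages': 1}]
```

The interesting engineering is everything around that stub: feeding the model the business context and schema, validating the SQL it emits, and handing the result back in whatever form the shoulder-tapper wanted.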
I think that engineering a system to do that, with the business context, with the data itself, with the tools available, is the next grand challenge of data engineering. I think someone's going to figure out how to do it. Similarly, for machine learning models, my historical experience with machine learning has been similar to yours. I've worked on ad click prediction systems, recommender systems, fraud detection systems, spam detection systems. Very classifier-heavy stuff where there's lots of feature engineering, [inaudible 00:39:23], and "Let's gather more data, let's run it through XGBoost, and let's go do our thing."
I'm imagining that same workflow applying here, where the large language model is yet another machine learning engineer and we're giving it these classification problems. It's sitting there cranking stuff out, trying out features and different weights and different algorithms. That's what it does all day, quite possibly reporting back to me in a notebook environment about what exactly it's been up to. That sort of stuff.
That, for me, is a weird thing to say, I think, but it's data engineering and ML engineering moving from empowering data analysts, or making them productive, to essentially replacing a large part of the most tedious and awful part of their job with a software system built to do that role. That's thing one that's interesting.
Thing two that's interesting to me is how this changes the landscape. I have this mental model in my head when it comes to the data tools market, which is what I spend most of my time thinking about. In the data tools market, there's ingestion: the Airbytes and the Fivetrans, the Meltanos, that get data from source systems and move it into a data warehouse.
And then there are the BI tools: your Mode and your Tableau and your Looker and all those things. And in the middle, the absolute star of the show, the sun that everything else revolves around, is the cloud data warehouse. It is the Snowflake, BigQuery, Redshift. I said "Big Shift," that's fantastic. BigQuery, Redshift, Postgres, whatever else it is you like to use.
I can't help but wonder how that model works going forward. If I want to inject a large language model into this stack of ingestion, cloud data warehouse, and BI, where does it go? It seems obvious to me that it goes in the BI layer, the presentation and analysis layer. That's where it goes. And once it's there, what happens to the rest of the stack?
I feel like the center of gravity, the most important system for so long, has been the cloud data warehouse. But I think that LLMs end up shifting that over to the BI side of things. The BI tool becomes the most important thing in the world. Maybe I don't care so much anymore about how the data is modeled or what the back-end cloud data storage system is, because I don't know anything about that. It's not designed for me anymore. It's designed for the large language model that's powering the BI tool.
To me, the question is: what about the stack has to change in order to let us effectively incorporate large language models and make them more effective at their jobs? Again, I don't know, but I think it's going to shift the balance of power toward the BI side and away from the cloud data warehouses. That's my rough sense.
I think that makes a lot of sense. You could argue that that's where the final business value comes from. Just having stuff deployed doesn't inherently drive value; it's the things you do with it. It always made sense that eventually the warehouse falls into a commodity category.
It does, but it hasn't so far. And [crosstalk 00:42:31] has done a fantastic job of not letting that happen, by being better at integrating storage and compute together to handle large volumes of data and do all this great stuff. You need something as disruptive as large language models to shift the balance of power here. It's up there with the mobile phone changing the way we develop software. It has to be something of that magnitude. Otherwise, things would just go on the way they always have, more or less.
I agree. I think that's what we're seeing. I think another company, Databricks, has always done a very good job of moving up the solution stack. They've done a lot to make sure they're not just Spark and that they get as close as they can to being the actual... I don't know if they have a BI tool, but I wouldn't be surprised if they did [crosstalk 00:43:18].
They have a notebook-based system, of course. Of course they do. You need notebooks in Databricks, absolutely.
Now [inaudible 00:43:26] Dolly, or what's their...
Dolly. Yeah, Dolly is their... Absolutely.
Yeah, it's Dolly. Now that I'm thinking of Dolly, I immediately thought of the old DALL-E, the image one. That's how fast gen AI moves, I guess.
DALL-E, that's super great. That's super funny. I've been reading about Databricks' Dolly, and I didn't even make that connection. I thought they were doing it as a reference to the sheep that was cloned or whatever.
I think they are. I've never, ever made that reference. Right now, when I said it, I was like, "Did I say that wrong?" I think I've never thought about it while saying it. I just-
I didn't either. That's fantastic. I didn't get that.
I think that makes a ton of sense. It just flew up.
Databricks, to me, is... I think this is the other great question here. I love seeing the Dolly stuff that Databricks is doing. Databricks is very much incentivized to live in a world in which everyone trains their own large language models. That is a much better outcome for Databricks than a world in which there is only one model: GPT-7 or whatever, and that's it. That's the only model, and everyone uses GPT-7 to do everything.
So it's like, I get where they're coming from. Going that approach 100% makes sense for their business and all that good stuff. It's hard for me to know right now whether there will be one model to rule them all or whether everyone will have their own model [crosstalk 00:44:42].
I think we'll find out once the economics come into play. Once you can't just raise however many billions of dollars, you have to start cashing in a bit and actually charge what it costs. I'm sure there's a handful of people in the world who know, but I don't know how much each query costs OpenAI. It's probably a lot. It's probably a lot more than we think.
I'm sure it is not cheap. It seems like these days the scarce commodity is not money, obviously. The scarce commodity is literally GPUs. There was a good article about that in The Information. Just getting the GPUs is physically difficult to do right now, unfortunately.
Yeah, we'll start trading GPUs for food.
It honestly wouldn't shock me to see that. Yeah, exactly. Precisely.
Awesome. Well, Josh, I feel like we could keep going for a long time, but I do need to cut this off.
You have a job. You have things to do.
Exactly. Yeah, totally. I understand.
My driver will be here soon, but it's been great talking with you, and maybe we'll bring you back on for another one of these. Maybe we'll do a "where did we end up on this?" in however long.
Totally. I would love to listen to this conversation in six months or a year and just absolutely cringe at how wrong I was about it.
Well, thanks again for hopping on. It's been a pleasure.
Thanks for having me, man. This was fun. I appreciate it.