Summary: Bettering Human Health Through Artificial Intelligence with Sean McClain and Joshua Meier of Absci (YouTube)
11,520 words - YouTube video transcript
n/a Welcome to FYI, the For Your Innovation podcast. This show offers an intellectual discussion on technologically enabled disruption, because investing in innovation starts with understanding it. To learn more, visit ark-invest.com.
n/a ARK Invest is a registered investment adviser focused on investing in disruptive innovation. This podcast is for informational purposes only and should not be relied upon as a basis for investment decisions. It does not constitute, either explicitly or implicitly, any provision of services or products by ARK. All statements made regarding companies or securities are strictly beliefs and points of view held by ARK or podcast guests and are not endorsements or recommendations by ARK to buy, sell, or hold any security. Clients of ARK Investment Management may maintain positions in the securities discussed in this podcast.
Simon My name is Simon, ARK's director of life sciences research. And today, we'll be discussing Absci, a public company harnessing generative AI to create more effective medicines faster and less expensively. I'm joined by Sean McClain, Absci's founder and CEO, as well as Joshua Meier, Absci's chief AI officer. Thanks for taking the time, guys.
Sean McClain Yeah. Absolutely. Thanks so much for having us here, Simon.
Simon So, you know, our audience hasn't had the luxury, like me, of getting to know you guys beforehand. So before we dive into the company, Sean and Josh, would you mind briefly telling us a little bit about yourselves, and also why you're so passionate about the role of AI in drug development?
Sean McClain Yeah. Absolutely. So I founded the company 12 years ago, actually, in a basement lab, and the original idea was not applying generative AI to biologic drug discovery. It was actually to engineer E. coli to produce antibodies. And we were the first company to be able to produce an antibody in E. coli.
Sean McClain And what that actually enabled you to do was what's called a pooled approach. You can basically take a billion-member antibody library, put it into a test tube of E. coli, and now you have a billion antibodies produced, and this gave us a huge data advantage, being able to essentially produce these and screen these at very, very high throughput. And it ended up becoming a realization to me about 3 years ago that, wow, this is what is needed for generative AI to really unlock biologic drug discovery, or what we like to call drug creation. It's essentially going from this paradigm of drug discovery, where you're looking for a needle in the haystack, to drug creation, where you're actually creating the needle.
Sean McClain And in our case, it's the biologic: being able to get the biologic, or the antibody, with all the attributes you want the first go-around.
Simon Mhmm. So I think a lot of people are probably familiar with the terms antibody and antigen and understand that antibodies are a really critical component of the body's immune system and help us to, you know, attack and defend against disease. But for those who may not have a very intimate understanding of, like, what the drug discovery and development processes look like specifically for antibodies, which is, you know, what you guys are focused on, would you mind kind of briefly zooming in and discussing the importance of the things you talked about, like with yeast, you know, generating these libraries? I think that would help people understand a little bit more context.
Sean McClain Yeah. No. Absolutely. So unlike, like, a small molecule or a pill in the bottle where you have a chemist making it, you have to have a living organism make an antibody, and that's for the production. But then to discover an antibody, you then have to use usually a mouse, or what's called phage display or yeast display.
Sean McClain I'll focus in on the mouse. Regeneron was really the pioneer in creating a humanized mouse where you essentially could take, let's say, a cancer target or antigen of interest, and you inject it into the mouse. The mouse uses its immune system to then create antibodies towards that target. And then you extract out the blood and you're able to then find antibodies to a given target. But the issue with that is that you can't tell a mouse to generate an antibody that hits the specific location of the target that you want, that has the affinity, or how tightly it binds to a target, or the developability or manufacturability.
Sean McClain Like, you have no control over how a mouse generates an antibody. And that's what really leads to these, you know, long lead times to get into the clinic, as well as, you know, low success rates in the clinic of less than 4%. And again, what we're doing is completely changing that paradigm and using AI to design the attributes you want the first go-around and actually have control over the biology for the first time.
Simon Mhmm. So before I kick it over to Josh, now that it seems like we've fixed the Wi-Fi, I do wanna double underline something you said there, Sean, because I think it's gonna come up time and time again as we talk about this: when you're trying to develop an antibody against a particular antigen or disease target, regardless of the disease, the idea is that you're exploiting the specific binding reaction, and you used the word affinity, between those two things, almost like a lock and a key. And I like that you walked through kind of the first vestiges of this approach with using animal models and immunization, which, you know, has its pros and cons. But moving further downstream, and we'll talk about the in vitro, like the yeast display and E. coli and the bacterial display technology, and then the ultimate sort of golden goose of how much of this can we just do on a computer. Right?
Simon Yep. And so we'll get into that. But before we do, Josh, I'm gonna flip it over to you just for, like, a brief intro on yourself, how you got involved with Absci, and why you're so passionate about, you know, the role of AI in antibody discovery and development.
Joshua Meier Yeah. Absolutely. Thanks for having us, Simon. So a bit about my background. I've been working at this intersection of AI and biology really before this thing became cool. It sounds like everyone these days has kind of established that, like, this is going to take over the industry.
Joshua Meier But, you know, when I first started working on this, everyone thought it was crazy. Like, why are you betting your career on, like, AI for biology? Just go work on, like, traditional AI; even that was big then. I started training my first neural networks back in 2013. This is when we first got deep learning really working on GPUs, and this space started to explode.
Joshua Meier But I come from a family of doctors and was always excited to kind of deploy this technology to just better human health. So I was always trying to find an intersection, but the technology wasn't there yet. AI wasn't good enough. The data wasn't good enough, and the problem space wasn't really worked out yet. But fast forward a couple years later, I was working at OpenAI. This was around the time of GPT-1.
Joshua Meier So we were first seeing, like, the first signs of life that language models could learn some really interesting things. And, you know, of course, at OpenAI, we were kinda like the believers of this stuff. We felt that everything we're seeing today was going to happen eventually. But the application I was really excited about then was, like, well, if we can get this thing to, you know, translate between languages or generate new text, could you use this to generate new biological text? Right?
Joshua Meier Can you just output DNA sequences and protein sequences? I felt that'd be even bigger than just the NLP stuff. Because if you just think about, you know, NLP, you and I can just write stuff on a computer, but we can't sit down and start typing out DNA sequences. You're not gonna get anything interesting. So I left OpenAI shortly after that GPT-1 project to go join Meta and help start up an AI for science initiative there.
Joshua Meier That's where we did this first work on language models on protein sequences, published some of the first papers in this area showing that these models could learn some really interesting things. And a couple of years into that, really, the thing that was clearly missing is that, you know, when you're working in a Facebook or a Meta, you have really strong conviction in AI, which allows you to go into these areas like AI for science and AI for biology. But the thing that you're then missing is that wet lab component. So how do you actually validate these designs? How do you build differentiated training data?
Joshua Meier And it became clear that this stuff was going to work for biology, and who was actually gonna reap the value? I felt that at some point, it wasn't going to be, you know, Meta, as much as it was a fantastic place to do this kind of research. Sort of aligning with a very forward-looking biotech company that was also going through this generative AI transformation made a lot of sense. So to that point, I kinda connected with Sean, and these visions really started to collide, and I, you know, joined Absci, and it's been pretty amazing to see the kind of research and science that can happen in such a short timeframe when you really bring these two differentiated edges together.
Sean McClain Yeah. Hey. And it's actually a really funny story how Joshua and I met. The team had put together a list of, you know, probably the top 50 AI researchers in the space and gave it to me. And I went on LinkedIn, and I, you know, wrote emails to all of them.
Sean McClain And Joshua responded. And, yeah, he and I totally clicked on the vision and what we could do together, and, you know, I would say the rest is history.
Simon Absolutely. Well, I wanted to maybe segue back to the main conversation around, you know, this point that you're making, Josh, around the data and what you're able to do at a big tech company versus a company that's, you know, trying to combine multiple disciplines, which honestly feels like where the field of life sciences is, like, inexorably headed; it's like a complete breakdown of the walls between AI and biology. And, you know, if you look at some of the large language models that are being used with human language, you can train them on enormous amounts of information, ostensibly the whole Internet and all the text therein. But, you know, the issue with life sciences is, like, a lot of the data generation techniques have been artisanal, with, you know, poor quality control. Databases are fragmented and poorly annotated.
Simon And, sure, we're improving along all these dimensions, especially with sequencing. Right? And, like, the breakout explosion and cost decline of DNA sequencing and, you know, other kind of ancillary technologies. But I wanted to focus the conversation on data for a moment and talk about it in the context of Absci. Sean, you mentioned billions of data points with E. coli expression.
Simon And, Josh, you know, you and I have had this previous conversation around data being the rate limiting step in training these models. So I wanted to discuss both of those points together and really focus the conversation back on Absci and what you've done between in vitro and in silico work.
Joshua Meier Yeah. Sean, why don't you kick it off as the one who really invented, like, the core microbial platform we're using here?
Sean McClain Absolutely. Yeah. As I had, you know, previously talked about, we were the first to engineer a very simple organism, E. coli, to produce an antibody. And I guess stepping back for just one moment: like, how are antibodies, you know, currently produced?
Sean McClain They're produced in mammalian cells, or CHO cells, and you can really only scale that up to maybe producing, you know, 1,000 or tens of thousands of antibodies in a given week. And that's just not enough throughput or data to actually start training models. And so by being in a microbial organism and engineering E. coli to produce antibodies, what you can do is what's called a pooled approach. You can basically take a test tube of your engineered E. coli, take a billion-member DNA antibody library that encodes a billion different antibodies, throw that into your test tube, and now you have every single E. coli making a different antibody.
Sean McClain So in that single test tube, I now have a billion antibodies that have been produced. So I've gone from, you know, 1,000 or tens of thousands of antibodies to billions. Now the second question you then, you know, have to ask or solve is the functionality. Now that we've produced them, what is the functionality or potential efficacy of these? And this is where we developed this really breakthrough assay that we call our ACE assay.
Sean McClain We're able to interrogate every single E. coli and look at the binding affinity, or how tightly it binds to a target of interest. And so, in that experiment, we can then be able to look at, you know, billions of protein-protein interaction data points. And when you're developing an antibody, there are really two important aspects, kind of outside of the developability and manufacturability. It's where does the antibody bind, the location, or, you know, what we refer to as the epitope? And then, you know, how tightly is it binding?
Sean McClain Does it have high affinity or low affinity? And these are the two attributes we're really able to hone in on very rapidly and train our AI models on. And I'll let Joshua talk about this, but this data has allowed us to really build these extraordinary models, or train these models, that have allowed us to actually have a huge breakthrough in the industry, where we were the first to use generative AI to design an antibody from scratch on a computer. And, actually, I think this is a really great, you know, time to hand it over to Joshua on what that is and how we take this data and train our models to ultimately kinda see our big vision through of being able to design a biologic at the click of a button.
Joshua Meier Sure. Thanks, Sean. So if we look at how we're using data at Absci, and even just taking a step back, first of all, on the importance of data in this space, I think the whole language modeling world is waking up to this today. You look at something like ChatGPT and GPT-4. One of the big things that's advertised there is something called reinforcement learning from human feedback.
Joshua Meier And the thing really to think about there is the human feedback part, where you can go and scrape, you know, basically infinite amounts of data from the Internet. Although some would say it's not really infinite; we're almost running out of tokens now to feed these models as we're continuing to scale. But if you just think about that human feedback, it's really critical to give these models this very, like, chat-like capability when you're training it on, you know, people actually interfacing with the model and teaching it in a very direct manner. The data is even more impactful than just finding random data on the Internet, because the model is involved in that data collection process.
Joshua Meier So in that view, this is the view we've had at Absci for a couple of years now. And we've really built up the experimental platform and AI integration accordingly. So, specifically, what that means is that almost all the data we're collecting in the lab is actually AI designed. So if we go in the lab and we're gonna go collect 100,000 or 1,000,000 data points, 1,000,000 unique sequences, these aren't just random sequences that you, like, find on the Internet or find in a mouse immunization campaign. These are sequences that the AI model is designing, or rather the AI scientist is designing.
Joshua Meier Sometimes there are, like, different, you know, benchmarks or baselines that you wanna throw in there, different controls. But at a high level, it's the machine learning model that is actually creating that data, telling us what data it wants us to collect. And it's very similar then to this reinforcement learning from human feedback idea. I mean, on a technical level, it's not exactly the same, but at a high level, it's the same sort of finding that if you allow the model to help you generate the data, you actually end up with a really nice flywheel there, where then you get that data back, the model becomes smarter, and then you can do this again. So that's really allowed us, I think, to scale the model really quickly.
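For readers who want to see the shape of that flywheel in code, here is a toy sketch of the design, measure, retrain loop Joshua describes. Everything in it is a hypothetical stand-in: `toy_assay` plays the role of a wet-lab measurement (the real one is the ACE assay) and `propose` plays the role of the generative model; neither reflects Absci's actual implementation.

```python
import random

random.seed(0)

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_assay(seq):
    """Stand-in for a wet-lab label: here, just the fraction of W
    residues. In reality this would be a binding score from an assay."""
    return seq.count("W") / len(seq)

def propose(labeled, n, length=12):
    """Stand-in for a generative model: mutate the best sequence seen
    so far, or sample randomly when no data exists yet."""
    if not labeled:
        return ["".join(random.choice(ALPHABET) for _ in range(length))
                for _ in range(n)]
    best = max(labeled, key=labeled.get)
    out = []
    for _ in range(n):
        s = list(best)
        s[random.randrange(length)] = random.choice(ALPHABET)
        out.append("".join(s))
    return out

# The flywheel: the model designs the data it wants, the lab labels it,
# and the growing dataset informs the next round of designs.
dataset = {}
for _ in range(3):
    for seq in propose(dataset, n=50):
        dataset[seq] = toy_assay(seq)  # "wet lab" labels the AI designs

best_score = max(dataset.values())
```

The point of the sketch is only the loop structure: each round's designs are conditioned on all previously measured data, which is the model-in-the-loop data collection Joshua compares to RLHF.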
Joshua Meier It's also allowed us to just run a massive number of experiments and see what works. Machine learning is a very empirical field. A lot of people used to refer to deep learning as black magic, where, you know, you would just have AI scientists who just have some intuition about how the models work, and that's still in a sense how you come up with the next generation of these architectures, like the transformer that people are scaling these days in NLP. Like, how did people come up with that?
Joshua Meier I mean, they can kinda give you reasons about it, but at the end of the day, there's just really strong intuition that goes into this that's built up through just a lot of time spent training these models and evaluating them. That's where this experimental feedback loop is also critical. So the AI scientists can be doing dozens of experiments in a month, take different kinds of models that they're training, test all of them in the lab, and really start to develop insight and intuition about what's working and what isn't. And I would also credit that as one of the fuels to many of our recent successes.
Simon If I distill it down to a couple of key points here, like, you know, you have to solve for a few things to make this type of project work. The first is, you know, you're generating a ton of data. You've created an in vitro technique to generate a ton, you know, billions of data points. The second half of it is you have to actually create, you know, features and labels and get that functional data out, you know, for every part of that library. So you've done that with the ACE assay.
Simon And then you're describing, and I actually wanted to dig into a nuanced point here because I'm not sure if it comes across every time, which is that you're getting data on every different, you know, combination or perturbation that you're having in these libraries, not just the top decile, you know, high affinity binders. Like, in other types of display technologies, you know, you're physically, like, washing away all the things that don't stick because they don't stick. And, you know, maybe this is the wrong way to think about it, but you're kind of biasing that dataset towards only the things that work. And if you're talking about reinforcement learning and, like, you know, penalties and loss, I wanna get into that as well to try to understand the importance of actually collecting the whole thing, not just, you know, the fraction that works.
Joshua Meier Yeah. That's absolutely right. So we've actually developed a number of variations to our core ACE assay, based on that microbial system that Sean introduced before. And we can run the assay in a number of ways. So one way we can run it is, similar to past techniques, in a binary way, where we're just looking at whether a sequence is binding or not binding.
Joshua Meier And what's really nice about that is there are some cases where you just really wanna profile hit rates in a very accurate way. We find that when we run the assay that way, the precision-recall of that assay compared to, like, lower throughput but kinda gold standard assays is over 95%. So that means that the information we're getting out there is highly reliable for us to compare the hit rates of different models to each other. But then we've also developed an alternative way of screening, which actually gives us quantitative information. This is something that's very difficult to get with a traditional phage display or a yeast display, at least to do it in a batched way.
Joshua Meier What this means is we can go screen, you know, hundreds of thousands of sequences, and then we can get a quantitative label for each of those sequences, a score, and that score correlates very well with gold standard measurements of affinity. We're seeing Pearson and Spearman correlations above 0.8 in most cases. So taken together, these are really the two tools that you want for antibody engineering. First, you wanna identify a binder, and you wanna be able to profile, like, what fraction of your sequences are actually binders. And then among those binders, you wanna be able to profile the affinity in a highly quantitative way, in an also very, like, robust and accurate manner as well.
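The two readouts Joshua names here, precision/recall for the binary binder calls and rank correlation (Spearman) for the quantitative scores, are standard metrics and straightforward to compute. A minimal pure-Python sketch follows; the Spearman helper assumes no tied values, and a real analysis would typically use scipy.stats instead:

```python
def precision_recall(predicted, actual):
    """predicted: binary calls from a high-throughput assay;
    actual: gold-standard binary labels for the same sequences."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def spearman(xs, ys):
    """Spearman rank correlation (no ties assumed): Pearson
    correlation computed on the ranks instead of the raw values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0.0] * len(values)
        for rank, i in enumerate(order):
            out[i] = float(rank)
        return out
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var
```

A score series that ranks sequences in the same order as gold-standard affinity measurements gives a Spearman of 1.0, which is the sense in which "above 0.8 in most cases" indicates strong agreement.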
Joshua Meier There's often a trade-off between throughput and accuracy, and I think we're at a very sweet spot with the assays that we've tuned over time, to be able to get very meaningful information for the models.
Sean McClain Yeah. And I think this is a really important point that, Joshua, you know, you hit on. It's not just using this wet lab technology to get data to train the models, but it's the validation as well. There are a lot of manuscripts that come out that don't show, you know, wet lab validation.
Sean McClain But every single model that we design and train on, we then go into the wet lab and validate. And we can, you know, validate roughly 3,000,000 AI-generated designs in a given week. So, you know, training on billions, and then validating on millions, and then being able to have a cycle time in which you can go through all that in a six-week time period just allows you to make very, very rapid progress in biology in a way that really hasn't been done before.
Simon Before I get into some of the manuscripts, which I'm eager to talk about, I did wanna ask just this general question. I'm gonna lean a little bit on a blog that a friend of mine, Pablo Lubroth, wrote, I think last year, about KPIs in AI-enabled drug discovery. You know? And it pulled a lot of analogies from the SaaS industry. Right?
Simon Like, all these investors that are looking at SaaS companies, there's like a lingua franca of, okay, we'll use terms like ARR, and we have, like, very rigid definitions of, like, what everyone's managing to, on both the investment side and the entrepreneur and founder side. And as the space matures, you know, as an investor, and I'm sure a lot of people are in the same position, like, you get fatigued hearing about every possible mash-up of AI and drug discovery, because there are a lot of them. Right? And so I appreciate the point you're making, Sean, about the importance of wet lab, you know, gold standard validation as, like, a key part of that. But I wanted to ask, like, what are some red flags and some green flags to you when you're thinking about AI and drug discovery coming together? What are the things where you're like, oh, this is, you know, legitimately differentiated, or there's some value here?
Simon And also, to the extent that you're able to talk about it, I'd love to know what are the KPIs that matter to you as you're tracking your own progress against, you know, this ability to generate fully in silico antibodies without clear, you know, commercialization events or contracts. Like, what are those metrics?
Sean McClain Yeah. Absolutely. I would say that there are, like, three key metrics that we look at. Well, you know, first is, you know, being able to specifically hit an epitope that you want. So, you know, again, an area of the target that's of interest, and being able to hit that and not having, you know, any sort of polyspecificity towards anything else.
Sean McClain The second is then being able to have the accuracy of the model be good enough to predict the exact binding affinity that you want. You know, let's say, in this, you know, particular instance, to get the biology you wanted, you wanted a medium binder. You can have the model generate that for you. The third aspect is, you know, I totally lost my train of thought on the third aspect, but I will hand it over to Joshua, if you agree on at least those two kinda major areas of, you know, epitope specificity, as well as being able to, you know, just predict the affinity.
Sean McClain And, actually, the third thing that I was gonna mention was the accuracy of the model. So can we hit the epitope 25% of the time? You know, 25% of what is coming out of the model is hitting that epitope, you know, and, you know, ultimately increasing that over time. So being able to get up to greater than 90%, you know, accuracy is really where we wanna be.
Joshua Meier Yeah. So those are exactly some of the problems that we're focused on at Absci right now. I think one of the exciting things about this space is that it's moving very quickly, and those KPIs will definitely continue to update over time as well. Right? So it's not like, you know, we're talking about a business setting.
Joshua Meier Right? There's, like, some very clear metric that you wanna go after, like, what is your revenue for a SaaS business? I think in AI drug discovery, you have to be more creative than that. You have to think through what are sort of the unmet needs that your application can bring in, and just be laser focused on solving for those. And then once you solve those, then you kind of move on to the next problems afterwards.
Joshua Meier So, like Sean mentioned, some of the things that you just can't really do with existing assays are things like epitope specificity. It's something that makes a lot of sense for an AI model, because you can think of it as, like, prompting. You can prompt the model, like in a ChatGPT sort of setting, with a specific epitope that you wanna hit, or potentially other properties that you want your molecules to sort of fit. Another thing is about accuracy as well.
Joshua Meier So you want a model that's very calibrated. One of the things that we see with our models is a phenomenon that we're calling hit rate decay. As you screen more and more sequences from the models, the accuracy starts to go down. You know, we're screening hundreds of thousands of sequences here, and we're like, wait. You know?
Joshua Meier At some point, doesn't the model run out of binders? You know, is it in the first couple the model is pulling out, or is it all the way through the end? Do you really need to screen 100,000 in order to discover a binder? It turns out the answer today is no. Like, we don't have to do that anymore, because the model is actually giving us sequences from the first, let's say, 10 or first 100 that you're pulling out of the model.
Joshua Meier So that's, for example, another KPI that we look for, and we've put a lot of thought into the kinds of metrics and statistics that we've developed to sort of evaluate our models in that way. So this is another way that we're able to really benchmark our models against each other and, therefore, you know, stay laser focused on making progress here.
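The hit-rate-decay idea Joshua describes can be made concrete with a few lines of code: take the binder calls in the order the model ranked its own designs, and compute the hit rate bucket by bucket. This is a generic sketch of that kind of metric, not Absci's internal statistic:

```python
def hit_rate_decay(hits, bucket=10):
    """hits: 0/1 binder calls, ordered by the model's own ranking of
    its designs. Returns the hit rate within each successive bucket;
    a well-calibrated model shows high early rates that decay slowly."""
    rates = []
    for i in range(0, len(hits), bucket):
        chunk = hits[i:i + bucket]
        rates.append(sum(chunk) / len(chunk))
    return rates
```

For example, a model whose first ten designs all bind, whose next ten bind half the time, and whose last ten never bind produces the curve `[1.0, 0.5, 0.0]`; a model with a flat curve near the library's background rate has learned little about ranking.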
Sean McClain And the other thing I'll mention too is that we're not just developing, you know, AI within biologic drug discovery for the sake of it. Like, we're really wanting to utilize this to be able to discover, you know, new biologies. You know, one of the things that we're really excited about is actually utilizing this sort of technology to generate antibodies towards GPCRs, which are notoriously hard to drug with standard immunization or phage display approaches, and being able to specifically target, you know, the epitope on a GPCR. This really starts to unlock new biology and new targets, which is ultimately gonna be, you know, best for patients. And, you know, it becomes highly differentiated, which we're really excited about.
Sean McClain And I think that's what you're gonna start to see more and more: generative AI unlocking new biology in a faster way than we've ever seen before.
Simon And so earlier, Sean, you mentioned this concept of humanization when you were talking about, I think, a transgenic mouse model. I wanted to blow up this point about humanization as well, because we've talked a lot about affinity. And, of course, there are multiple things that can go wrong, you know, even if you have an excellent binder. I wanted to talk about this a little bit because it's the subject of one of your papers too, on this concept of naturalness, which is, like, forgive me if I mischaracterize it, but it's sort of like an ensemble of a lot of different, you know, key aspects of what it means to have an antibody that is developable, meaning it can be manufactured, it's hopefully rid of downstream liabilities.
Simon You know, a human body is able to take it in without any sort of, you know, unwanted immunogenic, you know, side reaction. So I wanted to learn about this, like, multilayer Swiss cheese model of optimization, you know, past just affinity. And maybe a little bit more of a technical add-on to that question is, are these optimizations happening in parallel, or is it more sequential? Right? Does that make sense?
Simon Like, you you know, you're starting and kind of going downstream, so I'd love to know more about that.
Sean McClain Yeah. No. Absolutely. So if you just look at an antibody sequence, like, there are more sequence variants, or drugs you could design, than there are atoms in the universe.
Sean McClain And so the search space is ginormous. And, you know, you look at, like, evolution: like, a mouse or humans evolved to have a particular immune system, and they're gonna have a particular immune repertoire where they design certain, you know, antibodies that, you know, don't have an immunogenic response. And essentially, you know, that's what humanization is: making sure that the antibody you design isn't gonna be targeted by the immune system and have what's called an anti-drug antibody response. And when you look at, again, the overall possibilities and our ability now to start to search that space, you start to become concerned with, you know, immunogenicity. Like, is what the model is designing these kinds of sequences that have the same functionality, but are very different from, you know, what looks like a normal antibody?
Sean McClain So that's when we started to build out this naturalness model, which I'll have Joshua kinda talk about. But what we showed with this model is that it's inversely correlated to immunogenicity and has pretty good correlation to developability and manufacturability. And so you're ensuring not only that you're getting, you know, the functionality out of the antibody you want, but you're also ensuring that you have low immunogenicity, or it's as human-like as possible, when going into the clinic. Because there are clinical trials that do fail due to high immunogenicity.
Sean McClain And so being able to control for both of these is a really important aspect. I'll let Joshua dive into how we went about doing this, and whether we do it in parallel, at the same time, or sequentially.
Joshua Meier Yeah. So when we think about the naturalness model, the key insight we were building toward is: how can we take this universe of antibodies that we know about and use that information to distill the key factors that make an antibody so-called natural? That's really what the naturalness score is. The way we trained the model is we took hundreds of millions of antibody sequences that you find within humans, within animals, really naturally occurring. And then we ask the model to give us a score: given a new sequence, what is essentially the likelihood that you would see that sequence within one of these natural immune repertoires?
Joshua Meier And it turns out, and it's kind of intuitive that this is the case, that if the model thinks a sequence is more likely to have been found in immune repertoires, then it's more likely to have all of the properties Sean was talking about before, like the developability properties and low immunogenicity. And the reason we built the model this way is that there is significantly more immune repertoire data you can get access to than there is developability and immunogenicity data. So on a fundamental level, you can think of the model as bootstrapping from all the information that's available here. It's a very similar insight to what you observe with something like GPT-3 or ChatGPT, right, where you pretrain the model on lots of information and the model learns the real semantic rules that go into language. We're doing the same thing here for the rules that go into an antibody.
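The "likelihood under natural repertoires" idea can be sketched in miniature. This toy uses a simple per-position frequency model fit on a handful of made-up sequences; Absci's actual naturalness model is a large learned language model, so everything below only illustrates the scoring concept, not their method:

```python
import math
from collections import Counter

def fit_position_model(repertoire):
    """Estimate per-position amino-acid frequencies with add-one smoothing."""
    length = len(repertoire[0])
    alphabet = sorted(set("".join(repertoire)))
    model = []
    for i in range(length):
        counts = Counter(seq[i] for seq in repertoire)
        total = sum(counts.values()) + len(alphabet)
        model.append({aa: (counts.get(aa, 0) + 1) / total for aa in alphabet})
    return model

def naturalness(seq, model):
    """Mean log-likelihood: higher means 'more like' the training repertoire."""
    return sum(math.log(pos.get(aa, 1e-9)) for pos, aa in zip(model, seq)) / len(seq)

repertoire = ["CARDY", "CARDF", "CAKDY", "CARDY"]  # toy 'natural' sequences
model = fit_position_model(repertoire)
# A repertoire-like sequence scores higher than an arbitrary one:
print(naturalness("CARDY", model) > naturalness("WWWWW", model))  # -> True
```

The key property the toy preserves is that sequences resembling the training repertoire score higher than arbitrary ones, which is how a likelihood-style score can rank candidates without any labeled immunogenicity data.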
Simon Okay. And I imagine, if you're working through some multi-parametric optimization, you wanna constrain the space as much as you can with the early steps. So is what you're doing basically, before you even get down to what could be a good binder or not, throwing out all the things we know are gonna be highly toxic, or maybe have a premature stop codon or something that would truncate the protein? Where along the process are these different steps working, I guess, is what I'm curious about.
Joshua Meier So for each one of these things we're building, there are actually multiple ways we can combine them together, depending on the problem we're trying to solve. It goes back to your point earlier about a KPI, for example. You need to be very thoughtful about what you're applying this technology to. One of the issues I think the field has broadly is that you have a lot of smart AI people building hammers and looking for nails. And as we all know, if you start with the nail, as the saying goes, it's gonna be easier to find the solution.
Joshua Meier So when we think about combining these together, it really depends on the problem we're trying to solve. Usually, in the case of a drug discovery campaign, you bring in naturalness as a way to select the molecules that are most interesting to you. We're at a point now where we can take our AI models and come up with hundreds, even thousands, of potential sequences that could be brought forward for preclinical testing. And the question is, how do you prioritize among those hundreds of different molecules? That's where we wanna bring in this information about naturalness, for example. That's one way we use it.
Joshua Meier Another way we can use it is to bring all these properties together from the start. So maybe you have some lead, and what you'd like to do is optimize that lead for various properties. Maybe you don't have any sequence that has the affinity or naturalness profile you'd like, and you can just co-optimize for all these properties together. So that's another one of the ways you might use this information.
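One simple way to picture the prioritization step Joshua describes is Pareto filtering over per-candidate scores. The candidate names and score values below are invented for illustration; nothing here reflects Absci's actual ranking pipeline:

```python
# Keep the Pareto-optimal set: candidates that no other candidate beats
# on both predicted affinity and naturalness (higher is better for both).

def pareto_front(candidates):
    """candidates: list of (name, affinity, naturalness) tuples."""
    front = []
    for name, aff, nat in candidates:
        dominated = any(
            a >= aff and n >= nat and (a > aff or n > nat)
            for _, a, n in candidates
        )
        if not dominated:
            front.append(name)
    return front

candidates = [
    ("ab1", 0.9, 0.2),  # strong binder, unnatural
    ("ab2", 0.7, 0.8),  # good on both axes
    ("ab3", 0.6, 0.7),  # dominated by ab2 on both axes
    ("ab4", 0.3, 0.9),  # weak binder, very natural
]
print(pareto_front(candidates))  # -> ['ab1', 'ab2', 'ab4']
```

A weighted sum of the scores is the other common approach; the Pareto front is attractive when you do not want to commit up front to a fixed trade-off between affinity and naturalness.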
Simon I imagine, once you get past affinity and specificity, you're just trying to systematically remove as many of the downstream gotchas and surprises that keep drugs from ultimately making it to commercialization. And we've talked a lot about some of these, I guess you could call them, known unknowns: we know what they are, but we just don't know where they rank or how dangerous they are. And I would imagine, over the years, people have always kind of had this, maybe it's hubris, but maybe it's also just strategy, or a way of thinking about the problem: you try to get rid of all these downstream question marks.
Simon And is there a set of just unknown unknowns? Things that we're not even really able to query or look for that could still be a surprise? I'll use a specific example to guide how I'm thinking about this: post-translational modifications that are not directly measured by DNA sequencing, or even in some cases by mass spec or protein sequencing. If those show up in a CDR or one of those active regions, you could still, I imagine, change the binding properties of an antibody. And I know there are some tools to predict those liabilities, but I'm generally just curious about the unknown unknowns in antibody development, and the places where you think no one's really paying attention to how to ensure these things actually make it all the way to the finish line.
Sean McClain Yeah. Absolutely. I mean, one thing that comes to mind is a lot of the developability attributes. Let's say you end up finding out that you can't get good enough viscosity to do sub-Q dosing, and so you're having to do an infusion instead of subcutaneous injections.
Sean McClain And it all comes back to data. That's actually one area where pharma has a lot of data: the developability attributes. We, for example, don't have the technology to scale up screening for viscosity in a high-throughput manner, so for us that could potentially be a blind spot. We ultimately get down the road, and we really wanna do sub-Q dosing, but we don't have the data to train our models to predict for that. And that's where some really interesting ideas come in that we've always had: how could some of the data that large pharma has, not necessarily on the drug itself, like the efficacy and the functionality, but on developability, be pooled? Could you develop a consortium where you take all of this data, for a greater good, and everyone has access to the models trained on it? That would really help with these developability attributes that are an unknown unknown for us when we get to the end, but where others have the data.
Sean McClain I think these are some interesting ideas we've had on how the industry can collaborate and form potentially interesting consortiums.
Simon Yeah. And I wanted to maybe take a little bit of a side step and touch on this topic. Earlier, Joshua, you were talking about large language models and the analogies to biology. The analogies are really beautiful. Think about proteins having essentially 20 canonical amino acids, while the English alphabet has 26 characters, and those characters are structured into words, and those words form paragraphs. There are a lot of similarities between those things. So I wanted to ask a question along the same sort of axis. We've been talking for the last 30, 40 minutes about antibodies, which are a subset of biologics, right, proteins. The models that you're building are certainly not limited to one particular application or one particular modality, and I would like to understand a little bit more about what's common and what's different when we're in our local neighborhood of proteins. And then, as we really zoom out, what modalities do you think are most amenable, or maybe the most recalcitrant, to in silico design?
Joshua Meier Yes. Are you thinking outside of biology or outside of antibodies more generally?
Simon Yeah. Just starting with antibodies, kind of zooming out a little bit to biologics, and then maybe taking a big step back and thinking about everything from small molecules to whatever else.
Joshua Meier Sure. So first, starting with antibodies: antibodies are really interesting because most of the binding is conferred by the CDRs, the complementarity-determining regions, in these antibodies. And that's something, it turns out, the model can learn extremely quickly, right, because it's very clear from the data.
Joshua Meier And that also means you have a very focused design task: designing those CDRs. You really wanna bake that into the task you're working on. You have these CDRs, and they're binding to a specific epitope. So we wanna provide the model, for example, with some structural information. This is a departure from, let's say, a traditional language model that works purely in sequence space.
Joshua Meier We're really thinking about how we bring in meaningful information about that target protein structure so we can really focus the model on it. I think this is something that's maybe more unique within the AI space. Right? This is not just an off-the-shelf GPT trained on biology sequences. This requires some deep thought about how you best apply AI to these problems.
Joshua Meier Now, when we zoom out a step: we're focused on antibodies, and they have a specific form. When you move to a general protein, you don't really have that information anymore, so you need the model to be able to represent it. I'll give you one of the challenges that comes up here if you take a general protein model and apply it to antibodies. If you look at a lot of the success of protein language models within general protein design, essentially what the models are learning to do, or at least what we speculate they're doing (it's hard to know anything for sure in deep learning), is taking a bunch of evolutionarily related sequences and using them to build some understanding of the protein world.
Joshua Meier This is the way people used to do, let's say, protein folding 10 or 20 years ago. You would take a given protein sequence, you would go enumerate the most similar proteins to it, and then you would start to do some statistics on top of that. You're like, okay.
Joshua Meier I see this position is always the same; that probably means this position is doing something really important. Or I see these positions co-vary; it's always A with B, or C with D, which probably means they're touching each other. Antibodies don't work that way. Antibodies are not the result of evolution.
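Those two classic signals, conservation and covariation, can be sketched on a toy multiple sequence alignment. Real methods (direct coupling analysis, for instance) are far more sophisticated; the alignment below is invented and only shows the raw statistics Joshua is alluding to:

```python
from collections import Counter
from itertools import combinations

msa = ["MKLAV", "MRLCV", "MKLAV", "MRLCV"]  # toy alignment of related sequences

def conservation(msa):
    """Fraction of sequences sharing the most common residue in each column."""
    return [Counter(col).most_common(1)[0][1] / len(msa) for col in zip(*msa)]

def covarying_pairs(msa):
    """Column pairs whose residues always change together (perfect covariation)."""
    cols = list(zip(*msa))
    pairs = []
    for i, j in combinations(range(len(cols)), 2):
        joint = set(zip(cols[i], cols[j]))
        # Knowing column i determines column j (and vice versa), and both vary.
        if len(joint) == len(set(cols[i])) == len(set(cols[j])) > 1:
            pairs.append((i, j))
    return pairs

print(conservation(msa))     # -> [1.0, 0.5, 1.0, 0.5, 1.0]
print(covarying_pairs(msa))  # -> [(1, 3)]
```

Conserved columns hint at functionally critical residues; covarying pairs hint at 3D contacts. As Joshua goes on to note, this evolutionary signal is largely absent for antibody CDRs, which is one reason general protein models transfer poorly to them.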
Joshua Meier It's not random pieces of dirt that we're finding around the world that evolved in different ways; it's created by the immune system. So you need to figure out general models that can really pick up on all that information. And I think that segues into how you apply this to an even broader picture, where the modalities really start to change. Right? If you wanted to have a model that brings in proteins and small molecules, or maybe you wanted to bring in something like protein dynamics, or you wanted to represent a whole cell or a whole physiology.
Joshua Meier Again, you need to be thoughtful about how the domain changes, and then about how to build the right biases into your models to pick up on all that information.
Simon And you mentioned this point about off-the-shelf models and reapplying them, and I wanted to give a reference frame. If we're talking about experimental ground truth, you know, for those proteins that we can crystallize and study with cryo-EM or X-ray crystallography, we're getting on the order of what in terms of accuracy? Like an angstrom, roughly?
Joshua Meier You're talking about protein folding, predicting the structure?
Simon Yeah. What I would love to do, for people who are thinking about how good these models are at predicting the general structure of a protein, is relate that to experimental ground truth and ask what the vector of improvement has been. And then I wanna flip it back over and ask this theoretical question: is there an asymptote? Is there a reason why they can't just blow right through that experimental plateau?
Joshua Meier Yeah. Okay. So first of all, what you're citing here, let's say in a protein folding setting: you have all these structures, and now you can use AI to predict those structures. For the average protein, you can get pretty close to experimental accuracy.
Joshua Meier Like, we're talking about an angstrom, as you just pointed out. Right? That's sort of what you can get in the lab anyway. So these things are very highly performing. Now, there are exceptions. Something like antibodies: it turns out the CDR loops I mentioned before, the ones that actually confer a lot of the binding, are actually very hard to model.
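The "angstrom" figure usually refers to RMSD, the root-mean-square deviation between predicted and experimentally solved atomic coordinates. Here is a minimal sketch with toy coordinates; it assumes the two structures are already optimally superimposed, an alignment step that real tools perform first:

```python
import math

def rmsd(pred, ref):
    """RMSD (in the coordinates' units, e.g. angstroms) between two
    equal-length lists of (x, y, z) points, assumed pre-aligned."""
    assert len(pred) == len(ref)
    sq = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(pred, ref)
    )
    return math.sqrt(sq / len(pred))

# Toy 3-residue C-alpha traces (made-up coordinates, in angstroms).
ref  = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
pred = [(0.5, 0.0, 0.0), (3.8, 0.5, 0.0), (7.6, 0.0, 0.5)]
print(f"{rmsd(pred, ref):.2f} A")  # -> 0.50 A, i.e. near experimental accuracy
```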
Joshua Meier They're one of the hardest things to actually model with protein folding tools. So one of the things we've done at Absci, for example, is develop state-of-the-art capabilities for antibody-antigen folding, so we can get a really good idea of what our potential drugs are gonna look like before we even go test them in the lab. But where things start to get really interesting is that accuracy becomes a lot harder to measure when you're no longer in a prediction setting. For protein folding, you have a sequence, and there's some structure.
Joshua Meier Right? There is some answer: that sequence folds up into some structure. I mean, maybe it's dynamic; we can have a whole conversation about whether there is one answer or not.
Joshua Meier But for the most part, it's a prediction problem. When you move into a design setting, it's not like that anymore. I take some target antigen, I design antibodies against it. What is the correct solution?
Joshua Meier I mean, there is no single correct solution. There are probably hundreds of correct solutions, and it really depends on the constraints you're putting into the model. So that's where evaluation becomes critical. The NLP people realized this a few years ago and started to put together all these benchmarks, which things like GPT-4 just blew out of the water. And now it's like, did you just memorize all the benchmarks?
Joshua Meier How do we even evaluate models anymore? You're getting to that asymptote. In NLP, the big issue now will be evaluation. In biology, it's actually more convenient, because we have high-throughput experimental capabilities. If I have 100,000 proteins that I generate, I don't need 100,000 human labelers to go look at each of those. I just have 100,000 bacteria do it, and that's actually a lot more scalable.
Joshua Meier So I think this is one of the funny things about biology. It feels very esoteric, and it feels very hard to build the data. But going back to that point about asymptotic limits: for NLP, evaluation is starting to become really hard, and this is somewhere I think biology will be able to pass it, because, again, you can evaluate all of this within the lab in a very scalable way.
Simon Interesting. So the problem is kinda flipped. On one hand, it's a dearth of data but easy evaluations, and then the mirror inverse of that for NLP. Yeah. I mean, that's pretty exciting.
Simon I would love it if the dominant headlines around AI started to become mostly biological. And it seems like with AlphaFold, it really did. I remember going home for Thanksgiving; it was the first time I'd ever been asked about protein folding over turkey, so that was cool.
Simon Thank you for that explanation. I think that makes a lot of sense. And the devil's in the details with these things. Good enough can be fine for some downstream tasks, but to your point on antibodies, it's really gotta be virtually perfect to keep going down toward the next thing. So, maybe in the last couple of minutes, some open-ended questions around the pace of technological improvement and what this field looks like, let's say, a decade from now.
Simon And if you think about the enabling technologies on both the dry lab and the wet lab side that could really boost your capabilities, but also the whole field of drug discovery: it's an exciting thing for us because we're looking at the full stack, hardware, wetware, software. And I'm interested whether you guys have strong opinions about that.
Sean McClain Absolutely. I can start off with this. So where this is all headed is personalized medicine. What we're gonna start to see over the next 5 to 10 years is that 4% success rate start to increase, to 10%, to 20%, and so forth. And what that actually enables is going after smaller and smaller patient populations, because you're no longer having to pay for as many drugs that failed in the clinic.
Sean McClain I mean, that's why drug development's so expensive: you pay for the 96% that ultimately failed. But as you increase that success rate, you can go after smaller and smaller patient populations, ultimately getting to the point where it's cheap enough to do personalized medicine: being able to take a patient sample, find the target that's relevant for that disease, and then design an antibody that not only hits the epitope and has the affinity that you want, but where the model actually starts to understand the biology. It knows that target, and it says, okay, this is the epitope and the affinity you need in order to achieve the biology you're looking for.
Sean McClain And I think that's the next big step for us: not only designing the antibody to hit the target we want with the affinity we want, but actually starting to understand the biology. A big next step for us is being able to scale data around the biology, so that when a brand-new target comes in, the model can give me the antibody that hits the epitope with the affinity that achieves my biology. And I see synthetic biology playing a very important role in all of this. You see how important synthetic biology was for our success: we started out as a synthetic biology company, and that technology is what allowed us to scale biological data to train the AI models. The synthetic biology technology on its own, to get the data, yes, we could get by with it, but it wasn't going to solve the 4% success rate.
Sean McClain It wasn't gonna solve the decreased time to clinic. You need to combine the two in order to solve these biological problems. And that's the exciting thing, because the next problem we have to solve is in the wet lab: how do you scale the biology data to train the models? That's another synbio problem. So ultimately, synbio is gonna play a very, very important role in developing technologies that can scale data to answer the next question for AI.
Simon Got it. Got it. So two things, I think, are shining through in that for me. The first is that people may not understand the disincentives around creating drugs against rare disease. Right?
Simon Like, to your point, you underwrite this huge investment with a very low probability of success, and to recoup your investment as a drug company on the back end, you've got a finite window of patent protection to go out and make that money back before there's competition. And not only is it logistically difficult to source and find these patients when they're scattered about, but there just aren't many of them. And if you have a treatment that, in some cases, like we're seeing with some of these gene editing techniques, could potentially be a cure, then you're deleting that person's status as a sick person, which is great for the patient and, I think, great for humanity.
Simon But to your point, the economic incentive has to be there. And so by increasing the probability of success, you make tractable a greater subset of diseases. So I think that's a really good point. And then on the synbio side, the genetic engineering, you're right. As this whole conversation has shown, there's a lack of the biological data you need to train these models. Using large animal systems that require a lot of time, immunization in the case of antibodies, is just not a high-enough-throughput screening tool.
Simon Right? So we've got to figure out in vitro mechanisms for generating data that's not only abundant, but high quality, testable, functional. And to do that, we might have to do some genetic engineering of our own on these single-cell organisms. So I think those are both great points. Joshua, did you have anything else you wanted to add to the mix here?
Joshua Meier Yeah. Maybe one thing on the technical side, too, that I find really interesting here is where commoditization appears across the value chain. One of the nice things about working at this intersection of AI and data is that, for the inputs and outputs to our models, the costs seem to be coming down really quickly. On the inputs, our model designs all these sequences, and we need to go synthesize DNA encoding those sequences.
Joshua Meier So you're starting to see the cost of DNA synthesis really come down over time. And then, on the back end, we take all this DNA, we run it through our E. coli system, we have this flow-cytometry-based assay, and at the end of the day, you've got sequences.
Joshua Meier You've got DNA, and you need to go sequence it. And sequencing costs, as we know, are also going down dramatically. So one really nice thing about working in this field is that if we think about the dollar cost of every data point we're creating, that number is going down over time, which means that for the same amount of money, we can produce more data over time. It's a really nice place to be. And it's actually reminiscent, I think, of the early days of computers.
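The compounding effect of a falling cost per data point is easy to make concrete. The halving rate and dollar figures below are assumptions chosen for the sketch, not numbers from the conversation:

```python
# Assumed: a fixed budget, and the cost per data point (DNA synthesis in,
# sequencing out) halving every cycle. Both figures are illustrative.
def affordable_data_points(budget, cost_per_point, halvings):
    """Data points a fixed budget buys after `halvings` cost halvings."""
    return int(budget / (cost_per_point * 0.5 ** halvings))

budget, cost_today = 1_000_000.0, 1.0  # $1M budget, $1 per data point today
for h in range(4):
    print(h, affordable_data_points(budget, cost_today, h))
# -> 0 1000000 / 1 2000000 / 2 4000000 / 3 8000000
```

The same budget buys exponentially more training data as input and output costs fall, which is the dynamic Joshua compares to early computing hardware.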
Joshua Meier So, you know, when Apple and Microsoft were building the first personal computers, it was just really expensive to get hold of that hardware. And what did Microsoft do? They kind of put out all the blueprints for how to make a computer, people looked at that, and then all the OEMs started to make their own computers. They commoditized it, it got really cheap, and then it's like, well, you have all this hardware, but now you need the software, so you go buy MS-DOS. So I think it's a really interesting dynamic, and it's a way that really massive companies start to get built.
Joshua Meier It happens when you have a real advantage in the market because things around you are getting cheaper, but the problem you're working on is very hard, and you have a differentiated angle on it.
Sean McClain And I think, at the end of the day, the generative AI companies that are going to win, and this is across all industries, are those that own and control the data. At the end of the day, those are gonna be the companies that ultimately win. And if you're looking at this through an investor lens, especially in this space, the differentiation is the data: where are you getting your data, and how are you training on it? Because that's ultimately what's going to enable you to have a competitive moat.
Sean McClain Because at the end of the day, when we go fully in silico and we're able to design a drug at the click of a button, it's gonna be very hard for people to catch up to us, because we've spent so much time training the models on a ton of data to increase the overall accuracy, and we've done our own model designs as well. So it comes down to data. Data is key to success in generative AI.
Simon Yeah. And this is another one of those cool KPIs we were talking about earlier: your dollar per data point, governed by the inputs and outputs. That's another one I like. And maybe as the last comment or question, I'll open it up to you guys. Joshua, I really wanted to zoom in on the statement you made about the cost of a data point coming down. It seems like, if you look at the last 50 years of drug discovery and drug design, as that improvement vector has continued churning along and the cost per data point comes down, there's a point at which you almost stop taking the traditional hypothesis-driven approach to your problem, and you just start letting the data points speak for themselves and tell you what to do, because it's cheap enough to do it that way.
Simon And, of course, experimental design, I'm not saying throw that all out the window. But how do you think about hypothesis-driven science in an era where there's abundance in the data you're able to generate?
Joshua Meier That's a great question. I kind of did this thought experiment myself recently. I don't do enough coding in my day job anymore, so I said, let me just try to build something in a programming language I haven't worked with for many years. Right?
Joshua Meier I started building an app, and I used GPT-4 as kind of my copilot to help me build it. And I realized something really interesting happened. I'd never really done this before: I was literally just copying and pasting the code it was giving me and putting it straight into Xcode to build my app. Usually, you know, I'd read it very carefully, pull out the components I need, rename the variables.
Joshua Meier I was kinda just going on autopilot at some point. Right? And I imagine a similar thing might start to happen in drug discovery, where the model starts to get so good that you're just like, okay, the model says go test this antibody. You just do it.
Joshua Meier You don't even think twice, because it just becomes second nature to trust this thing. So I think that might be where the field is heading. Of course, no one knows, but it's kind of cool to work in this space where, even though no one knows how things are gonna play out in drug discovery, you do see it playing out in an earlier field. So from a lot of those product experiences, you get a sense of what the future is gonna be like in this industry. So, yeah, that's one thing that might happen.
Joshua Meier Right? It could really change the game of how we do science here.
Sean McClain So my question is: what app are you building?
Joshua Meier I was just playing around with some cool AI capabilities. I mean, that was the cool thing about GPT-4: I was able to build that app in just an evening. Right?
Joshua Meier Actually, Sean is making me push on this. How was I able to do it so quickly? I had some playbook. Right?
Joshua Meier I wrote down on a piece of paper what this small app should look like, and started building it that way. So I think it could lead to a world where we're just a lot more efficient, where scientists, instead of getting distracted by fancy technologies, just trust the model and think about the real application you wanna go after. Right?
Joshua Meier So you think about your disease indication, for example; you can be very passionate about that. It allows us to be more strategic about biotech and stop spending as much money chasing a lot of fancy technologies. I think drug discovery is really hard, so it's gonna take us some time to figure this out, and I think we're kind of being the trailblazers here at Absci, thinking about what that future looks like. So for folks who are excited about this, come join us and work with us on that journey. But, yeah, I think it's a really exciting time to be working in science more generally, because AI is, I think, gonna really revolutionize the way we think and do our work.
Simon Well, I think that's a good place to end it. Guys, it's been a blast. For people who have been listening in and stuck with us through the end here, please go follow Absci on Twitter. We'll be sure to link all the recent literature so people can stay engaged, learn about what you're doing, and make sure they're informed. But other than that, Sean, Joshua, thank you for spending an hour talking with us about this.
Simon It was a lot of fun, and I hope to see and talk to you guys again soon.
Sean McClain Yeah. Thanks so much, Simon.
Joshua Meier Thanks, Simon. It was a lot of fun.
n/a ARC believes that the information presented is accurate and was obtained from sources that ARC believes to be reliable. However, ARC does not guarantee the accuracy or completeness of any information, and such information may be subject to change without notice from ARC. Historical results are not indications of future results. Certain of the statements contained in this podcast may be statements of future expectations and other forward-looking statements that are based on ARC's current views and assumptions and involve known and unknown risks and uncertainties that could cause actual results, performance, or events to differ materially from those expressed or implied in such statements.