{"id":1163921,"date":"2026-03-23T08:00:21","date_gmt":"2026-03-23T15:00:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1163921"},"modified":"2026-04-20T07:45:44","modified_gmt":"2026-04-20T14:45:44","slug":"will-machines-ever-be-intelligent","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/podcast\/will-machines-ever-be-intelligent\/","title":{"rendered":"Will machines ever be intelligent?\u00a0"},"content":{"rendered":"\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"yt-consent-placeholder\" role=\"region\" aria-label=\"Video playback requires cookie consent\" data-video-id=\"6gdlcbhtqSk\" data-poster=\"https:\/\/img.youtube.com\/vi\/6gdlcbhtqSk\/maxresdefault.jpg\"><iframe aria-hidden=\"true\" tabindex=\"-1\" title=\"Will machines ever be intelligent?\" width=\"500\" height=\"281\" data-src=\"https:\/\/www.youtube-nocookie.com\/embed\/6gdlcbhtqSk?feature=oembed&rel=0&enablejsapi=1\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><div class=\"yt-consent-placeholder__overlay\"><button class=\"yt-consent-placeholder__play\"><svg width=\"42\" height=\"42\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" aria-hidden=\"true\" focusable=\"false\"><g fill=\"none\" fill-rule=\"evenodd\"><circle fill=\"#000\" opacity=\".556\" cx=\"21\" cy=\"21\" r=\"21\"\/><path stroke=\"#FFF\" d=\"M27.5 22l-12 8.5v-17z\"\/><\/g><\/svg><span class=\"yt-consent-placeholder__label\">Video playback requires cookie consent<\/span><\/button><\/div><\/div>\n<\/div><\/figure>\n\n\n<div class=\"wp-block-msr-podcast-container my-4\">\n\t<iframe loading=\"lazy\" src=\"https:\/\/player.blubrry.com\/?podcast_id=153442105&modern=1\" 
class=\"podcast-player\" frameborder=\"0\" height=\"164px\" width=\"100%\" scrolling=\"no\" title=\"Podcast Player\"><\/iframe>\n<\/div>\n\n\n\n<p>Technical\u202fadvances are moving at such a rapid pace that it can be challenging to define the tomorrow we\u2019re working toward. In\u202f<em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/story\/the-shape-of-things-to-come\/\" type=\"link\" id=\"https:\/\/www.microsoft.com\/en-us\/research\/story\/the-shape-of-things-to-come\/\">The\u202fShape of Things to Come<\/a><\/em>, Microsoft Research leader <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dburger\/\" type=\"person\" id=\"31582\">Doug Burger<\/a>\u202fand experts from across disciplines tease out the thorniest AI\u202fissues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive.\u202f<\/p>\n\n\n\n<p>In this first episode of the series, Burger is joined by <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/fusi\/\" type=\"person\" id=\"31829\">Nicol\u00f2 Fusi<\/a> of Microsoft Research and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/subutai\/\" type=\"link\" id=\"https:\/\/www.linkedin.com\/in\/subutai\/\" target=\"_blank\" rel=\"noopener noreferrer\">Subutai Ahmad<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> of Numenta to examine whether today\u2019s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain\u2019s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. 
The discussion probes what intelligence really means, where current models excel or fall short, and what future AI systems might need to bridge the gap.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/story\/the-shape-of-things-to-come\/\">The Shape of Things to Come podcast series<\/a><\/div>\n<\/div>\n\n\n\n<section class=\"wp-block-msr-subscribe-to-podcast subscribe-to-podcast\">\n\t<div class=\"subscribe-to-podcast__inner border-top border-bottom border-width-2\">\n\t\t<h2 class=\"h5 subscribe-to-podcast__heading\">\n\t\t\tSubscribe to the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/podcast\">Microsoft Research Podcast<\/a>:\t\t<\/h2>\n\t\t<ul class=\"subscribe-to-podcast__list list-unstyled\">\n\t\t\t\t\t\t\t<li class=\"subscribe-to-podcast__list-item\">\n\t\t\t\t\t<a class=\"subscribe-to-podcast__link\" href=\"https:\/\/itunes.apple.com\/us\/podcast\/microsoft-research-a-podcast\/id1318021537?mt=2\" target=\"_blank\" rel=\"noreferrer noopener\">\n\t\t\t\t\t\t<svg class=\"subscribe-to-podcast__svg\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" fill=\"black\" viewBox=\"0 0 32 32\">  <path d=\"M7.12 0c-3.937-0.011-7.131 3.183-7.12 7.12v17.76c-0.011 3.937 3.183 7.131 7.12 7.12h17.76c3.937 0.011 7.131-3.183 7.12-7.12v-17.76c0.011-3.937-3.183-7.131-7.12-7.12zM15.817 3.421c3.115 0 5.932 1.204 8.079 3.453 1.631 1.693 2.547 3.489 3.016 5.855 0.161 0.787 0.161 2.932 0.009 3.817-0.5 2.817-2.041 5.339-4.317 7.063-0.812 0.615-2.797 1.683-3.115 1.683-0.12 0-0.129-0.12-0.077-0.615 0.099-0.792 0.192-0.953 0.64-1.141 0.713-0.296 1.932-1.167 2.677-1.911 1.301-1.303 2.229-2.932 2.677-4.719 0.281-1.1 0.244-3.543-0.063-4.672-0.969-3.595-3.907-6.385-7.5-7.136-1.041-0.213-2.943-0.213-4 0-3.636 0.751-6.647 3.683-7.563 7.371-0.245 
1.004-0.245 3.448 0 4.448 0.609 2.443 2.188 4.681 4.255 6.015 0.407 0.271 0.896 0.547 1.1 0.631 0.447 0.192 0.547 0.355 0.629 1.14 0.052 0.485 0.041 0.62-0.072 0.62-0.073 0-0.62-0.235-1.199-0.511l-0.052-0.041c-3.297-1.62-5.407-4.364-6.177-8.016-0.187-0.943-0.224-3.187-0.036-4.052 0.479-2.323 1.396-4.135 2.921-5.739 2.199-2.319 5.027-3.543 8.172-3.543zM16 7.172c0.541 0.005 1.068 0.052 1.473 0.14 3.715 0.828 6.344 4.543 5.833 8.229-0.203 1.489-0.713 2.709-1.619 3.844-0.448 0.573-1.537 1.532-1.729 1.532-0.032 0-0.063-0.365-0.063-0.803v-0.808l0.552-0.661c2.093-2.505 1.943-6.005-0.339-8.296-0.885-0.896-1.912-1.423-3.235-1.661-0.853-0.161-1.031-0.161-1.927-0.011-1.364 0.219-2.417 0.744-3.355 1.672-2.291 2.271-2.443 5.791-0.348 8.296l0.552 0.661v0.813c0 0.448-0.037 0.807-0.084 0.807-0.036 0-0.349-0.213-0.683-0.479l-0.047-0.016c-1.109-0.885-2.088-2.453-2.495-3.995-0.244-0.932-0.244-2.697 0.011-3.625 0.672-2.505 2.521-4.448 5.079-5.359 0.547-0.193 1.509-0.297 2.416-0.281zM15.823 11.156c0.417 0 0.828 0.084 1.131 0.24 0.645 0.339 1.183 0.989 1.385 1.677 0.62 2.104-1.609 3.948-3.631 3.005h-0.015c-0.953-0.443-1.464-1.276-1.475-2.36 0-0.979 0.541-1.828 1.484-2.328 0.297-0.156 0.709-0.235 1.125-0.235zM15.812 17.464c1.319-0.005 2.271 0.463 2.625 1.291 0.265 0.62 0.167 2.573-0.292 5.735-0.307 2.208-0.479 2.765-0.905 3.141-0.589 0.52-1.417 0.667-2.209 0.385h-0.004c-0.953-0.344-1.157-0.808-1.553-3.527-0.452-3.161-0.552-5.115-0.285-5.735 0.348-0.823 1.296-1.285 2.624-1.291z\"\/><\/svg>\n\t\t\t\t\t\t<span class=\"subscribe-to-podcast__link-text\">Apple Podcasts<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/li>\n\t\t\t\n\t\t\t\t\t\t\t<li class=\"subscribe-to-podcast__list-item\">\n\t\t\t\t\t<a class=\"subscribe-to-podcast__link\" href=\"https:\/\/subscribebyemail.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\" target=\"_blank\" rel=\"noreferrer noopener\">\n\t\t\t\t\t\t<svg class=\"subscribe-to-podcast__svg\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" fill=\"none\" viewBox=\"0 0 32 32\"><path 
fill=\"currentColor\" d=\"M6.4 6a2.392 2.392 0 00-2.372 2.119L16 15.6l11.972-7.481A2.392 2.392 0 0025.6 6H6.4zM4 10.502V22.8a2.4 2.4 0 002.4 2.4h19.2a2.4 2.4 0 002.4-2.4V10.502l-11.365 7.102a1.2 1.2 0 01-1.27 0L4 10.502z\"\/><\/svg>\n\t\t\t\t\t\t<span class=\"subscribe-to-podcast__link-text\">Email<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/li>\n\t\t\t\n\t\t\t\t\t\t\t<li class=\"subscribe-to-podcast__list-item\">\n\t\t\t\t\t<a class=\"subscribe-to-podcast__link\" href=\"https:\/\/subscribeonandroid.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\" target=\"_blank\" rel=\"noreferrer noopener\">\n\t\t\t\t\t\t<svg class=\"subscribe-to-podcast__svg\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" fill=\"none\" viewBox=\"0 0 32 32\"><path fill=\"currentColor\" d=\"M12.414 4.02c-.062.012-.126.023-.18.06a.489.489 0 00-.12.675L13.149 6.3c-1.6.847-2.792 2.255-3.18 3.944h13.257c-.388-1.69-1.58-3.097-3.179-3.944l1.035-1.545a.489.489 0 00-.12-.675.492.492 0 00-.675.135l-1.14 1.68a7.423 7.423 0 00-2.55-.45c-.899 0-1.758.161-2.549.45l-1.14-1.68a.482.482 0 00-.494-.195zm1.545 3.824a.72.72 0 110 1.44.72.72 0 010-1.44zm5.278 0a.719.719 0 110 1.44.719.719 0 110-1.44zM8.44 11.204A1.44 1.44 0 007 12.644v6.718c0 .795.645 1.44 1.44 1.44.168 0 .33-.036.48-.09v-9.418a1.406 1.406 0 00-.48-.09zm1.44 0V21.76c0 .793.646 1.44 1.44 1.44h10.557c.793 0 1.44-.647 1.44-1.44V11.204H9.878zm14.876 0c-.169 0-.33.035-.48.09v9.418c.15.052.311.09.48.09a1.44 1.44 0 001.44-1.44v-6.719a1.44 1.44 0 00-1.44-1.44zM11.8 24.16v1.92a1.92 1.92 0 003.84 0v-1.92h-3.84zm5.759 0v1.92a1.92 1.92 0 003.84 0v-1.92h-3.84z\"\/><\/svg>\n\t\t\t\t\t\t<span class=\"subscribe-to-podcast__link-text\">Android<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/li>\n\t\t\t\n\t\t\t\t\t\t\t<li class=\"subscribe-to-podcast__list-item\">\n\t\t\t\t\t<a class=\"subscribe-to-podcast__link\" href=\"https:\/\/open.spotify.com\/show\/4ndjUXyL0hH1FXHgwIiTWU\" target=\"_blank\" rel=\"noreferrer noopener\">\n\t\t\t\t\t\t<svg class=\"subscribe-to-podcast__svg\" 
xmlns=\"http:\/\/www.w3.org\/2000\/svg\" fill=\"none\" viewBox=\"0 0 32 32\"><path fill=\"currentColor\" d=\"M16 4C9.383 4 4 9.383 4 16s5.383 12 12 12 12-5.383 12-12S22.617 4 16 4zm5.08 17.394a.781.781 0 01-1.086.217c-1.29-.86-3.477-1.434-5.303-1.434-1.937.002-3.389.477-3.403.482a.782.782 0 11-.494-1.484c.068-.023 1.71-.56 3.897-.562 1.826 0 4.365.492 6.171 1.696.36.24.457.725.217 1.085zm1.56-3.202a.895.895 0 01-1.234.286c-2.338-1.457-4.742-1.766-6.812-1.747-2.338.02-4.207.466-4.239.476a.895.895 0 11-.488-1.723c.145-.041 2.01-.5 4.564-.521 2.329-.02 5.23.318 7.923 1.995.419.26.547.814.286 1.234zm1.556-3.745a1.043 1.043 0 01-1.428.371c-2.725-1.6-6.039-1.94-8.339-1.942h-.033c-2.781 0-4.923.489-4.944.494a1.044 1.044 0 01-.474-2.031c.096-.023 2.385-.55 5.418-.55h.036c2.558.004 6.264.393 9.393 2.23.497.292.663.931.371 1.428z\"\/><\/svg>\n\t\t\t\t\t\t<span class=\"subscribe-to-podcast__link-text\">Spotify<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/li>\n\t\t\t\n\t\t\t\t\t\t\t<li class=\"subscribe-to-podcast__list-item\">\n\t\t\t\t\t<a class=\"subscribe-to-podcast__link\" href=\"https:\/\/www.blubrry.com\/feeds\/microsoftresearch.xml\" target=\"_blank\" rel=\"noreferrer noopener\">\n\t\t\t\t\t\t<svg class=\"subscribe-to-podcast__svg\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" fill=\"none\" viewBox=\"0 0 32 32\"><path fill=\"currentColor\" d=\"M6.667 4a2.676 2.676 0 00-2.612 2.13v.003c-.036.172-.055.35-.055.534v18.666c0 .183.019.362.055.534v.003a2.676 2.676 0 002.076 2.075h.002c.172.036.35.055.534.055h18.666A2.676 2.676 0 0028 25.333V6.667a2.676 2.676 0 00-2.13-2.612h-.003A2.623 2.623 0 0025.333 4H6.667zM8 8h1.333C17.42 8 24 14.58 24 22.667V24h-2.667v-1.333c0-6.618-5.382-12-12-12H8V8zm0 5.333h1.333c5.146 0 9.334 4.188 9.334 9.334V24H16v-1.333A6.674 6.674 0 009.333 16H8v-2.667zM10 20a2 2 0 11-.001 4.001A2 2 0 0110 20z\"\/><\/svg>\n\t\t\t\t\t\t<span class=\"subscribe-to-podcast__link-text\">RSS 
Feed<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t<\/ul>\n\t<\/div>\n<\/section>\n\n\n<div class=\"wp-block-msr-show-more\">\n\t<div class=\"bg-neutral-100 p-5\">\n\t\t<div class=\"show-more-show-less\">\n\t\t\t<div>\n\t\t\t\t<span>\n\t\t\t\t\t\n\n<h2 class=\"wp-block-heading\" id=\"transcript\">Transcript<\/h2>\n\n\n\n<p>[MUSIC]&nbsp;<\/p>\n\n\n\n<p><strong>DOUG&nbsp;BURGER:&nbsp;<\/strong>This is&nbsp;<em>The Shape of Things to Come,&nbsp;<\/em>a Microsoft Research Podcast.&nbsp;I\u2019m&nbsp;your host, Doug Burger. In this series,&nbsp;we\u2019re&nbsp;going to venture to the bleeding edge of AI capabilities, dig down into the fundamentals, really try to understand them, and think about how these capabilities are going to&nbsp;change the world\u2014for better and worse.&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p>In today\u2019s podcast,&nbsp;I\u2019m&nbsp;bringing on two AI researcher-experts:&nbsp;Nicol\u00f2&nbsp;Fusi, who is an expert in digital, transformer-based large language model architectures and learning,&nbsp;and&nbsp;Subutai&nbsp;Ahmad, who is an expert in biological architectures, specifically the human brain. And the question&nbsp;we\u2019re&nbsp;going to discuss is, are machines intelligent?&nbsp;&nbsp;<\/p>\n\n\n\n<p>And what I mean by that: are digital intelligence, large language models,&nbsp;on a path to surpass humans, or are the architectures&nbsp;just&nbsp;so fundamentally different that one will do one set of things&nbsp;well,&nbsp;the other will do something else very well? 
And&nbsp;so&nbsp;we\u2019ll&nbsp;be debating the architecture of intelligence across digital implementations and biological implementations&nbsp;because the answer to that question,&nbsp;I think,&nbsp;really will&nbsp;determine&nbsp;the shape of things to come.&nbsp;<\/p>\n\n\n\n\t\t\t\t<\/span>\n\t\t\t\t<span id=\"show-more-show-less-toggle-2\" class=\"show-more-show-less-toggleable-content\">\n\t\t\t\t\t\n\n\n\n<p>[MUSIC FADES]&nbsp;<\/p>\n\n\n\n<p>I&#8217;d&nbsp;like to ask each of my guests to introduce themselves. Tell me a little bit about your background&nbsp;and what&nbsp;you&#8217;re&nbsp;currently working on\u2014to the extent you can talk about it\u2014in AI. So,&nbsp;Nicol\u00f2, would you please start?&nbsp;<\/p>\n\n\n\n<p><strong>NICOL\u00d2&nbsp;FUSI:&nbsp;<\/strong>Yeah, thank you, Doug,&nbsp;for having us&nbsp;and having me&nbsp;here.&nbsp;It&#8217;s&nbsp;so much fun.&nbsp;So&nbsp;I&#8217;m&nbsp;Nicol\u00f2 Fusi. I&#8217;m&nbsp;a researcher at MSR&nbsp;[Microsoft Research].&nbsp;So&nbsp;Doug is my boss, so I will be very,&nbsp;very, very good&nbsp;to Doug&nbsp;in this podcast.&nbsp;&nbsp;<\/p>\n\n\n\n<p>No, but jokes aside, my own background is in Bayesian nonparametrics.&nbsp;That&#8217;s&nbsp;what I started studying. So Gaussian processes and things like that. And then equally,&nbsp;I would say,&nbsp;in&nbsp;computational biology, because I found it,&nbsp;like,&nbsp;one of the most interesting use cases&nbsp;for AI techniques. And that,&nbsp;kind of,&nbsp;has been&nbsp;true throughout my career. And&nbsp;pretty much like&nbsp;everybody else,&nbsp;eventually,&nbsp;I moved away from the kernel methods and the Bayesian&nbsp;nonparametrics&nbsp;and I started working more on language models, transformer models,&nbsp;with a particular eye towards information theory and the connection between information theory and generative modeling. 
And that&#8217;s,&nbsp;kind of,&nbsp;one of the main things I do today other than, kind of, managing the research of people&nbsp;who&nbsp;do much more interesting work than I do.&nbsp;[LAUGHS]&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>I&nbsp;have to&nbsp;interject there,&nbsp;Nicol\u00f2, because you dragged a piece of bait across my path.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I&nbsp;figured.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>You know,&nbsp;at&nbsp;Microsoft&nbsp;Research, I have a management rule that I&nbsp;can&#8217;t&nbsp;tell anyone what to do because we hire some of the best people in the world.&nbsp;You have to trust them.&nbsp;And everyone is always completely free to call BS on me.&nbsp;And so&nbsp;Nicol\u00f2&nbsp;was joking&nbsp;there;&nbsp;[LAUGHTER]&nbsp;he does not have to toe the party line. In fact, I encourage him not to. So, so&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:<\/strong>&nbsp;I just have to be well-behaved.&nbsp;That&#8217;s&nbsp;the only thing I will say.&nbsp;[LAUGHS]&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah. Thank you, thank you for&nbsp;baiting&nbsp;me.&nbsp;[LAUGHS]&nbsp;Because&nbsp;he knew exactly what he was doing. And I love him for it.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Subutai, can you tell us a little bit about yourself?&nbsp;<\/p>\n\n\n\n<p><strong>SUBUTAI&nbsp;AHMAD:<\/strong>&nbsp;Sure. Thank you so much, Doug, for having me.&nbsp;I&#8217;m&nbsp;really looking forward to the conversation between us all.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;I see myself fundamentally as a computer scientist.&nbsp;You know, I&#8217;ve&nbsp;been studying computer science for longer than I care to admit.&nbsp;But&nbsp;something changed for me during my&nbsp;undergrad&nbsp;years. 
I decided to minor in cognitive psychology, and I started to get really interested in how the brain works.&nbsp;<\/p>\n\n\n\n<p>And to me, understanding intelligence and implementing intelligence was the hardest problem a computer scientist&nbsp;could ever&nbsp;solve.&nbsp;So&nbsp;I got&nbsp;very, very interested&nbsp;in that. You know, I&nbsp;couldn&#8217;t&nbsp;see how to really commercialize that. I was&nbsp;very interested&nbsp;in making&nbsp;products and stuff.&nbsp;So&nbsp;I stopped, you know, working on that for a while. I did a number of startups doing computer vision, you know, video processing, a&nbsp;lot of that stuff.&nbsp;<\/p>\n\n\n\n<p>And then when Jeff Hawkins started&nbsp;Numenta&nbsp;back in 2005 with the idea of really deeply understanding how the brain works and figuring out how to apply that to AI, for me, it was like all my worlds coming together.&nbsp;This, like, this&nbsp;is what I had to do. None of us&nbsp;thought&nbsp;[LAUGHS]&nbsp;it would take as long as it did.&nbsp;We spent the last couple of decades&nbsp;really deeply&nbsp;trying to understand neuroscience from a computer scientist&#8217;s\u2014from a programmer&#8217;s\u2014standpoint, the underlying algorithms. And&nbsp;that&#8217;s&nbsp;really what&nbsp;I&#8217;m passionate about, just trying to translate what we understand about&nbsp;the neuroscience&nbsp;to today&#8217;s AI.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And in terms of what we&#8217;re working on today, it&#8217;s, you know, the human\u2014maybe we&#8217;ll get into some of this\u2014the brain is&nbsp;super efficient&nbsp;in how it works\u2014power efficient,&nbsp;energy efficient\u2014and we&#8217;re trying to embody those ideas and trying to make AI a lot more efficient than it is today.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Great. 
I think&nbsp;we&#8217;ll&nbsp;get into efficiency a little bit later in the podcast&nbsp;because&nbsp;that&#8217;s&nbsp;a subject that&#8217;s near and dear to my heart, you know, being a computer architect&nbsp;originally by training.&nbsp;&nbsp;<\/p>\n\n\n\n<p>I want to go back to, you know, one of the reasons I got involved with&nbsp;Numenta&nbsp;is, you know,&nbsp;Subutai&nbsp;and I have been exchanging emails, like,&nbsp;discussing collaborations, you know, visiting each other through the years, and the thing that really stuck with me was when I read one of the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/us.macmillan.com\/books\/9780805078534\/onintelligence\/\" target=\"_blank\" rel=\"noopener noreferrer\">earlier books from Jeff&nbsp;<em>On Intelligence<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;And there was an example in the book that talked about how, you know,&nbsp;the human brain learns continuously.&nbsp;I think&nbsp;biological organisms in general learn continuously.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And the anecdote that I remember was this anecdote&nbsp;if you&#8217;re walking down your basement steps, you know, you&#8217;re walking down the stairs to your basement and there&#8217;s one step&nbsp;that\u2019s&nbsp;always been a few inches off and you decide to fix it, and so you raise it so it&#8217;s even with the others, and then the next time you go down the stairs, you don&#8217;t remember and you&#8217;re wildly off and,&nbsp;you know, you hit that step, you hit it earlier or later than you anticipated, you go out of balance.&nbsp;You&#8217;re&nbsp;flailing around. You know, you get all this adrenaline. You think&nbsp;you&#8217;re&nbsp;going to pitch headfirst down the stairs.&nbsp;Hopefully&nbsp;you&nbsp;don&#8217;t. And then the second time you do it,&nbsp;you&#8217;re&nbsp;a little off balance, but&nbsp;it&#8217;s&nbsp;not crazy. 
And the third time you&nbsp;maybe notice&nbsp;a little bit, and the fourth time,&nbsp;it&#8217;s,&nbsp;like,&nbsp;it&#8217;s&nbsp;your basement stairs.&nbsp;<\/p>\n\n\n\n<p>And so somewhere between that first time down and the third and fourth times down, there are molecular changes in your brain that have learned the new timing of your basement steps. And I remember just that example vividly from the book.&nbsp;And that got me thinking, wow, this&nbsp;is&nbsp;<em>so<\/em>&nbsp;different from the way our digital AI works.&nbsp;I&#8217;ll&nbsp;turn it over to you to comment on&nbsp;that&nbsp;and&nbsp;then I think&nbsp;we&#8217;ll&nbsp;go&nbsp;into the digital.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah, no,&nbsp;that&#8217;s&nbsp;a great example. I think&nbsp;it&#8217;s&nbsp;remarkable how our brain is constantly modeling our entire world at such a granular level, and&nbsp;we&#8217;re&nbsp;not even aware of it perceptually. Like, you know, that example of the steps is&nbsp;probably not&nbsp;\u2026&nbsp;you&nbsp;wouldn&#8217;t&nbsp;consciously be aware of it, yet&nbsp;if something is different about anything in your world that&nbsp;you&#8217;re&nbsp;very familiar&nbsp;with,&nbsp;you&#8217;ll&nbsp;instantly notice it.&nbsp;And then you&#8217;ll, you know, you&#8217;ll update your world model, you&#8217;ll adjust,&nbsp;and you&#8217;ll continue on.&nbsp;It&#8217;s&nbsp;really remarkable how the&nbsp;brain\u2019s&nbsp;able to do that so seamlessly.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>And a lot of that is based on neurotransmitters, right? 
Because&nbsp;there&#8217;s&nbsp;just a&nbsp;\u2026&nbsp;you know, when you have that physical reaction to&nbsp;\u201cI\u2019m&nbsp;about to pitch down the stairs,\u201d&nbsp;you get a flood of transmitters that&nbsp;actually changes&nbsp;the way your brain&#8217;s learning&nbsp;or at least the rate.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah,&nbsp;there&#8217;s&nbsp;a flood of&nbsp;neurotransmitters and neuromodulators, as well, that&nbsp;invoke change, sometimes very rapidly. Another example, you know, if you touch a hot stove\u2014that&#8217;s&nbsp;the canonical example\u2014you will learn that very, very quickly.&nbsp;So&nbsp;there&#8217;s&nbsp;a lot of chemical changes that happen.&nbsp;But it&#8217;s also really interesting that we can update things and update our world knowledge without impacting everything else that we know.&nbsp;This is&nbsp;something&nbsp;that&#8217;s&nbsp;very, very different, again, from today&#8217;s AI models. We&#8217;re&nbsp;able to make these changes in a very contextual and very,&nbsp;sort of,&nbsp;fine-grained way.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>So,&nbsp;Nicol\u00f2, I want to&nbsp;go and talk&nbsp;a little bit now to&nbsp;transformers.&nbsp;So&nbsp;I think, you know, you and I and&nbsp;Subutai&nbsp;were&nbsp;all working in the AI field, you know,&nbsp;many years before 2017,&nbsp;when the transformer hit. You know, I was building, you know, with my team hardware to accelerate RNNs&nbsp;[recurrent neural networks], LSTMs&nbsp;[long short-term memory], you know, which had this awful&nbsp;loop-carried&nbsp;dependence, you know,&nbsp;the&nbsp;bottlenecked&nbsp;computation, and then the transformer was just much more parallelizable.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;what do&nbsp;you&nbsp;think&#8217;s&nbsp;really going on in these things? And&nbsp;maybe we&nbsp;could start\u2014I know&nbsp;you&nbsp;and I have talked a lot about this\u2014maybe&nbsp;just&nbsp;start with the major blocks. 
You know,&nbsp;you&#8217;ve got&nbsp;the attention layer. You&#8217;ve&nbsp;got the feedforward layer.&nbsp;You&#8217;ve&nbsp;got, you know, the encoder stack and the decoder stack and the latent space in between.&nbsp;Can you just,&nbsp;kind of,&nbsp;walk us through those pieces at&nbsp;a high level&nbsp;and tell us what you think is going on?&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Yeah.<strong>&nbsp;<\/strong>Yeah, I mean, I have a very opinionated view of&nbsp;why&nbsp;transformers are so great.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>That&#8217;s&nbsp;why&nbsp;you\u2019re&nbsp;here.&nbsp;[LAUGHS]&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Maybe,&nbsp;like,&nbsp;yeah,&nbsp;maybe&nbsp;I\u2019ll&nbsp;inject&nbsp;it.&nbsp;I&nbsp;don&#8217;t&nbsp;know.&nbsp;I&nbsp;don&#8217;t&nbsp;think&nbsp;it&#8217;s&nbsp;a super novel creative opinion, but it is an opinion.&nbsp;So&nbsp;I guess the two&nbsp;principal&nbsp;\u2026&nbsp;the two main components you already described:&nbsp;the, you know, the transformer&nbsp;[read: attention]&nbsp;layers and the feedforward layers. One&nbsp;way to think about them is,&nbsp;how does information in&nbsp;your&nbsp;context relate to each other and what is every token referring to, for&nbsp;instance, in the case of transformers in language models?&nbsp;<\/p>\n\n\n\n<p>So&nbsp;by context,&nbsp;we mean,&nbsp;like,&nbsp;the information you feed through the model, that the model keeps continuously generating and appending to.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:<\/strong>&nbsp;So&nbsp;like your chat&nbsp;history.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:<\/strong>&nbsp;Your prompt.&nbsp;Your what?&nbsp;Your chat&nbsp;history or your particular prompt in&nbsp;a&nbsp;chat session.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>OK.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>That prompt, which is a sequence of words, gets&nbsp;discretized in a series of tokens. 
Tokens can be individual words, can be multiple words,&nbsp;kind of,&nbsp;connected together. The way we go from words to tokens typically is through an algorithm that tries to&nbsp;basically collapse&nbsp;as much as possible.&nbsp;Multiple words,&nbsp;like \u201cthe dog,\u201d&nbsp;may&nbsp;be just one token as a first,&nbsp;kind of,&nbsp;level of compression to feed into the model.&nbsp;So&nbsp;it just tries to bring things together as efficiently as possible.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Then there&nbsp;is, you know, within these&nbsp;models, there is a transformer layer. This transformer layer or this attention layer, sorry,&nbsp;tries to&nbsp;basically figure&nbsp;out what the&nbsp;\u201cthe\u201d&nbsp;refers to\u2014the term&nbsp;\u201c<em>the<\/em>\u201d&nbsp;in&nbsp;\u201cthe dog,\u201d&nbsp;or \u201cthe dog&nbsp;<em>jumps<\/em>&nbsp;on the table,\u201d&nbsp;\u201cjumps\u201d&nbsp;refers to the dog.&nbsp;So&nbsp;there is this kind&nbsp;of,&nbsp;like,&nbsp;mapping that happens.<\/p>\n\n\n\n<p>And then there&nbsp;is,&nbsp;like, feedforward layers, which in modern large language models,&nbsp;they store a lot of information. 
Like,&nbsp;that&#8217;s kind of,&nbsp;like,&nbsp;where the knowledge typically&nbsp;kind of sits&nbsp;in, the things that the model just&nbsp;<em>knows<\/em>.&nbsp;You know,&nbsp;that, I&nbsp;don&#8217;t&nbsp;know,&nbsp;if&nbsp;you slam your arm against&nbsp;[the]&nbsp;cup of water on your table, that cup of water falls off the table.&nbsp;That&#8217;s something that the model,&nbsp;kind of,&nbsp;has baked in through reading a lot&nbsp;about&nbsp;cups falling off of tables when&nbsp;they\u2019re hit.&nbsp;<\/p>\n\n\n\n<p>So that&#8217;s,&nbsp;kind of, those are,&nbsp;for me, the two fundamental components,&nbsp;and the reason why I have an opinionated&nbsp;view is that, you know, honestly, I do believe that RNNs and, you know, even state-space\u2014<em>modern<\/em>&nbsp;incarnations&nbsp;of state-space&nbsp;models\u2014are good enough to learn over these, you know, language data or whatever&nbsp;or&nbsp;vision data or audio data.&nbsp;<\/p>\n\n\n\n<p>The good thing about transformers is that they do two things very well. One&nbsp;is&nbsp;they get out of the way. They&nbsp;don&#8217;t&nbsp;have this notion of&nbsp;\u201ceverything has to be encoded through a state\u201d&nbsp;like recurrent networks. And two, they do that very computationally efficiently as you were saying.&nbsp;There&nbsp;isn&#8217;t&nbsp;a computational bottleneck. 
And&nbsp;so&nbsp;they created this nice overhang where they happen to be the right architecture&nbsp;at&nbsp;the right time to&nbsp;unlock&nbsp;enough&nbsp;flow of information&nbsp;through the model&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>\u2026 that&nbsp;we could get through these amazing things.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Let me press you on one thing.&nbsp;Like, you know, in the attention blocks,&nbsp;you can figure out which&nbsp;words&nbsp;or which tokens relate to which tokens.&nbsp;So&nbsp;I put in the prompt and&nbsp;it&#8217;s&nbsp;finding all the relations and then feeding those relations up to, you know, the feedforward layer\u2014well, the feedforward unit within a layer.&nbsp;And you said that knowledge is encoded there, but then what does it really mean for those maps to then access knowledge, but then you project it back into, you know,&nbsp;the output and then feed it up to the attention block in the next layer?&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Again,&nbsp;yeah.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>So&nbsp;it&nbsp;seems&nbsp;kind of weird&nbsp;that&nbsp;I\u2019d&nbsp;be,&nbsp;like,&nbsp;accessing knowledge and then taking that knowledge, merging it,&nbsp;and going back to another attention map.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Well, you can&nbsp;see it as&nbsp;a mixing operation that happens in the feedforward&nbsp;part of the layer.&nbsp;You know, like,&nbsp;you&#8217;re&nbsp;attending, then&nbsp;you&#8217;re&nbsp;mixing,&nbsp;and,&nbsp;kind of,&nbsp;like,&nbsp;reprojecting&nbsp;to&nbsp;some space with&nbsp;higher-information content or,&nbsp;like,&nbsp;a different level of information&nbsp;extraction. 
And then&nbsp;you&#8217;re&nbsp;putting it back into,&nbsp;\u201cOK, so let me do another round of processing\u201d&nbsp;and,&nbsp;kind of,&nbsp;attending and then a mix again.&nbsp;And then I do it again and then I do it again.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;I think that the information that is present in the prompt and in the, you know, that has been baked into the weights&nbsp;gets&nbsp;further and further&nbsp;refined.&nbsp;Whether that refinement is&nbsp;extraction&nbsp;of structure or aggregation into higher-level concepts,&nbsp;I&#8217;m&nbsp;not sure. I think&nbsp;it&#8217;s&nbsp;just structure gets extracted and things that are irrelevant get&nbsp;kind of pushed&nbsp;away. But that&nbsp;doesn&#8217;t&nbsp;necessarily mean that it gets aggregated through the&nbsp;architecture.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:<\/strong>&nbsp;So now&nbsp;I&#8217;m&nbsp;going to try to,&nbsp;like,&nbsp;restate what&nbsp;I think I&nbsp;hear you saying. So,&nbsp;you know,&nbsp;we&#8217;re adding&nbsp;information&nbsp;and&nbsp;we&#8217;re&nbsp;kind of adding&nbsp;information at a higher&nbsp;level but&nbsp;not necessarily throwing away the&nbsp;low-level&nbsp;information, at least&nbsp;that&#8217;s&nbsp;not relevant, right?&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Yeah.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Because, you know, if the&nbsp;higher-level&nbsp;stuff depends on the low-level stuff, I have to have that first.&nbsp;And&nbsp;so&nbsp;then you get to the top of the encoder&nbsp;block&nbsp;and&nbsp;you&#8217;re&nbsp;in the latent space with&nbsp;all of&nbsp;that information&nbsp;kind of maximized. Is that a way to think about it? And if you agree, can you talk about what the encoder block really is and what the latent space&nbsp;is?&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I tend to agree,&nbsp;yes. 
I mean, there is&nbsp;\u2026&nbsp;you&#8217;re&nbsp;describing&nbsp;\u2026&nbsp;I think&nbsp;you&#8217;re&nbsp;describing what I think is happening, which is&nbsp;there is given the context in your prompt and given the task that the model perceives or,&nbsp;like,&nbsp;figures out that&nbsp;you&#8217;re&nbsp;doing, it&nbsp;has to&nbsp;highlight and pull out the relevant information.&nbsp;And it does that not by summarizing layer by layer, but it does it by, you know, increasing the prominence of that information and suppressing other things.&nbsp;So&nbsp;I think&nbsp;that&#8217;s&nbsp;ultimately what&nbsp;happens up to the point where you reach this beautiful point&nbsp;in&nbsp;concept space,&nbsp;which&nbsp;identifies&nbsp;both your intent and the things in the prompt and in the knowledge of the model that are necessary to solve it.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:<\/strong>&nbsp;And&nbsp;so&nbsp;one last question, and then I want to go to&nbsp;Subutai&nbsp;for a second.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So now when we go through the decoder stack, are we just going the other way and stripping out the high-level concepts early and then getting down to the granular tokens?&nbsp;Or, you know&nbsp;\u2026&nbsp;because you go up through the encoder stack,&nbsp;those attention blocks and feedforward layers,&nbsp;to get to that magical latent space.&nbsp;And now&nbsp;we&#8217;re&nbsp;going to go the other direction. How do you think about that other direction through the decoder stack, which is the same&nbsp;primitives&nbsp;as the encoder stack?&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Same primitives.&nbsp;You can think of it as kind of&nbsp;the reverse&nbsp;operation. Like you, you never lost information throughout. You just kind of suppressed or&nbsp;privileged&nbsp;different kinds of information. And now&nbsp;you&#8217;re&nbsp;basically just&nbsp;projecting it back out to a space that is, you know, intelligible. 
And it&#8217;s,&nbsp;kind of,&nbsp;where the model gets&nbsp;its&nbsp;\u2026&nbsp;I hesitate to use the term&nbsp;<em>reward<\/em>&nbsp;because it has a particular implication, but that&#8217;s,&nbsp;kind of,&nbsp;where the loss gets computed and then gets pushed back through the model.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right, as&nbsp;you&#8217;re&nbsp;trying to evolve and train&nbsp;<em>all<\/em>&nbsp;those parameters\u2014the relationship between words, the information in the feedforward layers, the design of that latent space,&nbsp;and the extraction of the knowledge from it.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>That&#8217;s&nbsp;right. And&nbsp;so&nbsp;in an encoder-decoder model, you push through the whole&nbsp;thing,&nbsp;you decode back to a particular token, which for people who&nbsp;don&#8217;t&nbsp;know, it&#8217;s,&nbsp;like,&nbsp;literally a&nbsp;number out of a vocabulary, like&nbsp;word&nbsp;No.&nbsp;487. And if it was word&nbsp;No.&nbsp;1,500, you get, you know, like, \u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Something else.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>\u2026 a bad&nbsp;reward.&nbsp;Yeah.&nbsp;Yeah. And then&nbsp;\u2026&nbsp;and if you&nbsp;got&nbsp;it right, you get a positive signal&nbsp;that&nbsp;then just flows back through the model.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>I&#8217;d&nbsp;like to go over to&nbsp;Subutai&nbsp;now.&nbsp;So&nbsp;after hearing this,&nbsp;you&#8217;ve&nbsp;studied, you know, neuroscience and the neocortex and cortical columns and all of this for a long time, and you and I have had lots of debates.&nbsp;Is the human brain doing something different than that? You know, are we just building latent spaces, then extracting?&nbsp;The architecture&nbsp;is&nbsp;very different, but&nbsp;what&#8217;s&nbsp;going on under the hood?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah, the architecture is&nbsp;very different. 
You know, as&nbsp;Nicol\u00f2&nbsp;was describing what happens throughout a transformer stack, I was trying to relay and relate, you know, what we know in the brain,&nbsp;as well.&nbsp;&nbsp;<\/p>\n\n\n\n<p>In a typical, you know, transformer model,&nbsp;there is,&nbsp;at the end of the day, there is a single latent space from which the next token is&nbsp;output. That does not happen in the brain.&nbsp;There are thousands and thousands of latent spaces that are,&nbsp;sort of,&nbsp;collaborating together,&nbsp;if you will.&nbsp;&nbsp;<\/p>\n\n\n\n<p>You know, a lot of what we publish is under the&nbsp;moniker&nbsp;the Thousand Brains Theory of Intelligence. And Jeff&nbsp;[Hawkins]&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.numenta.com\/resources\/books\/a-thousand-brains-by-jeff-hawkins\/\" target=\"_blank\" rel=\"noopener noreferrer\">published a book a few years ago on that<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
And that,&nbsp;kind of,&nbsp;dates back to&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/scispace.com\/pdf\/the-columnar-organization-of-the-neocortex-fyv7837bo3.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">discoveries in neuroscience from the&nbsp;\u201960s and&nbsp;\u201970s by the&nbsp;neuroscientist&nbsp;Vernon Mountcastle<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, who was a professor at Johns Hopkins.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yup.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>And&nbsp;what he discovered&nbsp;\u2026&nbsp;he made this remarkable discovery that, you know, our neocortex, which is the biggest part of our brain\u2014that&#8217;s where all intelligent function happens\u2014is actually&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.hachettebookgroup.com\/titles\/jeff-hawkins\/a-thousand-brains\/9781541675797\/?lens=basic-books\" id=\"https:\/\/www.hachettebookgroup.com\/titles\/jeff-hawkins\/a-thousand-brains\/9781541675797\/?lens=basic-books\" target=\"_blank\" rel=\"noopener noreferrer\">composed of roughly&nbsp;100,000&nbsp;what&nbsp;he&nbsp;called&nbsp;cortical columns<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:<\/strong>&nbsp;Right.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>And each cortical column is&nbsp;maybe&nbsp;50,000&nbsp;neurons.&nbsp;And&nbsp;there&#8217;s&nbsp;a very complex&nbsp;microcircuit and microarchitecture&nbsp;between the neurons in a cortical column.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But then&nbsp;there&#8217;s&nbsp;100,000 of them, and every part of your brain\u2014whether&nbsp;it&#8217;s&nbsp;doing visual&nbsp;processing, auditory processing, language, thought, motor actions\u2014they&#8217;re&nbsp;all composed of this, essentially,&nbsp;this&nbsp;same microarchitecture. 
And this was a remarkable discovery. It&nbsp;says that&nbsp;there&#8217;s&nbsp;a universal&nbsp;architecture.&nbsp;It&#8217;s&nbsp;not a simple one.&nbsp;It&#8217;s&nbsp;complex. But&nbsp;it&#8217;s&nbsp;repeated throughout the brain.&nbsp;<\/p>\n\n\n\n<p>And&nbsp;that&#8217;s&nbsp;where this,&nbsp;you know, the idea of the&nbsp;Thousand&nbsp;Brains&nbsp;\u2026&nbsp;each of these cortical columns is actually a&nbsp;complete sensory-motor processing system. It has inputs;&nbsp;it has outputs.&nbsp;It&#8217;s&nbsp;getting sensory input.&nbsp;It&#8217;s sending&nbsp;outputs to motor systems. And&nbsp;it&#8217;s&nbsp;building, in our theory, complete world models. So there&nbsp;isn&#8217;t&nbsp;a single latent space.&nbsp;There&#8217;s&nbsp;thousands of these latent spaces.&nbsp;<\/p>\n\n\n\n<p>And each little cortical column is trying to understand its little bit of the world. You know, one cortical column might be getting,&nbsp;at the lowest level,&nbsp;maybe one&nbsp;degree of visual information from the top&nbsp;right-hand&nbsp;corner of your retina. Another one might be focusing on specific frequencies in the auditory range. You know, each one has its own little view of the world, and&nbsp;it&#8217;s&nbsp;building its own little world model.&nbsp;<\/p>\n\n\n\n<p>And then they all collaborate together.&nbsp;There&#8217;s&nbsp;no top or bottom here.&nbsp;There&#8217;s&nbsp;no homunculus in the brain. Everything is&nbsp;sort of equal. And&nbsp;they&#8217;re&nbsp;all simultaneously collaborating and voting and coming up to, you know, what is the, you know, consistent interpretation of&nbsp;all of&nbsp;these sensory inputs that&nbsp;we&#8217;re&nbsp;getting? 
What is the single consistent, you know, concept, if you will, and,&nbsp;based on that, make the motor actions that are most relevant to that.&nbsp;<\/p>\n\n\n\n<p>So&nbsp;it&#8217;s&nbsp;a&nbsp;sensory-motor&nbsp;loop.&nbsp;It&#8217;s&nbsp;a, you know, it&#8217;s a&nbsp;constantly recurring system;&nbsp;we\u2019re&nbsp;constantly making predictions. As we discussed earlier, you know, we are constantly learning. Every cortical column is constantly updating its connections, constantly updating its weights.&nbsp;It&#8217;s&nbsp;building and incrementally improving its world model constantly.&nbsp;So&nbsp;it&#8217;s&nbsp;a massively distributed, you know, set of&nbsp;processing elements that we call cortical columns that are,&nbsp;they&#8217;re&nbsp;all equal,&nbsp;operating&nbsp;in parallel.&nbsp;<\/p>\n\n\n\n<p>So&nbsp;I think there&nbsp;are similarities,&nbsp;for sure,&nbsp;between them. But at least the way I described it, I think&nbsp;it&#8217;s&nbsp;very different&nbsp;in its operation than what I understand today\u2019s&nbsp;LLMs&nbsp;to be.&nbsp;I&nbsp;don&#8217;t&nbsp;know if you agree with that or not.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Yeah, I \u2026<strong>&nbsp;<\/strong>To better understand,&nbsp;I had a&nbsp;question,&nbsp;which is, are these cortical columns relying on the fact that these are essentially multiple views of the same process and those multiple views, like,&nbsp;the, you know, the part of the sensory input that gets allocated or subdivided, is it happening at the same time point?&nbsp;So&nbsp;in other words,&nbsp;if you&nbsp;could artificially&nbsp;delay by some time&nbsp;<em>t<\/em>&nbsp;some cortical columns with respect to the rest, would the learning suffer?&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yes, absolutely.&nbsp;Yeah.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>And&nbsp;so&nbsp;in other words, how important is it that it&#8217;s,&nbsp;kind of,&nbsp;on the same 
schedule?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:<\/strong>&nbsp;[LAUGHS]&nbsp;Yeah, I mean,&nbsp;that&#8217;s&nbsp;another&nbsp;\u2026&nbsp;I mean,&nbsp;LLMs&nbsp;today, you know, you get your input,&nbsp;one layer&nbsp;processes&nbsp;it, then the next, then the next, and the other layers are not&nbsp;operating.&nbsp;In the brain,&nbsp;it\u2019s&nbsp;not like that. Everything is&nbsp;operating&nbsp;in parallel asynchronously. And this is important.&nbsp;They&#8217;re&nbsp;constantly trying to make predictions and so on.&nbsp;So&nbsp;if you were to artificially slow down some of your cortical columns, you would absolutely suffer. Your thinking would absolutely suffer.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>I wanted to interject here just because this is where&nbsp;\u2026&nbsp;this discussion is where, you know, I got <em>super<\/em>&nbsp;interested in the difference and then spent a bunch of time with&nbsp;Subutai&nbsp;to learn from him. So if I think about my skin,&nbsp;you know, which is an organ, you know,&nbsp;as I understand it, there&#8217;s a cortical column attached to&nbsp;each patch of my skin and the size of that patch,&nbsp;kind of,&nbsp;corresponds to the nerve density there.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>That\u2019s&nbsp;right.&nbsp;Yeah.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>So&nbsp;in my brain,&nbsp;there is a set of cortical columns that are skin sensors, and I could actually&nbsp;\u2026&nbsp;if I numbered all the cortical columns in the brain, I could draw a map on my skin and say,&nbsp;\u201cThis&nbsp;is&nbsp;No.&nbsp;72 in this patch.&nbsp;This is&nbsp;No.&nbsp;73 in this patch.\u201d&nbsp;Now&nbsp;are&nbsp;human cortical columns,&nbsp;like,&nbsp;better than, say, what we see in a mouse?&nbsp;And,&nbsp;of course, this is&nbsp;a leading&nbsp;question because I know the answer.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>[LAUGHS]&nbsp;Yeah. 
So,&nbsp;yes, it,&nbsp;you know, cortical columns in your sensory areas, primary sensory areas,&nbsp;each, you know, pay attention&nbsp;to&nbsp;or get input from a, you know, some patch of your skin somewhere on your body. And&nbsp;there&#8217;s&nbsp;many more cortical columns associated with your fingertips than, you know, a square centimeter of your back, for example.&nbsp;So&nbsp;there&#8217;s definitely, you know, areas of sensory information that we pay a lot more attention to and devote a lot&nbsp;more physical resources&nbsp;to.&nbsp;&nbsp;<\/p>\n\n\n\n<p>In terms of a mouse and humans,&nbsp;it&#8217;s&nbsp;pretty remarkable&nbsp;that the cortical columns&nbsp;\u2026 so all mammals have cortical columns;&nbsp;all mammals have a neocortex. All mammals have cortical columns from a mouse all the way up to humans.&nbsp;And mice have cortical columns that are&nbsp;very, very similar&nbsp;to what a human has.&nbsp;It&#8217;s&nbsp;not identical. There&nbsp;<em>are<\/em>&nbsp;differences.&nbsp;But by and large,&nbsp;the&nbsp;architecture of a cortical column&nbsp;in&nbsp;a mouse is, you know,&nbsp;very, very similar&nbsp;to cortical columns in humans. Human cortical columns are bigger.&nbsp;There are&nbsp;more neurons, and&nbsp;there&#8217;s&nbsp;more detail&nbsp;there,&nbsp;but essentially,&nbsp;it&#8217;s&nbsp;the same.&nbsp;And \u2026&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:<\/strong>&nbsp;Maybe just scaled up a little bit.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah. So evolution basically discovered this structure\u2014that it&#8217;s really excellent for processing information and dealing with it\u2014and then through, you know, very fast in evolutionary time, basically figured out that if you could scale up the number of cortical columns, you get more intelligent animals. 
And&nbsp;that&#8217;s&nbsp;what happened&nbsp;very, very fast&nbsp;evolutionarily.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I&nbsp;didn&#8217;t&nbsp;know about the unevenness of cortical columns present. Like, this is not&nbsp;\u2026&nbsp;I&#8217;m&nbsp;not a neuroscientist, and so this is interesting because one of the biggest&nbsp;frustrations&nbsp;with many modern architectures of models is that they&nbsp;deploy a constant amount of computation no matter what the input is.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So I go through the same number of layers whether I&#8217;m trying to predict the word&nbsp;\u201cdog\u201d&nbsp;after&nbsp;\u201cthe\u201d&nbsp;or whether I&#8217;m trying to solve, like, give the final answer to a very complicated math question or, you know, whether a theorem was proven or not in the prompt.&nbsp;And so&nbsp;that&#8217;s&nbsp;interesting because,&nbsp;like,&nbsp;some current instantiations&nbsp;of modern architecture&nbsp;actually&nbsp;deploy&nbsp;\u2026&nbsp;try to cluster things together such that you have a constant amount of information that you then push together through the model.&nbsp;[LAUGHTER]&nbsp;And so maybe like on my fingertips, I need more processing than I need on my elbow because, like, you know&nbsp;\u2026&nbsp;and so this,&nbsp;kind of,&nbsp;makes sense.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Nicol\u00f2&nbsp;is being&nbsp;humble. 
He was working on this problem two years ago and told me about&nbsp;it.&nbsp;It was one of the things I learned from&nbsp;you&nbsp;that made me think differently.&nbsp;So \u2026&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I just like to refer to&nbsp;<em>people<\/em>&nbsp;are working on this &#8230;&nbsp;[LAUGHS]&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Random average people who are not all necessarily brilliant AI scientists.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So the prediction part of this, though, is really what&#8217;s fascinating to me, because, again, something else Subutai&nbsp;and&nbsp;I discussed many years ago, you know, if I&#8217;m,&nbsp;like,&nbsp;moving my finger towards the table&nbsp;and&nbsp;\u2026&nbsp;my brain is making predictions because I have a world model. It&nbsp;knows&nbsp;a&nbsp;table is there. And the cortical columns&nbsp;representing&nbsp;that patch of skin, as&nbsp;it&#8217;s&nbsp;getting closer,&nbsp;they&#8217;re&nbsp;starting to predict that&nbsp;I&#8217;m&nbsp;going to feel something that feels&nbsp;<em>like<\/em>&nbsp;the table. And,&nbsp;yup,&nbsp;there;&nbsp;I hit it.&nbsp;Prediction met.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But if I touched it and it felt&nbsp;really icy&nbsp;cold or&nbsp;super hot&nbsp;or fluffy or not there\u2014I pass&nbsp;through it\u2014I&#8217;d&nbsp;get a flurry of activity because the prediction&nbsp;wouldn&#8217;t&nbsp;match the world model, and that&#8217;s where learning would happen.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Subutai,&nbsp;does that sound like the right model and intuition?&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah,&nbsp;that&#8217;s&nbsp;definitely&nbsp;a very important&nbsp;component&nbsp;of it.&nbsp;We&#8217;re&nbsp;constantly making predictions. 
And&nbsp;as you said,&nbsp;you know,&nbsp;you&#8217;re moving your right fingertip down; you know,&nbsp;perhaps you&#8217;ve never sat in this room before or, you know, seen this table before, you would still have a prediction, a very good prediction of it.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah.<strong>&nbsp;<\/strong>Because you know&nbsp;what a table is.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>You know&nbsp;what a table is. And if it&nbsp;was&nbsp;different, you would, you&nbsp;know,&nbsp;you would notice it right away. But if your left hand, which you&nbsp;weren&#8217;t&nbsp;paying attention to, also felt icy cold, then you would notice that,&nbsp;as well.&nbsp;So&nbsp;you&#8217;re actually making not just one prediction; you&#8217;re&nbsp;making thousands and thousands of predictions constantly about&nbsp;&#8230;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Every cortical&nbsp;column.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Every cortical column is making predictions. And if something were anomalous, highly anomalous, you would notice it.&nbsp;So&nbsp;this is something, you&nbsp;know,&nbsp;we&nbsp;don&#8217;t&nbsp;often realize;&nbsp;we&#8217;re&nbsp;making very, very granular predictions&nbsp;<em>constantly<\/em>. And when things are wrong, we do learn from it.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And the other interesting thing\u2014and this is, again, possibly different from how&nbsp;LLMs&nbsp;work\u2014&nbsp;you know, if I were to tell you to touch the, you know, the bottom surface of the table, you could without,&nbsp;again, without looking at the table or opening your eyes, you would be able to move your finger in and touch the bottom of your&nbsp;table because you have a, you know, set of reference frames that relate to&nbsp;&#8230;&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yup&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>There you go. 
Yep.&nbsp;You&#8217;re&nbsp;able to do it.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>I did it!&nbsp;Yeah.&nbsp;Amazing.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Even though you&nbsp;maybe never&nbsp;have&nbsp;been in this&nbsp;room;&nbsp;maybe&nbsp;you\u2019ve&nbsp;never seen this table before. It&nbsp;doesn&#8217;t&nbsp;matter.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>I\u2019ve&nbsp;been in&nbsp;this room&nbsp;because&nbsp;we had&nbsp;to&nbsp;prep&nbsp;for&nbsp;the podcast&nbsp;series.&nbsp;But I&nbsp;didn&#8217;t&nbsp;touch the underside of the table,&nbsp;that&#8217;s&nbsp;for sure.&nbsp;[LAUGHS]&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah, exactly.&nbsp;[LAUGHS]&nbsp;So, you know, we know where things are&nbsp;in&nbsp;relation to each other, where our body is in relation to everything, and we can very, very rapidly learn. And again, if the bottom part of the table&nbsp;was&nbsp;anomalous, you would notice it and potentially remember that.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I&#8217;m&nbsp;not going to lie.&nbsp;I was expecting you to find something under&nbsp;that&nbsp;table,&nbsp;[LAUGHTER]&nbsp;like a&nbsp;talk show.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Or chewing gum&nbsp;or something.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong><em>And if you reach under the table,&nbsp;you&#8217;re&nbsp;going to find a copy of my paper.&nbsp;<\/em>[LAUGHS]&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>[LAUGHS]&nbsp;You know, if I&nbsp;was&nbsp;smarter&nbsp;and&nbsp;better prepared,&nbsp;that&#8217;s&nbsp;exactly what&nbsp;would have&nbsp;happened.&nbsp;But, sorry, guys.&nbsp;&nbsp;<\/p>\n\n\n\n<p>I think you&nbsp;told me something,&nbsp;Subutai,&nbsp;you know,&nbsp;that&nbsp;\u2026 and&nbsp;I&#8217;ll&nbsp;give a little bit of preamble.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So, you know, the brain has these&nbsp;dendritic networks&nbsp;in each neuron,&nbsp;and they form&nbsp;synapses.&nbsp;And so a 
neuron fires,&nbsp;and that, you know, the&nbsp;axon&nbsp;of the neuron that&#8217;s firing will propagate a signal through the synapses, which might do a little signal processing to the dendrites of the downstream neurons,&nbsp;and those downstream\u2014the&nbsp;dendrites can then prime the neuron to fire.&nbsp;That&#8217;s&nbsp;one of the fundamental mechanisms. And&nbsp;it&#8217;s&nbsp;the formation of those synapses, you know, between the upstream and downstream neurons, the dendrites,&nbsp;that&nbsp;seems to&nbsp;be&nbsp;the basis of learning,&nbsp;and to me, that feels a little bit like an attention map.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yes.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>So&nbsp;maybe the&nbsp;dendritic&nbsp;network is doing something akin to self-attention, and we have some work&nbsp;going on&nbsp;in that direction at MSR.&nbsp;But the thing you told me was that your brain is actually forming an incredibly large number of synapses&nbsp;speculatively.&nbsp;In&nbsp;some sense, sampling the world when something happens in case it&nbsp;will recur.&nbsp;You&nbsp;know,&nbsp;it&#8217;s&nbsp;a&nbsp;more&nbsp;\u2026&nbsp;maybe&nbsp;it&#8217;s&nbsp;a version of&nbsp;Hebbian&nbsp;learning, right?&nbsp;You know, things that fire together, wire together.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Exactly.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>But then if that pattern&nbsp;doesn&#8217;t&nbsp;recur, then they get pruned. And&nbsp;I\u2019m&nbsp;just&nbsp;going to, you know, what is the fraction of your synapses that get turned over every&nbsp;three&nbsp;or&nbsp;four&nbsp;days, you know, ballpark?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>OK.&nbsp;Yeah, I remember this. 
This&nbsp;was&nbsp;an absolute mind-blowing&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.jneurosci.org\/content\/35\/36\/12535\" target=\"_blank\" rel=\"noopener noreferrer\">study in&nbsp;[The Journal of]&nbsp;Neuroscience<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;So,&nbsp;you know, the way a lot of learning happens in the brain is by adding and dropping connections.&nbsp;<\/p>\n\n\n\n<p>In AI models,&nbsp;it&#8217;s&nbsp;usually strengthening, you know, high-precision floating-point number, making it higher&nbsp;or&nbsp;lower. But&nbsp;you&#8217;re&nbsp;not adding and dropping connections.&nbsp;The connections are always\u2014in fact,&nbsp;everything is fully connected,&nbsp;right,&nbsp;between layers. And&nbsp;so&nbsp;in the brain,&nbsp;you&#8217;re&nbsp;always adding and dropping connections.&nbsp;That&#8217;s&nbsp;a fundamental mechanism by which we learn,&nbsp;<em>one<\/em>&nbsp;of the fundamental mechanisms.&nbsp;&nbsp;<\/p>\n\n\n\n<p>What&nbsp;I&nbsp;read in&nbsp;this&nbsp;study is that they&nbsp;looked at adult mice&nbsp;and adult&nbsp;animals,&nbsp;and what they found is that they would look at the number of synapses that were connected&nbsp;over the course of a couple of months\u2014and they were able to trace individual synapses in this particular part of the brain\u2014and what they found is that&nbsp;every four days, 30% of the synapses that were there were no longer there four days from now. And there was a new 30%. And&nbsp;there&#8217;s&nbsp;a huge number&nbsp;of connections that are constantly being added and constantly being pruned. 
And my theory of&nbsp;what&#8217;s&nbsp;going on there is that&nbsp;we&#8217;re&nbsp;always speculatively&nbsp;trying to learn things.&nbsp;<\/p>\n\n\n\n<p>So, you know,&nbsp;there&#8217;s&nbsp;all sorts of random coincidences and things that we are exposed to on a&nbsp;day-to-day&nbsp;basis.&nbsp;We&#8217;re constantly forming connections there because we don&#8217;t know what&#8217;s actually going to be required and what&#8217;s real and what&#8217;s random.&nbsp;Most of&nbsp;it&#8217;s&nbsp;random; most of&nbsp;it&#8217;s&nbsp;not necessary.&nbsp;And the stuff that actually is necessary will&nbsp;stay on.&nbsp;But&nbsp;we&#8217;re&nbsp;constantly trying to learn.&nbsp;<\/p>\n\n\n\n<p>This is a part of continuous learning&nbsp;that&#8217;s&nbsp;often not appreciated, I think, is that&nbsp;we&#8217;re&nbsp;constantly forming new connections, and then we prune the stuff that we&nbsp;don&#8217;t&nbsp;need.&nbsp;In&nbsp;an AI&nbsp;model, if you were to do that, it would just go, I&nbsp;don&#8217;t&nbsp;know, it would&nbsp;go bananas. [LAUGHTER]&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Well, so&nbsp;let&#8217;s&nbsp;double-click&nbsp;on that.&nbsp;So&nbsp;when you told me that,&nbsp;the way&nbsp;I&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>This is mind-blowing,&nbsp;this&nbsp;30%.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>It\u2019s&nbsp;crazy.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:<\/strong>&nbsp;Your brain is going to be&nbsp;totally different&nbsp;a few days from now.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>It&#8217;s&nbsp;so&nbsp;mind-blowing.&nbsp;When you told me that,&nbsp;I spent some time processing it, so&nbsp;a&nbsp;whole bunch of synapses were created&nbsp;and&nbsp;destroyed during that time.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But it just made me think that we have, you&nbsp;know,&nbsp;we have all of these columns getting all of this input continuously. 
You know,&nbsp;eyes,&nbsp;hearing, smell, taste, skin, heat,&nbsp;and then,&nbsp;you know, interactions with people,&nbsp;and then planning and experiences,&nbsp;just at every level.<strong>&nbsp;<\/strong>And&nbsp;they&#8217;re&nbsp;constantly sampling all this noise coming in&nbsp;and&nbsp;basically filtering&nbsp;out&nbsp;the noise.&nbsp;It&#8217;s&nbsp;like,&nbsp;kind of,&nbsp;like&nbsp;a&nbsp;low-pass filter.&nbsp;But when something statistically significant&nbsp;recurs,&nbsp;it&#8217;s&nbsp;going to lock and then become persistent.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah,&nbsp;yeah,&nbsp;I think so.&nbsp;There&#8217;s&nbsp;so much&nbsp;that&#8217;s&nbsp;happening,&nbsp;and&nbsp;you\u2019re&nbsp;constantly learning,&nbsp;and, you know, when you touch a hot stove or something,&nbsp;there&#8217;s&nbsp;a flood of&nbsp;dopamine&nbsp;specific to those areas that caused these synapses to strengthen very, very quickly. You know, most of these&nbsp;synapses that are learned are very, very weak synapses.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yup.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>And so,&nbsp;yeah, you know, when you look&nbsp;\u2026&nbsp;in this study, they also quantified the turnover&nbsp;in,&nbsp;kind of,&nbsp;strong synapses versus weak synapses. And&nbsp;it&#8217;s&nbsp;comforting to know that the strong synapses stay there.&nbsp;It&#8217;s&nbsp;really these weak synapses&nbsp;that&nbsp;are constantly added and dropped. And then&nbsp;some of them&nbsp;will become&nbsp;strong.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Now I want to go back&nbsp;\u2026&nbsp;return to&nbsp;Nicol\u00f2, but with an observation.&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;when&nbsp;I&#8217;m&nbsp;training a transformer,&nbsp;it&#8217;s&nbsp;also a&nbsp;prediction-based&nbsp;system. 
You know,&nbsp;I&#8217;m&nbsp;running&nbsp;\u2026&nbsp;I have my input in the training set;&nbsp;I have my masked token or the next token&nbsp;I&#8217;m&nbsp;trying to predict.&nbsp;I run it through. I look at how successfully did it make that prediction,&nbsp;and the worse it was,&nbsp;the,&nbsp;sort of,&nbsp;the steeper the error, you know, I drive back through the network.&nbsp;So, you know, if it&#8217;s&nbsp;spot-on, I don&#8217;t learn very much. But if the prediction is&nbsp;way off,&nbsp;I&#8217;ve&nbsp;got to change a bunch of stuff. That sounds analogous to what Subutai was just describing&nbsp;with&nbsp;the cortical columns.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>No,&nbsp;that&#8217;s&nbsp;right.&nbsp;I mean,&nbsp;with, I don&#8217;t know, with one big pet peeve of mine in pretraining,&nbsp;in particular around pretraining these&nbsp;language&nbsp;models.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>OK.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>So again, for context,&nbsp;like,&nbsp;language&nbsp;models in particular, but, you know, many other<strong>&nbsp;<\/strong>instantiations of&nbsp;large&nbsp;models,&nbsp;are trained in a few phases&nbsp;usually.&nbsp;One of them is pretraining,&nbsp;where you have some ground truth text and you remove,&nbsp;let&#8217;s&nbsp;say, just the last word, and then you ask the model to predict the last word.&nbsp;And&nbsp;that&#8217;s&nbsp;when you get that loss.&nbsp;Do&nbsp;you get the word right? 
Do&nbsp;you get the word wrong?&nbsp;&nbsp;<\/p>\n\n\n\n<p>One of the big problems that I have is that, you know, in human experience, we do not get&nbsp;feedback&nbsp;every single&nbsp;thought.&nbsp;&nbsp;<\/p>\n\n\n\n<p>The problem with language models, the way we are training&nbsp;them,&nbsp;at least in pretraining,&nbsp;is that they do a thing called&nbsp;teacher forcing.&nbsp;So&nbsp;they guess the word, then they get&nbsp;immediately&nbsp;the signal, and then the right word gets filled in, and then they predict the next one.&nbsp;<\/p>\n\n\n\n<p>So&nbsp;when you go through,&nbsp;like,&nbsp;a passage of text, you constantly get&nbsp;this&nbsp;reward. And&nbsp;it&#8217;s&nbsp;such a bizarre way to train&nbsp;a&nbsp;model.&nbsp;It&#8217;s&nbsp;necessary because you want&nbsp;a&nbsp;lot of flow&nbsp;of supervision.&nbsp;Like,&nbsp;you want,&nbsp;like,&nbsp;a lot of supervision to&nbsp;essentially use&nbsp;all the computation available. But at the same time, it actually makes the models arguably a little bit worse than&nbsp;what they&nbsp;would be if you had enough&nbsp;compute&nbsp;to train them without this.&nbsp;<\/p>\n\n\n\n<p>I went on a tangent just because&nbsp;it&#8217;s&nbsp;a pet peeve.&nbsp;[LAUGHS]&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>It&#8217;s&nbsp;a&nbsp;really important point, though, because your goal when you&#8217;re training a model is to get to your loss&nbsp;target with the minimal cost and time.&nbsp;Or,&nbsp;of course, like,&nbsp;fixed budget and,&nbsp;like,&nbsp;lowest loss target.&nbsp;<\/p>\n\n\n\n<p>But, you know, biological systems,&nbsp;also,&nbsp;their goal is survival with energy minimization. And so,&nbsp;like,&nbsp;once you&#8217;ve&nbsp;built a world model that works, right, like touching the table, touching the underside of the table\u2014nope, still nothing exciting there\u2014like,&nbsp;it takes&nbsp;very little&nbsp;energy to do that. 
And&nbsp;I think a tragedy&nbsp;is that we all have these supercomputers in our heads. You know, the&nbsp;neocortex is what, about&nbsp;10&nbsp;watts? And&nbsp;it&#8217;s&nbsp;this amazing thing, right, that&nbsp;can compose symphonies. But once we have a world model, a lot of us just stop learning because&nbsp;it&#8217;s&nbsp;comfortable,&nbsp;right.&nbsp;You&nbsp;don&#8217;t&nbsp;have to perturb the state.&nbsp;You can go through&nbsp;\u2026 and, you know, I mean, how many of us go through every day and&nbsp;all of&nbsp;our predictions succeed&nbsp;[LAUGHTER], and&nbsp;there&#8217;s&nbsp;no surprises, you know?&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;all the new synapses get swept away,&nbsp;right.&nbsp;That&#8217;s&nbsp;not a goal of pretraining because then&nbsp;you&#8217;re&nbsp;just wasting energy. But&nbsp;we&#8217;re&nbsp;trying to minimize energy consumption.&nbsp;So&nbsp;it does feel,&nbsp;kind of,&nbsp;aligned to me in some sense.&nbsp;<\/p>\n\n\n\n<p>So I&#8217;ve got a straw man I want to hit you with, but before we do,&nbsp;Nicol\u00f2, I want you to talk about your view on compression,&nbsp;like LLMs&nbsp;as&nbsp;compressors, because I know this is something you&#8217;re very passionate about and&nbsp;opinionated about. And&nbsp;I&#8217;ve&nbsp;learned a lot from you on this,&nbsp;too.&nbsp;<\/p>\n\n\n\n<p>And then,&nbsp;Subutai,&nbsp;after this,&nbsp;I&#8217;d&nbsp;like to hear your biological response. I mean, your response from a biological perspective.&nbsp;[LAUGHTER] And \u2026&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>You&#8217;ll&nbsp;get&nbsp;both.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>That&#8217;s&nbsp;right, of course. And then I want to try&nbsp;\u2026&nbsp;I want to throw out this hybrid straw man. 
So, Nicol\u00f2, tell us about compression.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>The view is that&nbsp;basically the&nbsp;generative models are compressors in&nbsp;an&nbsp;information theoretic sense, and so trying to&nbsp;come up with&nbsp;a better generative model is equivalent to trying to&nbsp;find the best compressor for some data. And&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Now when you say&nbsp;compressor, do you mean&nbsp;lossless&nbsp;or&nbsp;lossy?&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I mean&nbsp;lossless.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>OK.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>You can basically look at literally my&nbsp;much-maligned&nbsp;objective function that you use for pretraining, which is, you know,&nbsp;next-token&nbsp;prediction, and&nbsp;you can basically draw a complete&nbsp;parallel to what you would do if you were trying to come up with the, you know,&nbsp;try to do compression, which is coming up with the shortest possible code&nbsp;for something that you&#8217;re trying to compress.&nbsp;<\/p>\n\n\n\n<p>And so the two things are the same, and it,&nbsp;kind of,&nbsp;fits into a broader picture that, you know, like, goes back to&nbsp;Occam&#8217;s razor&nbsp;and&nbsp;Kolmogorov complexity&nbsp;and&nbsp;Solomonoff&#8217;s&nbsp;principle of induction, which is, you want short descriptions&nbsp;for likely things that happen in the world and you want&nbsp;your&nbsp;algorithm that produces those short descriptions to be also short.&nbsp;That&#8217;s&nbsp;the&nbsp;minimum&nbsp;description length principle.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And I do feel like it fits in,&nbsp;kind of,&nbsp;also what you were saying about the concept of you&nbsp;have&nbsp;a good world model, why look for surprise? 
Because it simultaneously affects both terms, both the algorithm,&nbsp;like your own world model, but also the loss that you incur when something unexpected happens.&nbsp;<\/p>\n\n\n\n<p>And&nbsp;so&nbsp;if&nbsp;I&#8217;m&nbsp;an&nbsp;agent in the world trying to minimize&nbsp;the&nbsp;minimum description length&nbsp;of the world,&nbsp;I\u2019d&nbsp;like to go and seek some in-distribution data such that I&nbsp;don&#8217;t&nbsp;bump up my surprise&nbsp;term&nbsp;too much.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right. And I think you said at some point that, you know, when I&#8217;m training a model, even though you took the same loss point, you know, between&nbsp;Model&nbsp;A and&nbsp;Model B, if I have a steeper loss curve in&nbsp;Model&nbsp;A than&nbsp;Model&nbsp;B, you know, it&#8217;s getting to a better,&nbsp;sort of,&nbsp;compressed-based vocabulary faster, which makes it more general.&nbsp;The shape of that curve matters from a compression perspective.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Yeah. I mean, I think it would help here to expand on what I was talking about in terms of,&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yes. Please.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>\u2026&nbsp;like,&nbsp;minimum description length principle.&nbsp;The&nbsp;minimum description length principle&nbsp;is&nbsp;basically the&nbsp;loss of the model&nbsp;you&#8217;re&nbsp;training;&nbsp;that&#8217;s&nbsp;one&nbsp;component. And&nbsp;so&nbsp;it&#8217;s&nbsp;a sum over the mistakes you make at predicting or, you know, the&nbsp;mistakes&nbsp;you&nbsp;make at&nbsp;predicting&nbsp;each word.&nbsp;And&nbsp;that&#8217;s&nbsp;one term. 
And the other term is how long it takes you&nbsp;in&nbsp;code to describe the model and the training&nbsp;procedure,&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>\u2026&nbsp;to get to that training curve,&nbsp;to produce that training curve.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>So,&nbsp;yes, if you&nbsp;look at&nbsp;collectively,&nbsp;one term is,&nbsp;kind of,&nbsp;fixed.&nbsp;It&#8217;s&nbsp;an&nbsp;amount of code&nbsp;it&nbsp;would take you to write out&nbsp;a&nbsp;language model, for instance,&nbsp;in code.&nbsp;Like,&nbsp;literally&nbsp;implement&nbsp;it,&nbsp;<em>not the weights<\/em>, just implement the initialization of it&nbsp;and then the training&nbsp;loop. And then on the other&nbsp;side,&nbsp;you have this training loss that gets generated as you start&nbsp;observing&nbsp;data. And,&nbsp;of course,&nbsp;because&nbsp;it&#8217;s&nbsp;a sum,&nbsp;you&nbsp;want to minimize really&nbsp;the area,&nbsp;like,&nbsp;you want to minimize the sum.&nbsp;And so, like,&nbsp;a flatter curve is much better than,&nbsp;like, the steeper curve, you know,&nbsp;even&nbsp;if it ends up at the end to be slightly better.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah.&nbsp;Concave is better than convex.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Among other things,&nbsp;yes.&nbsp;[LAUGHTER]&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Sorry.&nbsp;So, you know, I think that we could do a whole episode on this compression view because it&#8217;s really fascinating.&nbsp;And the&nbsp;lossless&nbsp;part of it is what blew my mind. 
And I think, you know,&nbsp;I&#8217;m&nbsp;guessing there are multiple camps here, and&nbsp;you&#8217;re&nbsp;squarely in one camp, so&nbsp;I&#8217;m&nbsp;guessing&nbsp;we&#8217;ll&nbsp;get a bunch of feedback&nbsp;from the other camps.&nbsp;<\/p>\n\n\n\n<p>So,&nbsp;Subutai, you know, can I think of cortical columns as compressors?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah,&nbsp;it&#8217;s&nbsp;a good question. You know, I, you know,&nbsp;there&#8217;s&nbsp;so much in the compression literature that you can draw insight from. You know, if you look at the representations in cortical columns and that populations&nbsp;that&nbsp;neurons have, you know, some of the things you&nbsp;have to&nbsp;deal with are that the brain doesn&#8217;t&nbsp;have a huge nuclear power plant attached to it.&nbsp;<\/p>\n\n\n\n<p>You know, we only have&nbsp;12&nbsp;watts&nbsp;or so&nbsp;to process everything we want to do,&nbsp;and the representations that evolution has discovered are incredibly sparse.&nbsp;And what that means is that&nbsp;you may have thousands and thousands of neurons in a layer, but only about 1% of them will actually be active at a time.&nbsp;And&nbsp;so&nbsp;it&#8217;s a very small subset of neurons that are actually active.&nbsp;&nbsp;<\/p>\n\n\n\n<p>I&nbsp;don&#8217;t&nbsp;know about this&nbsp;minimum description length, whether that applies. I can say a couple of things about that. There&#8217;s, you know, by and&nbsp;large,&nbsp;the representations are very sparse when&nbsp;you&#8217;re predicting well. 
When you see a surprise,&nbsp;there&#8217;s&nbsp;a burst of activity.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yup.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>When&nbsp;there&#8217;s&nbsp;something&nbsp;that&#8217;s&nbsp;unusual,&nbsp;there&#8217;s&nbsp;a&nbsp;lot more neurons that fire, and &#8230;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>That&#8217;s&nbsp;why learning is&nbsp;<em>tiring<\/em>!&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>That&#8217;s&nbsp;why learning&nbsp;[LAUGHTER]&nbsp;\u2026&nbsp;exactly. No, no,&nbsp;that&#8217;s&nbsp;right,&nbsp;that&#8217;s&nbsp;right.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And&nbsp;so&nbsp;what we think is happening is that, you know, the actual representation of something is&nbsp;a very small&nbsp;number of neurons. When&nbsp;you&#8217;re&nbsp;surprised, there may be many things that are consistent with that surprise, and so your brain&nbsp;represents&nbsp;a union of&nbsp;all of&nbsp;those things at once.&nbsp;<\/p>\n\n\n\n<p>And when you have a very sparse representation, you can actually have a union of many, many different things without getting confused.&nbsp;So&nbsp;that&#8217;s&nbsp;what we think is going on there.&nbsp;So&nbsp;it is a very compressed, very efficient representation. 
And because&nbsp;it&#8217;s&nbsp;such a small percentage of neurons that are firing, we are very, very&nbsp;parsimonious&nbsp;in how we&nbsp;represent&nbsp;things and extremely energy efficient&nbsp;metabolically.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>I wanted to get to the efficiency point,&nbsp;but before I do, you know, you talk about this&nbsp;1, you know, 1 to 2% of the neurons firing.&nbsp;But it&#8217;s,&nbsp;actually,&nbsp;the brain is actually much sparser than that&nbsp;at a fine grain, right?&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yes, yes.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Because,&nbsp;you know,&nbsp;you have 1% of the neurons firing, but they&nbsp;aren&#8217;t&nbsp;connected to all the other neurons in the region.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>That&#8217;s&nbsp;right.&nbsp;Yeah.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>So really the sparsity should be the product of the connectivity fraction times the activity factor.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah.&nbsp;Yeah.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right.&nbsp;That&#8217;s&nbsp;about&nbsp;one&nbsp;out of 10,000. Something like that.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Exactly.&nbsp;Yeah.&nbsp;So&nbsp;something like maybe&nbsp;1%&nbsp;of the neurons are firing at any point in time, and maybe 1% of the connections that are possible are actually there at any point in time.&nbsp;So&nbsp;it&#8217;s&nbsp;a very, very small, you know, subnetwork through this massive network&nbsp;that&#8217;s&nbsp;actually being&nbsp;activated, a tiny percentage of neurons going through a very, very tiny piece of the full network.&nbsp;<\/p>\n\n\n\n<p>You know,&nbsp;it&#8217;s&nbsp;common to, you know, some people say,&nbsp;\u201cOh,&nbsp;we&#8217;re only using 1%&nbsp;of our brain.\u201d&nbsp;That&#8217;s&nbsp;not true. 
It just means at any point in time,&nbsp;you&#8217;re&nbsp;only using 1%, but at other points in time,&nbsp;a different&nbsp;1%&nbsp;is being used. So, you know, the activity does move around&nbsp;quite a bit.&nbsp;But,&nbsp;any&nbsp;point in time,&nbsp;it&#8217;s&nbsp;extremely small.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:<\/strong>&nbsp;So,&nbsp;OK, the sparsity, I think, you know, the representation\u2014how the brain is doing this compression biologically\u2014is super fascinating. And I want to go on a little bit of a detour now to efficiency. So&nbsp;I remember in 2017 when in MSR we were building, you know, hardware acceleration for RNNs.&nbsp;<\/p>\n\n\n\n<p>And then the&nbsp;transformer&nbsp;hit,&nbsp;and they were&nbsp;optimized, you know, to be highly parallelizable across this quadratic attention map for GPUs. The way I would describe it is that that transition to semi-supervised training moved us from an era&nbsp;when&nbsp;we were really&nbsp;data limited, like you had to have good&nbsp;high-quality labeled data,&nbsp;to you were compute limited.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And when that transition happened, we&nbsp;hockey-sticked&nbsp;from,&nbsp;\u201cI&#8217;m building faster machines but I&#8217;m limited by data\u201d&nbsp;to the bigger machine I can build,&nbsp;as&nbsp;long as I have enough, you know, unlabeled data of high quality, the better I can do with the model. And so we went on the supercomputing arms race,&nbsp;and now we&#8217;re&nbsp;building these, like, just gargantuan machines.&nbsp;<\/p>\n\n\n\n<p>And really,&nbsp;we&#8217;ve kind of been&nbsp;brute-forcing&nbsp;it. 
I mean,&nbsp;we&#8217;ve&nbsp;done a lot of things to&nbsp;optimize,&nbsp;like quantization, you know, and&nbsp;other&nbsp;and, you know, a better process&nbsp;node, you know, a better, more efficient tensor unit design.&nbsp;But&nbsp;to&nbsp;first order,&nbsp;we&#8217;ve&nbsp;been training bigger models by building bigger systems.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And I just wonder,&nbsp;do you think that the brain at&nbsp;this 10 to 12&nbsp;watts&nbsp;in the neocortex just has a fundamentally more efficient learning mechanism?&nbsp;Or do we think that, you know, what&nbsp;we&#8217;re&nbsp;doing in transformers in the most advanced silicon is as&nbsp;efficient,&nbsp;we&#8217;re&nbsp;just building much larger, more capable models?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Oh, I think without a doubt, transformers are extremely inefficient and very, very brute force. We touched on this a little bit earlier in the attention mechanism,&nbsp;where&nbsp;we\u2019re,&nbsp;you know, transformers are essentially comparing&nbsp;every token to every other token. I mean, there are architectures which reduce that, for sure, but&nbsp;it&#8217;s&nbsp;essentially an&nbsp;<em>n<\/em>-squared operation. And&nbsp;we&#8217;re&nbsp;doing this at every layer.&nbsp;<\/p>\n\n\n\n<p>I mean,&nbsp;there&#8217;s&nbsp;nothing like that in the brain. Our processing, you know, in some sense, the context for the very next word&nbsp;I&#8217;m&nbsp;about to say is my entire life, right? And the amount of time I take to&nbsp;take&nbsp;the next word&nbsp;doesn&#8217;t&nbsp;depend on the length of the context at all.&nbsp;It&#8217;s&nbsp;a constant&nbsp;time&nbsp;dependence&nbsp;on context.&nbsp;<\/p>\n\n\n\n<p>So&nbsp;it&#8217;s&nbsp;a significant, you know, reduction in the&nbsp;compute&nbsp;that&#8217;s&nbsp;required.&nbsp;You can&nbsp;kind of think&nbsp;about, like the brain\u2014I&nbsp;think&nbsp;has somewhere around maybe 70 trillion synapses. 
When I say the brain, I mean the neocortex,&nbsp;has&nbsp;about 70 trillion synapses. And&nbsp;it&#8217;s using&nbsp;only 12&nbsp;watts. And&nbsp;a&nbsp;synapse is&nbsp;roughly equivalent to a parameter.&nbsp;<\/p>\n\n\n\n<p>And if you were to take the most efficient GPUs today and try to run a 70 trillion parameter model, it would be something like a megawatt of power.&nbsp;It&#8217;s&nbsp;tens of&nbsp;thous&nbsp;&#8230;&nbsp;it&#8217;s&nbsp;orders of magnitude more inefficient than what our brain is doing.&nbsp;So&nbsp;I&nbsp;absolutely believe&nbsp;that.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>The metric I use,&nbsp;to go back to your point, you know, is, this is something,&nbsp;I think we&nbsp;talked about this back in the day, right? When, you know, after this kicked off for a few years, we were trying to project, like, how far would this go under the current model to inform the research and the directions you took.&nbsp;Which is why I got so interested in sparsity&nbsp;and&nbsp;working with you.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And we would look at a training run and just say, how many joules did it take to train the whole model? How many parameters do we have? And sort of&nbsp;what&#8217;s&nbsp;our parameters per joule? And,&nbsp;if by that metric, you know, we were off by many orders of magnitude where the brain is, but I&nbsp;don&#8217;t&nbsp;know that&nbsp;that&#8217;s&nbsp;the right metric.&nbsp;So&nbsp;any thoughts on that?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah. 
I mean, in some ways, you know,&nbsp;transformers, you know, embody more knowledge in them than any human has.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>It&nbsp;has memorized, you know, the entire internet&#8217;s worth of&nbsp;knowledge,&nbsp;essentially.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>All scientific papers&nbsp;&#8230;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>All scientific papers. You know, good and bad, whatever, you know,&nbsp;it has&nbsp;memorized everything. So&nbsp;that&#8217;s&nbsp;something that, you know, humans just cannot do.&nbsp;So&nbsp;there&#8217;s definitely stuff that&#8217;s better in transformers than humans.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But fundamentally, I think, you know,&nbsp;we&#8217;re&nbsp;extremely efficient in how we process&nbsp;the next token or the next bit of information&nbsp;that&#8217;s&nbsp;coming in.&nbsp;And I think&nbsp;there&#8217;s&nbsp;a lot we can learn from the brain and apply to&nbsp;LLMs and future AI models there.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I was going to ask a question related to that because&nbsp;&#8230;&nbsp;forget memorizing the internet. But let me give you another example that&nbsp;transformers&nbsp;do&nbsp;really well. And&nbsp;I&#8217;m&nbsp;wondering,&nbsp;like, you know, the human aspect of this or the brain aspect of this because&nbsp;transformers, because of the&nbsp;<em>n-<\/em>square computation,&nbsp;they&#8217;re&nbsp;really good&nbsp;at stuff,&nbsp;like a needle in the haystack.&nbsp;<\/p>\n\n\n\n<p>So&nbsp;I can tell you right now, I can speak, I can talk to you, and I can tell you the password&nbsp;is&nbsp;something silly like&nbsp;\u201cpodcast&nbsp;microphone blue,\u201d&nbsp;whatever.&nbsp;That&#8217;s&nbsp;the&nbsp;password. 
And then I can&nbsp;proceed&nbsp;and read the entire&nbsp;<em>Odyssey<\/em>&nbsp;or a bunch of other books to you&nbsp;out loud&nbsp;for the next 5 or 6 hours. And then I can ask the transformer, what was the password?&nbsp;And transformer will do this nice&nbsp;<em>n<\/em>-square computation many times, and it will spit out the password.&nbsp;&nbsp;<\/p>\n\n\n\n<p>A human, you know, there will be a decay of that password. And then at some point,&nbsp;it&nbsp;won&#8217;t&nbsp;remember, and depending on the human, it may be&nbsp;in&nbsp;the first chapter of the&nbsp;<em>Odyssey<\/em>&nbsp;or like at the end, but&nbsp;\u2026&nbsp;so fundamentally the type of computation that is done is&nbsp;very different.&nbsp;So&nbsp;it always makes me wonder about the efficiency&nbsp;because&nbsp;it&#8217;s&nbsp;just,&nbsp;like,&nbsp;it&#8217;s&nbsp;a different type of computation.&nbsp;So&nbsp;the efficiency of&nbsp;\u2026&nbsp;like, efficiency is&nbsp;kind of like, what are&nbsp;you&nbsp;doing divided by how good&nbsp;are you&nbsp;at&nbsp;doing it. And&nbsp;so&nbsp;when the things&nbsp;we&#8217;re&nbsp;doing are so&nbsp;incomparable&nbsp;in many ways, that always makes me&nbsp;&#8230;&nbsp;always troubles me a little bit.&nbsp;I&nbsp;don&#8217;t&nbsp;know&#8230;&nbsp;I&nbsp;don&#8217;t&nbsp;know if&nbsp;there&#8217;s&nbsp;any&nbsp;question&nbsp;in there.&nbsp;[LAUGHTER]&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yeah. I mean,&nbsp;transformers can do the stuff that humans find&nbsp;very, very difficult&nbsp;to do. Absolutely. You know,&nbsp;maybe there&#8217;s&nbsp;a way to get the best of both. I&nbsp;don&#8217;t&nbsp;know. 
You know, I don&#8217;t know that it&#8217;s fundamentally necessary to have such&nbsp;brute-force computation to get all of these features.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>That&#8217;s&nbsp;right.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah.&nbsp;Yeah, it is a weird thing because, you know, this is why memory palaces work so well.&nbsp;Like, there is a way, though, for a human to remember that my microphone is gray.&nbsp;It&#8217;s&nbsp;not actually blue, Nicol\u00f2.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Mine is&nbsp;blue. You&nbsp;don&#8217;t&nbsp;see it.&nbsp;It&#8217;s&nbsp;off camera. You see, your world model \u2026&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>It\u2019s&nbsp;off camera.&nbsp;Yeah, I know. I was just teasing you.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But&nbsp;there&#8217;s&nbsp;a way, like, if I can just connect it to enough things, get that connectivity graph, then&nbsp;I&#8217;ll remember it because&nbsp;it&#8217;s&nbsp;captured the signal out of the noise and connected to enough things I can retrieve it.&nbsp;And retrieval&nbsp;would&nbsp;be a whole other topic&nbsp;we&nbsp;don&#8217;t&nbsp;have time to get into today.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But I do&nbsp;\u2026 now,&nbsp;I want to go to the straw man.&nbsp;So&nbsp;let&#8217;s take continual&nbsp;learning off the table.&nbsp;Let&#8217;s&nbsp;imagine that,&nbsp;as I go through my day,&nbsp;I&#8217;m&nbsp;just saving&nbsp;all of&nbsp;the sensory data to put in my training set. 
And now imagine that I take 100,000 little transformer blocks,&nbsp;and&nbsp;I&#8217;m&nbsp;training them each with what&nbsp;they&#8217;re&nbsp;seeing.&nbsp;<\/p>\n\n\n\n<p>OK,&nbsp;I&nbsp;replay the day&nbsp;so I don&#8217;t have to,&nbsp;again, I don&#8217;t have to worry about continuous learning and whatever cross-cortical column, you know, routing feature of the outputs, the inputs, and&nbsp;there&#8217;s\u2014Subutai, we&#8217;ve talked about this\u2014there\u2019s&nbsp;a complex set of wiring there to bring features from here to there that gets learned. If I replicated that,&nbsp;could a transformer block&nbsp;kind of do&nbsp;what the cortical columns are doing?<\/p>\n\n\n\n<p>Could I just instrument all my sensory patches with little transformer blocks and then wire them up in the right way and have it work?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>I think there&#8217;ll&nbsp;be&nbsp;\u2026&nbsp;there&#8217;s still a couple of things we need.&nbsp;One is that cortical columns are fundamentally sensory&nbsp;motor. And&nbsp;so&nbsp;they&#8217;re actually,&nbsp;each&nbsp;one,&nbsp;each cortical column is initiating actions, as well.&nbsp;So&nbsp;you cannot have a static dataset fundamentally ahead of time.&nbsp;It&#8217;s always&nbsp;a&nbsp;dynamic because we&#8217;re&nbsp;constantly making movements to get the next bit of data.&nbsp;And so&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Couldn\u2019t&nbsp;I tokenize that, though?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>I mean, you could tokenize the input and you can tokenize the output, but, you know, if you were to play the same set of inputs back again to a network that&nbsp;\u2026 a cortical&nbsp;column&nbsp;that\u2019s&nbsp;randomly wired differently, it may make a different set of actions. 
And so as soon as it makes the first action that&#8217;s different, that dataset is no longer valid, right?&nbsp;It&#8217;s,&nbsp;you know, there is&nbsp;&#8230;&nbsp;you can&#8217;t fundamentally&nbsp;\u2026&nbsp;you have to have a simulation of an environment rather than a static&nbsp;one-way&nbsp;dataset, if that makes sense.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;I think&nbsp;that&#8217;s&nbsp;one piece that I&nbsp;think\u2019s&nbsp;missing in&nbsp;transformers today,&nbsp;is this,&nbsp;sort of, sensory-motor loop. And then the other piece we talked about&nbsp;is&nbsp;continuous learning.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>I guess you&nbsp;said take it off the table, but&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>It&#8217;s&nbsp;fundamental.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Fundamental&nbsp;\u2026&nbsp;different.&nbsp;Yeah,&nbsp;yeah.&nbsp;And&nbsp;maybe one&nbsp;other difference. We talked, you know, much earlier about a single latent space and the prediction&nbsp;that&#8217;s&nbsp;being made at the top of the transformer that you compute the loss function, and that&#8217;s&nbsp;back-propagated&nbsp;through the transformer. That&#8217;s&nbsp;not how neurons learn. Neurons are making&nbsp;\u2026 every neuron is&nbsp;actually making&nbsp;predictions, and every neuron is getting its input.&nbsp;<\/p>\n\n\n\n<p>And&nbsp;it&#8217;s&nbsp;learning independent of anything that happens at the top. And&nbsp;so&nbsp;it&#8217;s&nbsp;a much more granular learning signal. And information does flow from the top to bottom. 
But&nbsp;there&#8217;s&nbsp;also many, many other sources of information that&nbsp;it&#8217;s&nbsp;learning from.&nbsp;So&nbsp;it&#8217;s&nbsp;different in that sense,&nbsp;as well, mechanistically.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>The reason I ask,&nbsp;and now&nbsp;I&#8217;d&nbsp;like to get into, you know, some of the&nbsp;&#8230;&nbsp;the fun speculation because I&#8217;ve&nbsp;just&nbsp;&#8230;&nbsp;it&#8217;s&nbsp;been a phenomenal discussion with the two. I think we&#8217;ve&nbsp;kind of elucidated&nbsp;the differences. Something I&#8217;ve wondered after I&#8217;ve talked to both of you&nbsp;\u2026 and, you know,&nbsp;Nicol\u00f2,&nbsp;kind of learning about this compression view of the world, lossless compression,&nbsp;and,&nbsp;Subutai, just, you know, the&nbsp;Thousand&nbsp;Brains&nbsp;Theory and these cortical columns and the sampling of, you know, the world to capture the signal that you can learn from.&nbsp;<\/p>\n\n\n\n<p>So&nbsp;let&#8217;s say that I was able to design a really small, efficient digital cortical column.&nbsp;Maybe it&#8217;s&nbsp;transformer-based with some, you know, a sparse representation and some sensory-motor mechanism built in.&nbsp;Maybe it&#8217;s&nbsp;more&nbsp;dendritic-based, you know, mapped into digital hardware.&nbsp;And I put a cortical column on every sensor I have in the world, associated with every person, and wire them up together with some of this and then have a, you know, billions of them that can form higher-level abstractions. Like, what do you think would&nbsp;happen? What could we do?&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>That&#8217;s&nbsp;a fantastic&nbsp;thought exercise, I&nbsp;think&nbsp;[LAUGHS]. 
You know, again, assuming the cortical column is faithful and can generate, you know,&nbsp;or&nbsp;suggest motor actions,&nbsp;as well.&nbsp;I mean, in some sense, you could potentially have a superintelligent&nbsp;system, right,&nbsp;that&#8217;s&nbsp;far more intelligent than anything else on the planet.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Now&nbsp;we&#8217;re&nbsp;scaling the number of cortical columns, you know, not from a mouse, you know, to a hundred thousand columns that a human might have, but potentially billions of cortical columns and way more. And there&#8217;s&nbsp;no reason to think&nbsp;there&#8217;s&nbsp;any fundamental limit there.&nbsp;So&nbsp;this sort of&nbsp;a system&nbsp;is, I think, the way that superintelligent systems will eventually be built.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>But this is&nbsp;a very different&nbsp;direction&nbsp;\u2026&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>It\u2019s&nbsp;a very&nbsp;different&nbsp;&#8230;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>\u2026 than the one&nbsp;we&#8217;re&nbsp;currently headed down with, like, these monolithic models where&nbsp;we&#8217;re doing tons of RL, you know, to capture, you know, to get high-value human collaboration in distribution.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Yes.&nbsp;It&#8217;s&nbsp;completely different&nbsp;than&nbsp;the direction&nbsp;we&#8217;re&nbsp;proceeding.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;I think they, you know, to go down that path, there needs to be a fundamental rethinking of some of our assumptions, potentially even down to the hardware architectures that are necessary to implement it.&nbsp;The, you know,&nbsp;fundamental learning algorithms, the fundamental training paradigm. We talked about, you know, you&nbsp;can&#8217;t&nbsp;have a static dataset.&nbsp;You&#8217;re&nbsp;constantly moving&nbsp;around in&nbsp;the world and doing things. 
So&nbsp;it&#8217;s&nbsp;a very, very different&nbsp;way of going about AI than what&nbsp;we&#8217;re&nbsp;doing today.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Sounds like&nbsp;a great time&nbsp;to be an AI researcher.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Absolutely.&nbsp;[LAUGHTER]&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Nicol\u00f2, what was your reaction to&nbsp;that hypothesis?&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>It sounds super interesting. I mean, my brain was churning. You know, my background is&nbsp;very different. And so,&nbsp;like,&nbsp;I&#8217;m&nbsp;in a much worse position to answer this question.&nbsp;But&nbsp;I was starting to think,&nbsp;OK, so&nbsp;let&#8217;s&nbsp;say I&nbsp;do&nbsp;this. What would be my loss function? What, you know, how&nbsp;would information&nbsp;flow through the system?&nbsp;Like, sounds like cortical columns would each have their own loss&nbsp;that&nbsp;then I&nbsp;would aggregate\u2014and then I would add a contribution that is,&nbsp;like,&nbsp;higher level.&nbsp;<\/p>\n\n\n\n<p>And then back to my question. You know,&nbsp;how is&nbsp;the temporal information coordinated?&nbsp;Because one way to see this is that,&nbsp;you know, the way&nbsp;I&#8217;m&nbsp;coming to understand this is that&nbsp;it&#8217;s&nbsp;kind of like&nbsp;a multi-view framework.&nbsp;<\/p>\n\n\n\n<p>You have the same phenomena&nbsp;represented&nbsp;to multiple independent,&nbsp;but at the same&nbsp;time,&nbsp;views. And&nbsp;so part of&nbsp;me&nbsp;is like it feels&nbsp;like that&nbsp;you need to tie together these cortical columns in such a way that they all get that gradient feedback&nbsp;if&nbsp;you&#8217;re&nbsp;training with&nbsp;gradient-based methods, for instance. 
And so that&#8217;s, kind&nbsp;of,&nbsp;it feels super, super interesting.&nbsp;<\/p>\n\n\n\n<p>It is related to a lot of, you know, very superficially, to a&nbsp;lot of&nbsp;ideas&nbsp;in machine learning around, hey,&nbsp;is it better to have one giant super deep network? Is it better to have a bunch of shallow networks? But the difference is&nbsp;also in&nbsp;the way you train them, right? We typically train this bunch of shallow networks on&nbsp;kind of the&nbsp;same&nbsp;objective&nbsp;and&nbsp;the same data and not typically into an experiential cycle.&nbsp;Whereas this&nbsp;sounds like this is&nbsp;a different way&nbsp;to do it.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right,&nbsp;right.&nbsp;I think&nbsp;\u2026&nbsp;I want to pull this back around to the title of the podcast. And&nbsp;so&nbsp;I&#8217;ll&nbsp;share an observation. You know, so&nbsp;I&#8217;ve&nbsp;been using some of the latest models to code.&nbsp;You know, they&#8217;re getting better really fast.&nbsp;I&#8217;ve&nbsp;been using them to&nbsp;kind of relearn&nbsp;some of the physics that I never really understood deeply.&nbsp;<\/p>\n\n\n\n<p>You know,&nbsp;especially&nbsp;in general relativity, like&nbsp;E=MC<sup>2<\/sup>. Like,&nbsp;why is C in there at all, right?&nbsp;Just stuff like that. Because now it can actually explain it to me, and I can keep&nbsp;beating at&nbsp;it until I understand it,&nbsp;and then,&nbsp;of course,&nbsp;work.&nbsp;<\/p>\n\n\n\n<p>And at\u202fsome\u202fpoint,\u202fI asked the model,&nbsp;\u201cCan you describe how I\u202fthink?\u201d\u202fAnd I was\u202fjust curious. And it, you know, it gave me a page description that my jaw dropped because I said\u202fthis,\u202fthis thing knows me better than I know myself. 
I\u202fdon&#8217;t\u202fthink any human being, including me, could have captured\u202fkind of the\u202fway my approach to learning and my brain works, and I just read it\u202fas,\u202flike, like, yep,\u202fthat&#8217;s\u202fright.\u202fAnd I learned something about myself.\u202f&nbsp;<\/p>\n\n\n\n<p>So\u202fI\u202fwouldn&#8217;t\u202fsay that it passed the Turing test because this is way beyond\u202fTuring\u202ftest. This was like, this thing knows me\u202fway better, you know, than I thought any machine ever could. I mean,\u202fI&#8217;m having a conversation with it. It could be human, but it&#8217;s superhuman. So in some sense,\u202fit&#8217;s like intelligent beyond human capabilities with its ability to discern patterns in how someone&#8217;s interacting.\u202f&nbsp;And yet it&#8217;s a tool.&nbsp;You know,\u202fit&#8217;s\u202fnot\u202fconscious. It\u202fdoesn&#8217;t\u202fhave agency, embodiment, emotion. It understands a lot of that stuff from the training data. But at the end of the day,\u202fit&#8217;s\u202fa stochastic parrot, right? It&#8217;s got, you know,\u202fit&#8217;s got the weights,\u202fand I give it a\u202ftoken,\u202fand it outputs a&nbsp;token. So,\u202flike, are these machines intelligent or\u202fnot?\u202f&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I\u2019ll&nbsp;let&nbsp;Subutai&nbsp;answer first. [LAUGHS]&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>OK.&nbsp;You know, you know, it&#8217;s definitely a savant, right?&nbsp;It knows a huge&nbsp;amount&nbsp;about the world. It&#8217;s&nbsp;absorbed&nbsp;a lot of stuff, and it can articulate that in ways that are&nbsp;just&nbsp;amazing. And, you know, it&#8217;s taken your chat history with, you know,&nbsp;presumably thousands&nbsp;of chats and able to summarize that in a way&nbsp;that&#8217;s&nbsp;remarkable.&nbsp;<\/p>\n\n\n\n<p>At the same time, I think, you know,&nbsp;transformers are not intelligent in the way that a&nbsp;three-year-old is, right? 
A&nbsp;three-year-old&nbsp;human&nbsp;is very curious, is constantly learning. It can learn&nbsp;almost anything. And, you know, a&nbsp;three-year-old&nbsp;Einstein was able to learn and eventually&nbsp;come up with&nbsp;theories that shook the world.&nbsp;That, you know,&nbsp;E=MC<sup>2<\/sup>.&nbsp;<\/p>\n\n\n\n<p>And so, you know, could a transformer do that? I&nbsp;don&#8217;t&nbsp;think so. And&nbsp;so&nbsp;I think&nbsp;there&#8217;s&nbsp;still a difference. There&#8217;s&nbsp;things&nbsp;it&nbsp;can do that are amazing. But there are still basic things that&nbsp;a&nbsp;child can do that transformers cannot do.&nbsp;So&nbsp;I think&nbsp;there&#8217;s&nbsp;still a gap there. Exactly how to articulate it,&nbsp;and how to bridge that gap,&nbsp;is,&nbsp;of course, the&nbsp;trillion-dollar&nbsp;question. But it is bridgeable.&nbsp;And there is a gap today.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right.&nbsp;Nicol\u00f2?&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>You know,&nbsp;I&nbsp;think, from my perspective,&nbsp;they&nbsp;are intelligent. And&nbsp;from&nbsp;my perspective, I go back to the definition of&nbsp;intelligent, which&nbsp;is like, can you achieve your&nbsp;objectives&nbsp;in a variety of environments? It&#8217;s&nbsp;a very basic&nbsp;fundamental, but&nbsp;it&#8217;s&nbsp;kind of, you know, it can be embodied, a form of embodied intelligence,&nbsp;an agentic&nbsp;intelligence.&nbsp;If I plop&nbsp;you&nbsp;in an environment,&nbsp;and I give you&nbsp;an objective, can you achieve it?&nbsp;And&nbsp;the&nbsp;wilder the&nbsp;environment, the harder the task is.&nbsp;&nbsp;<\/p>\n\n\n\n<p>And I do think&nbsp;\u2026&nbsp;I agree with&nbsp;Subutai. 
Like,&nbsp;there is a jaggedness of intelligence we keep describing.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yup.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>Like these things cannot be simultaneously super good, you know, Olympiad-level mathematicians and still give you stupid answers when you&#8217;re trying to, I don&#8217;t know, you know, figure out which cable goes where in your&nbsp;\u2026&nbsp;in your car&#8217;s battery, you know, like, whatever.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>[LAUGHS]&nbsp;Well, then&nbsp;it&#8217;s&nbsp;better than me.&nbsp;I&#8217;m&nbsp;not an&nbsp;Olympiad-level mathematician, and I do stupid stuff all the time.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>I know&nbsp;exactly.&nbsp;Well, you know, whatever that was, that was a bad example. But you&nbsp;get&nbsp;it. But part of it goes back to the compression&nbsp;view. Like, I do believe that intelligence is compression. So the ability to come up with succinct explanations for complex phenomena&nbsp;and even succinct explanations for complex worlds,&nbsp;and then&nbsp;it&nbsp;implies or leads to your ability to operate within them, and the fact that we&nbsp;have&nbsp;these things that they can prove crazy theorems but at the same time&nbsp;fail at&nbsp;fairly rudimentary tasks is a sign that the, yes, transformers are great in terms of inductive biases&nbsp;they put on the world and computation that are great, but we&#8217;re ultimately all subject to the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/ieeexplore.ieee.org\/document\/585893\" target=\"_blank\" rel=\"noopener noreferrer\">No&nbsp;Free&nbsp;Lunch&nbsp;Theorem<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<p>You know, across the&nbsp;world, the set of tasks that you could&nbsp;be pursuing. 
You know, you have certain inductive biases that kind of privilege certain tasks at the expense of others. And there&nbsp;isn&#8217;t, like,&nbsp;a thing yet that&nbsp;has expanded&nbsp;our&nbsp;set of tasks that are addressable. And&nbsp;so&nbsp;I do think that&nbsp;it&#8217;s&nbsp;a matter of rethinking our approach to a few things, whether I think&nbsp;likely both&nbsp;on the architecture front and on the losses and the way we train these systems front.&nbsp;I think there&nbsp;is an opportunity to expand the intelligent frontier of these models. But&nbsp;yeah,&nbsp;from&nbsp;my perspective, they are&nbsp;intelligent&nbsp;already&nbsp;just in a jagged way.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>It&#8217;s&nbsp;such an interesting question, and I know a lot of people write a lot about this, so&nbsp;I&nbsp;don&#8217;t think&nbsp;I&#8217;m&nbsp;treading any new ground here. But, you know,&nbsp;there&#8217;s&nbsp;the diversity of the tasks you can excel at.&nbsp;You know, are you able to handle&nbsp;nuance&nbsp;and understand things deeply?&nbsp;Are you able to learn continuously? Right&nbsp;now, the systems&nbsp;can&#8217;t,&nbsp;right.&nbsp;Are you embodied? I&nbsp;don&#8217;t&nbsp;know if that matters. Do you have&nbsp;an objective? Well, we could give them one. Are you conscious? 
Is that&nbsp;\u2026&nbsp;I mean,&nbsp;that&#8217;s&nbsp;a whole other thing.&nbsp;&nbsp;<\/p>\n\n\n\n<p>So&nbsp;it just feels like&nbsp;there&#8217;s&nbsp;a bunch of check&nbsp;boxes, and&nbsp;we&#8217;ve&nbsp;checked a bunch of them, and a bunch of them are unchecked.<strong>&nbsp;<\/strong>And&nbsp;maybe there&#8217;s&nbsp;no consensus on,&nbsp;like,&nbsp;where that threshold is because there are many dimensions of intelligence,&nbsp;and some of which humans&nbsp;don&#8217;t&nbsp;even have.&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>And&nbsp;that&#8217;s&nbsp;why we have the term AGI and&nbsp;ASI,&nbsp;and people are debating the&nbsp;<em>G<\/em>&nbsp;and the&nbsp;<em>S<\/em>\u2014what is general,&nbsp;what is specialized.&nbsp;So&nbsp;there is,&nbsp;like,&nbsp;it&#8217;s&nbsp;a huge discourse, like,&nbsp;for sure. But&nbsp;that&#8217;s&nbsp;why we had to start&nbsp;characterizing. But if you go back in the definition, going back to my schooling, go back to the definition of intelligence from&nbsp;Plato and Aristotle&nbsp;and&nbsp;Descartes, like,&nbsp;in some sense, you see the&nbsp;goalpost moving through the centuries around what we define as intelligent.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Right.<strong>&nbsp;<\/strong>&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:&nbsp;<\/strong>And I feel like we are still doing it.&nbsp;<\/p>\n\n\n\n<p><strong>BURGER:&nbsp;<\/strong>Yeah.&nbsp;We\u2019ll&nbsp;be&nbsp;doing it for a long time, you know, which in AI velocity is&nbsp;probably another like four&nbsp;or&nbsp;five&nbsp;years.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Hey,&nbsp;I just want to thank you both for the dialogue.&nbsp;You know, I treasure both of you as, you know, intellects and scholars and friends.&nbsp;It&nbsp;was just a joy to nerd out with you all.&nbsp;So&nbsp;thank you both for taking the time.&nbsp;<\/p>\n\n\n\n<p><strong>AHMAD:&nbsp;<\/strong>Thank you so much, Doug, for having 
me.&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>FUSI:<\/strong>&nbsp;Thank you for having us. This&nbsp;was&nbsp;great.&nbsp;<\/p>\n\n\n\n<p>[MUSIC]&nbsp;<\/p>\n\n\n\n<p><strong>STANDARD OUTRO:<\/strong>&nbsp;You\u2019ve&nbsp;been listening to&nbsp;<em>The&nbsp;Shape&nbsp;of&nbsp;Things to&nbsp;Come<\/em>, a Microsoft Research Podcast.&nbsp;Check out&nbsp;more&nbsp;episodes of&nbsp;the podcast&nbsp;at&nbsp;aka.ms\/researchpodcast&nbsp;or on YouTube and major podcast platforms.&nbsp;<\/p>\n\n\n\n<p>[MUSIC FADES]&nbsp;<\/p>\n\n\t\t\t\t<\/span>\n\t\t\t<\/div>\n\t\t\t<button\n\t\t\t\tclass=\"action-trigger glyph-prepend mt-2 mb-0 show-more-show-less-toggle\"\n\t\t\t\taria-expanded=\"false\"\n\t\t\t\tdata-show-less-text=\"Show less\"\n\t\t\t\ttype=\"button\"\n\t\t\t\taria-controls=\"show-more-show-less-toggle-2\"\n\t\t\t\taria-label=\"Show more content\"\n\t\t\t\tdata-alternate-aria-label=\"Show less content\">\n\t\t\t\tShow more\t\t\t<\/button>\n\t\t<\/div>\n\t<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group msr-pattern-link-list is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading h5\" id=\"learn-more-1\">Learn more:<\/h2>\n\n\n\n<ul class=\"wp-block-list list-unstyled\">\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/matching-features-not-tokens-energy-based-fine-tuning-of-language-models\/\" type=\"msr-research-item\" id=\"1163846\">Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models<\/a><br>Publication | March 2026<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.hachettebookgroup.com\/titles\/jeff-hawkins\/a-thousand-brains\/9781541675797\/?lens=basic-books\" target=\"_blank\" rel=\"noopener noreferrer\">A Thousand Brains: A New Theory of Intelligence<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<br>Book |&nbsp;Jeff Hawkins&nbsp;| 
2022&nbsp;<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/thousandbrains.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Thousand Brains Project<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<br>Homepage&nbsp;<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.frontiersin.org\/journals\/neural-circuits\/articles\/10.3389\/fncir.2018.00121\/full\" target=\"_blank\" rel=\"noopener noreferrer\">A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<br>Publication | January 2019\u202f&nbsp;<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.frontiersin.org\/journals\/neural-circuits\/articles\/10.3389\/fncir.2016.00023\/full?ref=highscalability.com\" target=\"_blank\" rel=\"noopener noreferrer\">Why Neurons Have Thousands of Synapses, a Theory of Sequence Memory in Neocortex<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<br>Publication | March 2016&nbsp;<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/us.macmillan.com\/books\/9780805078534\/onintelligence\/\" target=\"_blank\" rel=\"noopener noreferrer\">On Intelligence<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<br>Book | Jeff Hawkins with Sandra Blakeslee | 2005&nbsp;<\/li>\n<\/ul>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Are machines truly intelligent? 
AI researchers Subutai Ahmad and Nicol\u00f2 Fusi join Doug Burger to compare transformer-based AI with the human brain, exploring continual learning, efficiency, and whether today\u2019s models are on a path toward human intelligence.<\/p>\n","protected":false},"author":43868,"featured_media":1166611,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"https:\/\/player.blubrry.com\/id\/153442105","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[240054],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243990],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1163921","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-msr-podcast","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-podcast-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"https:\/\/player.blubrry.com\/id\/153442105","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"guest","value":"doug-burger","user_id":"1168890","display_name":"Doug Burger","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dburger\/\" aria-label=\"Visit the profile page for Doug Burger\">Doug Burger<\/a>","is_active":true,"last_first":"Burger, Doug","people_section":0,"alias":"doug-burger"},{"type":"guest","value":"subutai-ahmad","user_id":"1163932","display_name":" Subutai Ahmad","author_link":"<a href=\"https:\/\/www.linkedin.com\/in\/subutai\/\" aria-label=\"Visit 
the profile page for  Subutai Ahmad\"> Subutai Ahmad<\/a>","is_active":true,"last_first":"Ahmad,  Subutai","people_section":0,"alias":"subutai-ahmad"},{"type":"guest","value":"nicolo-fusi","user_id":"1168897","display_name":"Nicolo Fusi","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/fusi\/\" aria-label=\"Visit the profile page for Nicolo Fusi\">Nicolo Fusi<\/a>","is_active":true,"last_first":"Fusi, Nicolo","people_section":0,"alias":"nicolo-fusi"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-960x540.jpg\" class=\"img-object-cover\" alt=\"The Shape of Things to Come podcast | illustration of Nicolo Fusi, Doug Burger, and Subutai Ahmad\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-655x368.jpg 655w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/Episode-1-Doug-Nicolo-Subutai_TheShapeofThings_Hero_Feature_River_No_Text_1400x788.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dburger\/\" title=\"Go to researcher profile for Doug Burger\" aria-label=\"Go to researcher profile for Doug Burger\" data-bi-type=\"byline author\" data-bi-cN=\"Doug Burger\">Doug Burger<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/subutai\/\" title=\"Go to researcher profile for  Subutai Ahmad\" aria-label=\"Go to researcher profile for  Subutai Ahmad\" data-bi-type=\"byline author\" data-bi-cN=\" Subutai Ahmad\"> Subutai Ahmad<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/fusi\/\" title=\"Go to researcher profile for Nicolo Fusi\" aria-label=\"Go to researcher profile for Nicolo Fusi\" data-bi-type=\"byline author\" data-bi-cN=\"Nicolo Fusi\">Nicolo Fusi<\/a>","formattedDate":"March 23, 2026","formattedExcerpt":"Are machines truly intelligent? 
AI researchers Subutai Ahmad and Nicol\u00f2 Fusi join Doug Burger to compare transformer-based AI with the human brain, exploring continual learning, efficiency, and whether today\u2019s models are on a path toward human intelligence.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1163921","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1163921"}],"version-history":[{"count":35,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1163921\/revisions"}],"predecessor-version":[{"id":1168898,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1163921\/revisions\/1168898"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1166611"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1163921"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1163921"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1163921"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1163921"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1163921"},{"taxonomy":"msr-event-typ
e","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1163921"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1163921"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1163921"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1163921"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1163921"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1163921"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}