Storing digital data in synthetic DNA with Dr. Karin Strauss

By , Senior Principal Research Manager, Microsoft Research AI4Science

Episode 48, October 31, 2018

As traditional semiconductor technologies for computer storage scale down, everyone is looking for alternative solutions to the growing gap between the amount of data we’re capable of producing and the amount of data we’re capable of storing. While some have focused on hardware accelerators for machine learning, and others are investigating new memory technologies, Dr. Karin Strauss, a Senior Researcher at Microsoft Research in Redmond, has been exploring the role of biotechnology in IT via an end-to-end system that stores digital data in DNA.

On today’s podcast, Dr. Strauss talks about life at the intersection of computer science and biology which, for many, is more like the intersection of science fiction and science, and explains how the unique properties of DNA could eventually enable us to store really big data in really small places for a really long time.


Episode Transcript

Karin Strauss: Right after the structure of DNA started to be more well-understood, people looked at it and scratched their heads and thought you know, it’s carrying information about life, why can’t it carry any kind of information?

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: As traditional semiconductor technologies for computer storage scale down, everyone is looking for alternative solutions to the growing gap between the amount of data we’re capable of producing and the amount of data we’re capable of storing. While some have focused on hardware accelerators for machine learning, and others are investigating new memory technologies, Dr. Karin Strauss, a Senior Researcher at Microsoft Research in Redmond, has been exploring the role of biotechnology in IT via an end-to-end system that stores digital data in DNA.

On today’s podcast, Dr. Strauss talks about life at the intersection of computer science and biology, which, for many, is more like the intersection of science fiction and science, and explains how the unique properties of DNA could eventually enable us to store really big data in really small places for a really long time. That and much more on this episode of the Microsoft Research Podcast.

Host: Karin Strauss, welcome to the podcast.

Karin Strauss: Thanks for having me.

Host: You’re a senior researcher at MSR and an affiliate professor in the Department of Computer Science and Engineering at the University of Washington. You situate your research at the intersection of computer science, systems and biology. Which is a fascinating intersection.

Karin Strauss: That’s right. It’s been very exciting.

Host: How would you describe the work you do? In general terms, what gets you up in the morning?

Karin Strauss: The multi-disciplinarity of the work is really exciting. We are studying how to store digital data in synthetic DNA, and really sort of bringing science-fiction into reality, so it’s sort of a dream come true.

Host: Let’s talk about this idea of the end of Moore’s Law. What do you think people need to understand about Moore’s Law that kind of gives us a framework for why you’re doing what you’re doing?

Karin Strauss: Moore’s Law is, let’s say, not a law but an observation that Gordon Moore has made, that you can shrink transistors at a certain rate. Every certain number of years, you have a doubling in the number of transistors. So, it’s essentially this observation that you get more transistors into the same area of a processor and so, you can do more with that over time. That’s based on silicon technology and fabrication methods that are really miniaturizing the devices that are built with silicon, therefore giving us more of those devices. It’s improving the technology so that we can do more with the same amount of material, the same amount of silicon.

Host: So, there’s a point where it’s too small. You can’t do anymore with it?

Karin Strauss: Precisely. We’re starting to get to that point. Some would argue that we’ve already gotten to that point, depending how you look at it. The structures that we’re manufacturing are already pretty small. We’re at the nanoscale. And in some cases, depending on the parts of the structures and the devices we’re fabricating, it’s just a layer of a few atoms that are being built there. And so, it’s getting quite hard, and also less profitable, from the scaling perspective, to really use these devices.

Host: All right. So why can’t you just – this is one of those crazy questions – but why can’t you just say sorry guys, we’re done. We’re as small as we get.

Karin Strauss: Well, there are still more technology tricks that we can play and as long as we can play those tricks, we will, but it’s getting harder, right? It’s getting harder, it’s getting more expensive. And so, in addition to really continuing the process to the extent we can, people are looking at alternatives to it. It’s getting harder, so we need to be smarter.

Host: So, we’re tasked with creating new technologies. And by we, I mean you. So, talk for a minute, just for context, about the history of data storage. I mean, not like a box in a cave but digital data. What have been our technologies and how has it evolved to where we are now? And then we’ll talk about DNA in a second.

Karin Strauss: Sure. Let me start with the motivation and the trend which is the data we are capable of generating is growing exponentially. And even though the devices that can store the data are also growing exponentially, it’s different exponentials. And so, the gap between what we are creating and what we can store is growing. And that’s the trend we’re seeing that is concerning. So, if we just follow the industry – if today we’re capable of storing about 30% of the information we generate, in only ten or twelve years we’ll be able to store about 3%, and then in ten more years, about half of a percent of everything that we generate. So, that’s a concerning trend. We think that we need a radical new solution to address that problem. Now, in terms of the storage evolution, right, we’ve had magnetic devices like tape and hard drives. And in fact, you know, my first computer had 35 megabytes of storage.

Host: Right? I remember those days.

Karin Strauss: So, this is storage technology and then we have memory technology. So, for example, DRAM, that’s electronic. And most recently, electronic and silicon storage being developed in Flash and SSDs. So, all of these different technologies, they’re based on creating structures that can then hold a bit, and that’s sort of what they have in common, but the scaling of the storage technology is also based on making these structures smaller. And so, we’re also hitting limits on the storage, not only transistors, but also storage technologies are hitting the same limits of miniaturization.
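
Editor's note: the shrinking share of storable data quoted earlier in this answer (about 30% today, about 3% in ten or twelve years, about half a percent ten years after that) can be checked with a little arithmetic. The sketch below assumes the stored fraction shrinks by a constant factor each year, which is what two exponentials with different rates imply. The function names are hypothetical, and the figures are only the rounded ones from the conversation.

```python
def yearly_ratio(start_fraction: float, end_fraction: float, years: float) -> float:
    """Constant per-year factor that takes start_fraction to end_fraction."""
    return (end_fraction / start_fraction) ** (1.0 / years)


def project(fraction: float, ratio: float, years: float) -> float:
    """Stored fraction after `years` more years at the same per-year factor."""
    return fraction * ratio ** years


if __name__ == "__main__":
    # The transcript says "ten or twelve years", so try both.
    for years in (10, 12):
        ratio = yearly_ratio(0.30, 0.03, years)
        later = project(0.03, ratio, 10)
        print(f"{years} years: yearly factor {ratio:.3f}, "
              f"ten years on {later * 100:.2f}% storable")
```

Under this assumption the second decade lands between roughly 0.3% and 0.45%, in the same range as the "about half of a percent" mentioned above.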

Host: Let’s talk about DNA, or deoxyribonucleic acid. You are working on putting data, digital data, on DNA. As a level set, what is DNA from a biological and chemical perspective?

Karin Strauss: All right. So, DNA is a big long molecule. It’s essentially a chain of what we call bases and that’s what we describe as A, T, C and G. Those are called monomers that, put together into a chain, make up DNA. Each side of the double-helix is made of complementary bases. So, A complements with T on the other side, and C complements with G. From an information and storage perspective, we only need to look at one of the sides, the other is redundant. Because there’s a direct correspondence between A to T, C to G.
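
Editor's note: the base pairing described above (A with T, C with G) is why one strand determines the other. A minimal sketch, with hypothetical function names, is shown below. It also includes the reverse complement, since the two strands of a double helix run in opposite directions and the partner strand is conventionally written that way; that detail is not from the conversation.

```python
# Pairing rule from the conversation: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    strand = "ATCG"
    print(complement(strand))          # TAGC
    print(reverse_complement(strand))  # CGAT
```

Applying `complement` twice returns the original strand, which is the sense in which the second side carries no extra information.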

Host: Okay. So, if I’m going back to my biology class, I remember DNA being framed as the building blocks of life, the molecules that make up our genome…

Karin Strauss: So, DNA carries the information, your genetic information. And in fact, it doesn’t just store information, it actually performs many of the – along with enzymes – it performs many of the functions that are needed for life.

Host: So, okay that’s a thing, it exists. It’s already loaded with instructions to make me and you and everybody else and every thing else, right? Living organisms have DNA. How did this idea of, “Hey we can put digital data on there!” come up? When did that happen and what was the scientific thinking behind that?

Karin Strauss: So, the idea actually dates back to the 60s. And it was right after the structure of DNA started to be more well-understood, people looked at it and scratched their heads and thought you know, it’s carrying information about life, why can’t it carry any kind of information? So, one could use DNA for that except at that time we didn’t have the technology to fabricate DNA or to read DNA… not at reasonable speeds.

Host: Before we get into the technical weeds, why are you interested in using DNA for storage? What’s the premise behind it?

Karin Strauss: We’re very excited about DNA for at least three of its properties. The first one is density. So instead of really storing the bits into devices that we have to manufacture, we are really looking at a molecule, storing data in a molecule itself. And so, a molecule can be a lot smaller than the devices we’re making. Just to give you an example, you could store the information, today stored in a datacenter, one exabyte of data, into a cubic inch of DNA. So that’s quite tiny. Durability is the next interesting property of DNA. And so, DNA, if preserved under the right conditions, can keep for a very long time, which is not necessarily possible with media that’s commercial today. I think the longest commercial media is rated for thirty years. That’s tape. Still tape! DNA, if encapsulated in the right conditions, has been shown to survive thousands of years. And so, it’s very interesting from a data preservation perspective as well. And then, one other property is that, now that we know how to read DNA, we’ll always have the technology to read it. So now we’ll have those readers… (if we don’t have those readers, we have a real problem) …but we’ll have those readers forever, as long as there is civilization. So, it’s not like floppy disks that are in the back of a drawer just gathering dust. We’ll really have technology to read it.

(music plays)

Host: So, let’s get into the technical weeds on your current work. What’s the science behind DNA storage? There’s a workflow that starts, as I understand, with binary code and then it gets crazy from there. Tell us the workflow behind the science.

Karin Strauss: That’s right. So, if we go back to the structure of DNA, it’s the chain of the different bases, A, T, C and G. And so, the way to think about them is, they’re a sequence of these bases. And the way to think about bits is, digital data is essentially a sequence of bits. And so, the science behind it starts with translating those bits into bases. So, a very simple way to think about that is A corresponds to zero, zero. C corresponds to zero, one. G to one, zero, and T to one, one. And so, if we have a sequence of bits, we’ll take every two bits and translate it into a base. We use a lot more sophisticated methods, but that’s the first step.
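
Editor's note: the simple two-bits-per-base mapping described above (A for 00, C for 01, G for 10, T for 11) can be sketched in a few lines. This is only the illustrative first step; as noted in the answer, the real system uses more sophisticated methods. The function names are hypothetical.

```python
# Mapping from the conversation: A = 00, C = 01, G = 10, T = 11.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_bases(data: bytes) -> str:
    """Translate each byte into four bases, two bits at a time."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_bases; expects a length that is a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4 bases")
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = bytes_to_bases(b"Hi")
    print(strand)  # CAGACGGC
    assert bases_to_bytes(strand) == b"Hi"
```

Each byte becomes four bases, so a strand is four times as long in bases as the data is in bytes under this naive scheme.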

Host: So, that leads to the next question. You’ve got the binary code translated into DNA code. Then what do you do with it?

Karin Strauss: So, we know which sequences of DNA we want. Now, there’s a process of manufacturing the DNA and there’s a process where multiple chemicals are flowed and the DNA sort of grows, right? And so, we know which sequences need to be grown and those sequences are grown from a surface. So, once we grow the DNA, we’ll remove it from where it was grown, and we’ll encapsulate it. So, encapsulation can be done in a number of different ways. One of the types of encapsulation we’ve been looking at is encapsulation developed by ETH Zurich. They were inspired by the fact that DNA survives in bone and wanted to do something that had similar properties but that was easier to handle rather than having to grow bone. And so, what they developed was a type of chemistry that will encapsulate the DNA in glass. It’s actually silicon dioxide. And they developed nano-particles that, then, the DNA gets attached to and then a layer of glass is grown around it. And so, that keeps it away from water and UV light, which are things that degrade the DNA, and when the temperature goes higher, it protects it from really degrading.

Host: So, what happens if I want to access my data, then? Let’s say I gave you a video and you say okay I’m going to put it on DNA and I’m going to store it. And I say Karin, I want my video back, what do you do?

Karin Strauss: When your video was stored, it was stored in a certain location. So, it’s just organized in a spatial way, right, so there’s some way to retrieve the actual molecules encapsulated. The next step is to de-encapsulate them, so we need to remove all that glass that was added for stability and extract the DNA. Once that’s done, and that’s the first part of a random-access process, it’s a hierarchical process. First, you physically find the smaller set of DNA molecules you are interested in, but then, within that, there are many molecules that may belong to different movies that you’ve stored, and you’re just interested in one movie, you don’t want to read the whole collection. Right? And reading the whole collection would actually be wasteful. And so, we would like the ability to further select that particular movie you want to watch. And it turns out that there’s a process to do that. We do it chemically. And so, it’s just another reaction that comes from nature, actually, and is repurposed for this goal, for this purpose.

Host: So, the paper you wrote about it that explains how you do that, can you give us an executive summary of how you’ve gone about that? That chemical process?

Karin Strauss: The chemical process is really borrowed from the biotech industry. It’s a pretty standard process called Polymerase Chain Reaction and it’s the process that copies DNA. However, you can use it in a way that will just copy sequences that have a certain sequence at the end. And so, we select the right sequences at the end and then use an enzyme to do the DNA copy. Now that enzyme is selective. It will only copy pieces of DNA that have a little bit of a double helix. And so, we store the DNA as a double helix. To do the process, we separate the two sides of the helix and then only attach a little tail to each of the sides so that that tail matches the object that we’re interested in reading.

Host: OK.

Karin Strauss: And so, the enzymes will only copy the sequences that have that specific tail. So, we can select which of the molecules are copied. Now, the other molecules storing the rest of the videos are still there, but in much smaller quantity. PCR, this Polymerase Chain Reaction, actually copies the DNA exponentially, and so we can very quickly get to a lot of copies of the data we’re interested in and just a few copies of other data we’re not interested in so that when we sample to read that information, most molecules we’re sampling are molecules that we’re interested in.
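
Editor's note: as a rough illustration of the selective copying described above, the sketch below models a pool of strands grouped by a tag that stands in for the tail the primer matches. It assumes ideal doubling of matching strands on every cycle, which real PCR only approximates. The names `amplify` and `target_fraction` and the tags are hypothetical, not taken from the project.

```python
def amplify(pool: dict[str, int], target_tag: str, cycles: int) -> dict[str, int]:
    """Return a new pool where only strands with target_tag double each cycle."""
    if target_tag not in pool:
        raise KeyError(f"no strands carry tag {target_tag!r}")
    amplified = dict(pool)  # leave the input pool untouched
    amplified[target_tag] = pool[target_tag] * 2 ** cycles
    return amplified


def target_fraction(pool: dict[str, int], target_tag: str) -> float:
    """Chance that a randomly sampled strand belongs to target_tag."""
    return pool[target_tag] / sum(pool.values())


if __name__ == "__main__":
    # Three stored videos, ten copies of each strand group to start.
    pool = {"video_a": 10, "video_b": 10, "video_c": 10}
    after = amplify(pool, "video_a", cycles=10)
    print(f"before: {target_fraction(pool, 'video_a'):.3f}")
    print(f"after:  {target_fraction(after, 'video_a'):.3f}")
```

The other videos stay in the pool at their original counts, but after a handful of cycles nearly every sampled strand belongs to the selected one.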

Host: What’s the process of identifying a molecule?

Karin Strauss: The technology to read DNA is sort of indirect reading, if you will. It’s indirect observation of the molecules. And so, there are a number of different technologies to do that. Some of them use optical solutions and so they multiply the molecules and then make them glow in different colors, just based on the sequences that they are made of. There’s another technology that is electronic. And so, you drag the DNA through a nanopore – it’s literally called a nanopore because it’s a pore that’s nano-size – and the DNA goes through and it causes some current disturbances through that nanopore, that are then read and sensed and then you can tell what the sequence of DNA is. It really does sound science-fiction, but it works.

(music plays)

Host: So most of us understand megabytes, gigabytes, even terabytes, because we have some frame of reference. But when we hit petabyte, exabyte, zettabyte and I just read the new one yottabyte, 10 to the 24th power, it’s just mind-boggling territory. So, here’s the funny part, in your research, it was big news that you stored 200 megabytes on strands of DNA. So, I think you are well past that now, like around 800 megabytes?

Karin Strauss: Yes.

Host: But it gives us some indication of how hard it is to do what you do. I mean, to be excited about 200 megabytes. I remember when I had a 200-megabyte hard drive but that was in the 80s. Right? So, so what are the challenges that researchers face? If 200 megabytes is a big deal, that indicates that there’s still some challenges to getting to scale.

Karin Strauss: That’s right. So, the challenges are really improving the speeds at which we can read the DNA. So, we don’t need very low latency. We can wait for the information to come out. But for it to be a practical solution for storage, we want it to give us a high rate of bytes per second. Those rates today are so low, and that’s why the 200 megabytes is a big deal: it is, you know, the best we can do with technology today. But just to put it in perspective, just one year or two years earlier, in our lab, we were working with 200 kilobytes of data which is a thousand times smaller. And so, we’ve been experiencing improvements already just by doing the research. All of these technologies had to go through this milestone to get to full deployment today. So, we’re on the way there.

Host: You are on track.

Karin Strauss: Yes. Back to the challenges. So, getting the throughputs up is a challenge, and then lowering costs. So, DNA manufacturing today is still quite costly. But for both of these challenges, they sort of go hand in hand: if you get the speed up, you also get the cost down. We see no fundamental, physical reason why you couldn’t really scale it to the level of being acceptable or being suitable for DNA data storage.

Host: So, what are the roadblocks then, getting to that speed?

Karin Strauss: One of the big roadblocks is really scaling the structures but also automating the process. So, writing DNA and reading DNA today is automated, but there are many steps in-between. So, for example, the encapsulation and the preparation, the random access that we were talking about, those are not automated yet. We really need to automate the whole process. So, automation for us is also a big challenge and a big deal.

Host: And you are working on that.

Karin Strauss: Yes, we’re working on that as well.

Host: We talked about quite a few factors that make this more viable. And I would imagine, since we’re talking about computer science involved in the biology and systems science, that algorithms play a role here. Could you talk a little bit about what’s going on in that arena?

Karin Strauss: Yes, absolutely. So, in fact, algorithms do play a big role here and we have a fantastic team of coding theorists working here at Microsoft Research on this problem and on the project itself. So, they developed algorithms that really reduced the effort to recover the data from DNA. One of the big contributions there was to encode the data in a way that, once we read it on the way out, we need to process minimal amounts of information to really recover the data. So, that was also a big contribution in that paper.

Host: What’s the ultimate goal of DNA storage? What’s the aspiration? If you succeeded beyond your wildest dreams, what would success look like to you?

Karin Strauss: Success would look like everyone in the world has access to DNA data storage. And so, really at Microsoft, our mission is to empower every person and organization to achieve more. With DNA, we would empower every person and organization to store more!

Host: Maybe they will have that as one of the tag lines. But right now, where are we in that quest?

Karin Strauss: We are just starting, right? So, we’re looking at really wrapping our heads around how to build an end-to-end system that will allow us to achieve that goal. We’re first targeting archival storage applications in the datacenter and then we’ll see where it goes. We think that that’s probably the path of least resistance for DNA data storage and we think that there may be improvements in the future that will allow us to target different scenarios as well.

Host: So, that’s interesting because there are different markets and different reasons for people storing things, right? And so archival storage would be more like, “I don’t need this right away and I can tolerate a gap” versus “I need this on my computer because I have to deliver it to my boss tomorrow morning.” How does DNA play in that realm right now?

Karin Strauss: Yeah, so, we look at DNA, at least currently or initially, as an archival storage technology. And so, in addition to the fact you don’t need it immediately, you need to store it for a long time and, hopefully, not taking too much space, right? So, that’s what we see as the great match between the DNA technology and the needs of archival storage.

Host: You were featured in Fast Company Magazine as one of the hundred most creative people in business in 2016. Which is super cool. But something in that write-up caught my eye. They projected that we’d hit 16 zettabytes of data by 2017, which was last year. I imagine we’re past that now… The premise behind all this is that we need to preserve the world’s data. My question then is, is everything we’re doing worth preserving? When do we cross the line to being people on that TV show Hoarders where everyone feels sorry for people who can’t throw anything away. And to be honest, I’m kind of like that person, digitally. I have 32,000 photos on my phone. Well, it’s in the cloud, right? Let’s be honest. I kind of feel like I need help, more than a bigger box.

Karin Strauss: Yes. So, there is a lot of information we don’t need to store. Even though I’d argue that digital archaeologists of the future would love to have access to all your 32,000 pictures.

Host: No, they wouldn’t! Most of them are really bad.

Karin Strauss: But they would really understand how we live today.

Host: Cultural exploration.

Karin Strauss: Cultural exploration. That’s right. And that’s why we still study past civilizations, right? We want to understand how they lived, and, you know, how life was at that time and learn from that, right? But also, you know, there are pieces of data that we are, today, throwing away that may be very helpful, actually. We talk to customers, we talk to different segments of the market, who would like to store more information. I just came back from a meeting at the Library of Congress. They would love to store more information if it weren’t that expensive, right, to do it.

Host: I actually love that answer because my question was real. And I think what you just said is an application – or a reason – for exploring this new technology. The bigger question would be how, then, will future digital archeologists sort through everything to make discoveries and inferences and things?

Karin Strauss: That’s a great question. So, there’s a whole field of machine learning in AI that’s evolving today. There’s quite a lot of advances happening in the field right now. They do help us sort through all the data that we’re generating. There’s a limit to that, too. So, that is really, I think, a question that we’ll have to do a lot more research to really understand how to deal with it. I’d say the field of data organization is also a pretty interesting one. If we already have structure on the data, then we might as well preserve that. But we need storage to preserve that information as well, so…

Host: Talk about the people you are working with. You have some really interesting partners both organizationally and individually and it’s bringing some really interesting diversity to your team that I think probably benefits both the work and the industry.

Karin Strauss: Yes, absolutely. So, the DNA Data Storage Project, at Microsoft, we’ve been collaborating with the University of Washington since the beginning of the project. And we also work with partners like Twist Bioscience and we have a collaboration with ETH Zurich. But overall, since it’s such a multidisciplinary project, we need people with different backgrounds, offering different perspectives. So, we have a very diverse team. And one of the interesting things is that, once you start a project like this, people are using different vocabularies and there is a little more work that goes into starting to communicate well, and as a team, but once that initial obstacle is overcome, things happen so much better just because people are offering different perspectives. There’s a lot of diversity, not only personally, but also how people are thinking, and everybody is learning from each other. We have all the way from coding theorists to computer architects and engineers working on mechanics and molecular biologists. And so completely different backgrounds. People from all over the world just working in an exciting, multidisciplinary project and learning from each other. That makes it super exciting.

Host: Let’s talk about you for a minute. What got you interested in what you are doing, personally, and how did you end up at Microsoft Research?

Karin Strauss: Well, I am a forever learner. I think what led me to get my PhD was really, I didn’t want to stop learning, and so I made it kind of my profession. So, that’s what led me, you know, to keep pursuing more and more degrees until I ran out of degrees to pursue. Maybe an MBA someday. But, being a researcher allows me to continue learning and, in fact, learning and then putting together things that I’ve learned from different areas and different perspectives into the same project, and as I mentioned earlier, making science-fiction into reality, it sort of couldn’t be more exciting.

Host: How did you end up at Microsoft Research?

Karin Strauss: How did I end up here… (laughter). I don’t know!

Host: Well you had to have had a job interview at some point, right?

Karin Strauss: Yes!

Host: “I don’t remember!”

Karin Strauss: It’s one of those things. You bump into somebody in an airport and, you know, later on they contact you and say, “Hey, I’m forming a new group at Microsoft, would you like to come interview?”

Host: Did that happen?

Karin Strauss: That did happen.

Host: Do you remember who you bumped into?

Karin Strauss: Yeah, it was the manager who hired me into Microsoft, Doug Burger.

Host: All right, so one of the frameworks I read about this work in is an article that talked about how biotech has benefitted tremendously from the advances or the progress in silicon technology. And to some degree, they suggested that why we’re doing this is, it’s time for biotech to give back. In other words, for computer scientists to start putting biomolecules in their computer architecture. So, what’s next in the field here? What are the exciting lines of research that emerging researchers might be interested in?

Karin Strauss: One of the things we’re starting to look into, and are really excited about is, once you’ve stored data into DNA, there is the question of what else can you do with it as a molecule? Can you perform any operations over them at the molecular level, not just selecting them and then reading them? And it turns out there’s a whole field of DNA nanotechnology that looks at computing with these molecules. So, we even have, at Microsoft Research Cambridge, a group that has looked at bio-computation. And so, we’re starting also to see how we can use such techniques and the properties of DNA to perform operations over the data as it’s stored in DNA.

Host: As we close, Karin, what would you like people to know about your research that they might not know?

Karin Strauss: Yeah, so I think the thing to be aware of is that this is an emerging technology and we’re starting to work towards really building systems and wrapping services around it. It will take some time to get there, but I think what’s cool is that, you know, there’s a company like Microsoft who is really willing to invest and really think about, you know, how are we going to build solutions for the IT industry in the future? And so, I’m very happy and feel very fortunate to be working on this project.

Host: Karin Strauss, thank you for joining us today.

Karin Strauss: Thank you.

(music plays)

To learn more about Dr. Karin Strauss and the biological future of digital data storage, visit Microsoft.com/research
