DNA Fountain enables a robust and for a range of practical architectures (12) (figs. S1
to S5 and tables S1 to S3).
Previous studies of DNA storage realized about
efficient storage architecture half of the Shannon information capacity of DNA
molecules. In addition, most of the previous
studies reported challenges in perfect informa-
Yaniv Erlich1,2,3* and Dina Zielinski1
tion retrieval (Table 1). For example, two previ-
ous studies attempted to address oligo dropout
DNA is an attractive medium to store digital information. Here we report a storage strategy,
by dividing the original file into overlapping seg-
called DNA Fountain, that is highly robust and approaches the information capacity per
ments so that each input bit is represented by
nucleotide. Using our approach, we stored a full computer operating system, movie, and other
multiple oligos (4, 6). However, this repetitive
files with a total of 2.14 106 bytes in DNA oligonucleotides and perfectly retrieved the
coding procedure generates a loss of information
information from a sequencing coverage equivalent to a single tile of Illumina sequencing.
content and is poorly scalable (fig. S6). In both
We also tested a process that can allow 2.18 1015 retrievals using the original DNA sample
H
umanity is currently producing data at options. However, DNA encoding faces several perfectly retrieve the data, they were still far
exponential rates, creating a demand for practical limitations. First, not all DNA sequences from realizing the capacity. Moreover, testing
better storage devices. DNA is an excellent are created equal (13, 14). Biochemical constraints this strategy on a large file size highlighted dif-
medium for data storage, owing to its dem- dictate that DNA sequences with high GC con- ficulties in decoding the data due to local cor-
onstrated information density of petabytes tent or long homopolymer runs (e.g., AAAAAA) relations and large variations in the dropout
of data per gram, high durability, and evolution- are undesirable, as they are difficult to synthe- rates within each protected block (7), which is a
arily optimized machinery to faithfully replicate size and prone to sequencing errors. Second, known issue of blocked RS codes (16, 17). Only
this information (1, 2). Recently, a series of proof- oligonucleotide (hereafter oligo) synthesis, poly- after employing a complex multistep procedure
of-principle experiments has demonstrated the merase chain reaction (PCR) amplification, and and high sequencing coverage was the study able
value of DNA as a storage medium (39). decay of DNA during storage can induce uneven
To better understand its potential, we explored representation of the oligos (7, 15). This might
the Shannon information capacity (10, 11) of DNA result in dropout of a small fraction of oligos that 1
New York Genome Center, New York, NY 10013, USA.
2
storage (12). This measure sets a tight upper will not be available for decoding. In addition to Department of Computer Science, Fu Foundation School of
bound on the amount of information that can biochemical constraints, oligos are sequenced Engineering, Columbia University, New York, NY 10027, USA.
3
Center for Computational Biology and Bioinformatics
be reliably stored in each nucleotide. In an ideal in a pool and necessitate indexing to infer their (C2B2), Department of Systems Biology, Columbia
world, the information capacity of each nucleo- order, which further limits the number of avail- University, New York, NY 10027, USA.
tide could reach 2 bits, as there are four possible able nucleotides for encoding information. Quan- *Corresponding author. Email: yaniv@cs.columbia.edu
Table 1. Comparison of DNA storage coding schemes and experimen- recovery indicates whether all information was recovered without any error.
tal results. For consistency, the table describes only schemes that were Net information density indicates the input information in bits divided by the
empirically tested with pooled oligo synthesis and high-throughput sequenc- number of synthesized DNA nucleotides (excluding adapter annealing sites).
ing data. The schemes are presented chronologically on the basis of publica- Realized capacity is the ratio between the net information density and the
tion date. Coding potential is the maximal information content of each Shannon capacity of the channel. Physical density is the actual ratio of the
nucleotide before indexing or error correcting. Redundancy denotes the excess number of bytes encoded and the minimal weight of the DNA library used to
of synthesized oligos to provide robustness to dropouts. Error correction/ retrieve the information. This information was not available for studies by
detection denotes the presence of error-correction or -detection code to Bornholt et al. (6) and Blawat et al. (7), as indicated by the dashes. See (12)
handle synthesis and sequencing errors (RS, Reed-Solomon codes). Full for more information.
Parameter Church et al. (3) Goldman et al. (4) Grass et al. (5) Bornholt et al. (6) Blawat et al. (7) This work
to rescue a sufficient number of oligos. Taken rithm to infer the identities of the segments in pool. The input data were in the form of a tarball
together, these results inspired us to seek a cod- the droplet. Theoretically, it is possible to re- that packaged several files, including a complete
ing strategy that can better utilize the information verse the Luby transform using a highly efficient graphical operating system of 1.4 Mbytes, a movie,
capacity of DNA storage devices while showing algorithm by collecting any subset of droplets and other files (12) (Fig. 2A and fig. S8). We split
higher data-retrieval reliability. as long as the accumulated size of droplets is the input tarball into 67,088 segments of 32 bytes
We devised a strategy for DNA storage, called slightly bigger than the size of the original file. and iterated over the steps of DNA Fountain to
DNA Fountain, that approaches the Shannon ca- For DNA Fountain, our algorithm applies one create valid oligos. Each droplet was 38 bytes
pacity while providing robustness against data round of the transform in each iteration to create (304 bits): 4 bytes of the random-number gen-
corruption. Our strategy harnesses fountain codes a single droplet. Next, the algorithm moves to erator seed, 32 bytes for the data payload, and
(18, 19), which have been developed for reliable the droplet screening stage. This stage is not 2 bytes for an RS error-correcting code, to reject
and effective unicasting of information over chan- part of the original fountain code design and erroneous oligos in low-coverage conditions. With
nels that are subject to dropouts, such as mobile allows us to completely realize the coding poten- this seed length, our strategy supports encoding
TV (20). In our design, we carefully adapted the tial of each nucleotide. In screening, the algorithm files of up to 500 Mbytes (12). The DNA oligos
power of fountain codes to overcome both oligo translates the binary droplet to a DNA sequence had a length of 304/2 = 152 nucleotides (nt)
dropouts and the biochemical constraints of DNA by converting {00,01,10,11} to {A,C,G,T}, respec- and were screened for homopolymer runs of
storage. Our encoder works in three steps (Fig. tively. Then, it screens the sequence for the 3 nt and GC content of 45 to 55%. We instructed
1) (12): First, it preprocesses a binary file into a desired biochemical properties of GC content DNA Fountain to generate 72,000 oligos, yielding
series of nonoverlapping segments of a certain and homopolymer runs. If the sequence passes a redundancy of 72,000/67,088 1 = 7%. We se-
length. Next, it iterates over two computational the screen, it is considered valid and added to lected this number of oligos due to the price
Fig. 1. DNA Fountain encoding. (Left) Three main algorithmic steps. (Right) Example with a small file of 32 bits. For simplicity, we partitioned the file into
eight segments of 4 bits each. The seeds are represented as 2-bit numbers and are presented for display purposes only. See (12) for the full details of each
algorithmic step.
(Fig. 2C). To retrieve the information, we PCR- to 750,000 reads, equivalent to one tile of an reaction used 10 ng of material out of the 3-mg
amplified the oligo pool and sequenced the DNA Illumina MiSeq flow cell. This procedure re- master pool. Each subsequent PCR reaction con-
library on one Illumina MiSeq flow cell with 150 sulted in 1.3% oligo dropout from the library. sumed 1 ml of the previous PCR reaction and
paired-end cycles, which yielded 32 million reads. Despite these limitations, the decoder was employed 10 cycles in each 25-ml reaction. We se-
We employed a preprocessing strategy that pri- able to perfectly recover the original 2.1 Mbytes quenced the final library using one run on the
oritizes reads that are more likely to represent in 20 of 20 random down-sampling experi- Illumina MiSeq.
high-quality oligos (12). Because not all oligos are ments. These results indicate that beyond its Overall, this recursive PCR reflects one full
required for the decoding due to redundancy, this high information density, DNA Fountain also arm of an exponential process that theoretically
procedure reduces the exposure to erroneous reduces the amount of sequencing required for could generate 300 259 2 = 2.28 quadrillion
oligos. The decoder scans the reads, recovers the data retrieval, which is beneficial when storing copies of the file by repeating the same pro-
binary droplets, rejects droplets with errors based large-scale information. cedure with each aliquot (fig. S11). As expected,
on the RS code, and employs a message-passing DNA Fountain can also perfectly recover the the quality of the deep copy was substantially
algorithm to reverse the Luby transform and file after creating a deep copy of the sample. One worse than the initial experiment with the mas-
obtain the original data (12). of the caveats of DNA storage is that each re- ter pool. The average coverage per oligo dropped
In practice, decoding took ~9 min with a Py- trieval of information consumes an aliquot of the from an average of 7.8 perfect calls for each oligo
thon script on a single CPU of a standard laptop material. Copying the oligo library with PCR is per million reads (pcpm) to 4.5 pcpm in the deep
(movie S1). The decoder recovered the informa- possible, but this procedure introduces noise and copy. In addition, the deep copy showed much
tion with 100% accuracy after observing only induces oligo dropout. To further test the robust- higher skewed representation with a negative
69,870 oligos out of the 72,000 in our library ness of our strategy, we created a deep copy of binomial overdispersion parameter (1/size) of
72,000
5- Adapter Seed Data payload RS Adapter -3
4Byte 32Byte 2Byte
24nt 16nt 128nt 8nt 24nt
0 200nt
tar.gz: 2.14Mbyte
Master pool
Overdispersion: 0.15
0 5 10 15 20 25 30 Downsampling
Perfect calls per million reads [pcpm] Perfect decoding
to 750,000 reads
Deep copy
Copying samples
Master pool
Sequencing on
1l 1l 1l 1l 1l 1l 1l 1l
PCR #1 PCR #2 PCR #3 PCR #4 PCR #5 PCR #6 PCR #7 PCR #8 PCR #9 10l
MiSeq All reads Perfect
decoding
Mean = 4.5
PCR (10 rounds)
10ng out of 3g
0.006
25l 25l
Frequency
Downsampling
0.004
Perfect
Overdispersion: 0.76 to 5 million reads decoding
0.002
0.000
0 10 20 30 40 50 60
Perfect calls per million reads [pcpm]
Diluting samples
2.14 Mbytes after compression. (B) Structure of the oligos. Black labels, length
in bytes; red, length in nucleotides; RS, Reed-Solomon error-correcting code.
(C) Experimental results of the master pool. (D) Experimental procedures 10ng 1ng 100pg 10pg 1pg 100fg 10fg
(215Tb/g) (2Pb/g) (21Pb/g) (215Pb/g) (2Eb/g) (21Eb/g) (215Eb/g)
of deep copying of the oligo pool. (C and D) Histograms display the fre-
Multiplexed
quency of perfect calls per million sequenced reads (pcpm). Red, mean; sequencing
blue, negative binomial fit to the pcpm. (E) Serial dilution experiment. The
weight corresponds to the input DNA library that underwent PCR and
sequencing. Green, correctly decoded; red, decoding failure.
was able to fully recover the file without a To summarize, in this work, we reported an ically viable solution for long-term, high-latency
single error with the full sequencing data. After efficient and robust coding strategy that enables storage.
down-sampling the sequencing data to five mil- virtually unlimited data retrieval and high phys-
lion reads, resulting in an approximate dropout ical density while approaching the Shannon ca- REFERENCES AND NOTES
rate of 1.0%, we were able to perfectly recover pacity of DNA storage closer than any previous 1. M. R. Wallace, Kybernetes 7, 265268 (1978).
the file in 10 of 10 trials. These results suggest design. We tested our framework with a relatively 2. C. Bancroft, T. Bowler, B. Bloom, C. T. Clelland, Science 293,
that with DNA Fountain, DNA storage can be large file compared with those used in previous 17631765 (2001).
3. G. M. Church, Y. Gao, S. Kosuri, Science 337, 1628
copied virtually an unlimited number of times studies and were able to perfectly recover the data (2012).
while preserving the data integrity of the sample. under various tests. Implementing our approach 4. N. Goldman et al., Nature 494, 7780 (2013).
Next, we explored the maximal achievable in concert with long-term preservation tech- 5. R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, W. J. Stark,
physical density using DNA Fountain. The pio- niques, such as DNA embedding in silica beads Angew. Chem. Int. Ed. 54, 25522555 (2015).
neering study by Church et al. predicted that (5), might require further fine-tuning of the re- 6. J. Bornholt et al., in Proceedings of the Twenty-First
International Conference on Architectural Support for
DNA storage could theoretically achieve an dundancy levels. We expect that such fine-tuning Programming Languages and Operating Systems
information density of 680 Pbytes (P: peta-; can benefit from the high flexibility of the DNA (Association for Computing Machinery, 2016),
1015) per gram of DNA, assuming the storage of Fountain framework, which allows determina- pp. 637649.
100 molecules per oligo sequence (3). However, tion of virtually any redundancy level without 7. M. Blawat et al., Procedia Comput. Sci. 80, 10111022
(2016).
previous DNA storage experiments have never changing the software or affecting the decoding
8. S. H. T. Yazdi et al., IEEE Trans. Mol. Biol. Multi-Scale Commun.
tested the maximal density of their storage time. 1, 230248 (2015).
scheme. To find the maximal physical density, Moving forward, practical implementation of 9. S. M. Yazdi, Y. Yuan, J. Ma, H. Zhao, O. Milenkovic, Sci. Rep. 5,
Editor's Summary
Article Tools Visit the online version of this article to access the personalization and
article tools:
http://science.sciencemag.org/content/355/6328/950
Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week
in December, by the American Association for the Advancement of Science, 1200 New York
Avenue NW, Washington, DC 20005. Copyright 2016 by the American Association for the
Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.