Institute researchers reflect on the rise of big data in life science research

How big data is changing science

New biomedical techniques, like next-generation genome sequencing, are creating vast amounts of data and transforming the scientific landscape. They’re leading to unimaginable breakthroughs – but leaving researchers racing to keep up.

“This is when I start feeling my age,” says Anne Corcoran. She’s a scientist at the Babraham Institute, a human biology research centre in Cambridge, UK. Corcoran leads a group that looks at how our genomes – the DNA coiled in almost every cell in our bodies – relate to our immune systems, and specifically to the antibodies we make to defend against infection.

She is, in her own words, an “old-school biologist”, brought up on the skills of pipettes and Petri dishes and protective goggles, the science of experiments with glassware on benches – what’s known as “wet lab” work. “I knew what a gene looked like on a gel,” she says, thinking back to her early career.

These days that skill set is not enough. “When I started hiring PhD students 15 years ago, they were entirely wet lab,” Corcoran says. “Now when we recruit them, the first thing we look for is if they can cope with complex bioinformatic analysis.” To be a biologist, nowadays, you need to be a statistician, or even a programmer. You need to be able to work with algorithms.

An algorithm, essentially, is a set of instructions – a series of predefined steps. A recipe could be seen as an algorithm, although a more obvious example is a computer program. You take your input (ingredients, numbers, or anything), run it through the algorithm’s steps – which could be as simple as “add one to each number”, or as complex as Google’s search algorithm – and it provides an output: a cake, search results, or perhaps an Excel spreadsheet.
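To make the simplest of those examples concrete, here is a minimal sketch in Python of the “add one to each number” algorithm. The function name `add_one_to_each` is invented for illustration and is not taken from any of the researchers’ work.

```python
def add_one_to_each(numbers):
    """Run every input number through one predefined step: add one."""
    return [n + 1 for n in numbers]


# Input goes in, the steps are applied, an output comes out.
result = add_one_to_each([1, 2, 3])
print(result)  # prints [2, 3, 4]
```

The input is a list of numbers, the single step is applied to each of them in turn, and the output is a new list.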

Researchers like Corcoran need to use algorithms because, in the 17 years since she became a group leader, biology has changed. And the thing that has changed it is the vast – the overwhelmingly, dizzyingly vast – flood of data generated by new biomedical techniques, especially next-generation sequencing.

Not long ago, sequencing an entire genome – determining the order of all 3 billion pairs of DNA letters in the helix – took years. The Human Genome Project, the first completed sequence of an entire human genome, took around 13 years from conception to its completion in 2003, and cost more than £2 billion. Today, next-generation sequencing can do the same thing in 24 hours for not much more than a thousand pounds.

This has completely changed how scientists work. It’s not just that they get their hands dirty less often, nor simply that the required skills have changed. It’s that the whole process of science – how you come by an idea and test it – has been upended.

This has left a lot of senior scientists needing to understand and supervise techniques that didn’t exist when they trained. It’s left universities playing catch-up, with many degrees not teaching the skills that modern biologists need. But above all, it’s led to ground-breaking scientific discoveries – breakthroughs that simply wouldn’t have been possible 20 or even 10 years ago.

A 10-minute drive from Babraham, in a village called Hinxton, there’s another major life-sciences centre, the Wellcome Sanger Institute. It’s 25 years old this week, and the rapidly moving history of genomics is written in its very architecture.

“I did my postdoc at the Sanger,” says Moritz Gerstung, now a research group leader at the European Bioinformatics Institute next door. He chuckles at the memory. “You can almost sense when the building was conceived,” he says. “There’s so much space for laboratory work, and not so much for where scientists can sit and analyse data on a computer.”

This is true everywhere, says Gil McVean, a professor of statistical genetics at the University of Oxford’s Big Data Institute. Genomic research has become something done mainly on a laptop, not a workbench. “If you look at any 15-year-old research lab, they’re 90 per cent wet lab,” he says. “And if you go into one, almost all the people are sitting at computers. If you were to build a biomedical research centre today, you’d build it 10 per cent wet lab and 90 per cent computing.”

But that’s not the only change. “One of the big changes in science,” says McVean, “has been the move away from a very focused, targeted, hypothesis-driven approach, the ‘I’ve got this idea, I design the experiment, I run the experiment, and decide whether I was right or wrong’ model.”

It used to be that you had to have some plausible idea about why a gene might do something – that you could imagine some sensible-sounding biochemical pathway which could link the gene to a disease or trait. The time it took to sequence genes and the limited computing power available meant you had to be quite sure you were going to find something before you dedicated all that expensive lab and analysis time.

Now you just collect a lot of data and let the data decide what the hypothesis should be, says McVean. If you look at 10,000 genomes of people with a disease and 10,000 without, you can use an algorithm to compare them, find the differences and then work out which genes are linked to the disease, without having to think in advance about which ones they might be.

This approach is known as a genome-wide association study, a common form of analysis in the data-driven era. It’s a fairly simple idea. You take the genomes of a large number of people, sequence them, and then use an algorithm to compare all of the DNA – not just the 24,000 or so genes, which make up just 1–2 per cent of the genome, but also all of the still-somewhat-mysterious non-coding DNA too. The algorithm can be quite simple: for instance, comparing how frequently a certain DNA variant appears in people with a certain trait or condition and people without it. If the variant appears alongside a trait or condition significantly more often than you’d expect by chance, then the algorithm flags it up as a possible cause.
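The comparison at the heart of such a study can be sketched in a few lines of Python. What follows is a deliberately simplified illustration, not the method used by any of the groups quoted here. It assumes each variant is summarised as a count of carriers among cases and among controls, uses a plain Pearson chi-square test on the resulting 2×2 table, and adopts the conventional genome-wide significance threshold of 5e-8. The function names, variant labels and counts are all invented for the example, and real studies typically add corrections for ancestry and other confounders.

```python
import math


def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 table of variant carriers versus disease status."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # An empty row or column carries no information about association.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def flag_variants(carrier_counts, n_cases, n_controls, alpha=5e-8):
    """Return the variants whose carrier frequency differs between cases
    and controls by more than chance would plausibly explain.

    carrier_counts maps a variant label to a pair:
    (carriers among cases, carriers among controls).
    alpha is an assumed significance threshold, not a value from the article.
    """
    flagged = []
    for variant, (case_carriers, control_carriers) in carrier_counts.items():
        _, p_value = chi_square_2x2(
            case_carriers, n_cases - case_carriers,
            control_carriers, n_controls - control_carriers,
        )
        if p_value < alpha:
            flagged.append(variant)
    return flagged


# Made-up counts for 10,000 people with a disease and 10,000 without.
example_counts = {
    "variant_A": (3000, 2000),  # clearly more common in cases
    "variant_B": (2500, 2480),  # about equally common in both groups
}
print(flag_variants(example_counts, n_cases=10_000, n_controls=10_000))
# prints ['variant_A']
```

A flagged variant is only a statistical association: it marks a stretch of DNA as worth following up, not as a proven cause.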

Where it gets difficult is that diseases are almost all complex, and have tens or sometimes hundreds of genes or sections of non-coding DNA involved. This quickly leads to the need for complicated multidimensional analysis, and while the maths involved isn’t new, the sheer scale of the task means that algorithms are essential. Often they can be comparing tens or hundreds of parameters at a time.

It’s a bit like the Google search algorithm. The process it uses to rank each web page isn’t that complex – for instance, measuring how frequently your search terms appear on a page, then where on the page they appear, then how many links there are to that page, and so on. But it combines hundreds of these measures and applies them to billions of web pages simultaneously. It would be impossible for a human to do.

The algorithmic approach has brought great dividends. Gerstung’s field, the genomics of cancer, has perhaps had the most exciting developments, notably in leukaemia.

This devastating and often fatal disease can – in some cases – be successfully treated with a full bone-marrow transplant. But that is a major procedure whose complications can sometimes be fatal themselves. You only want to give it to patients with the most deadly forms of leukaemia.

Predicting which leukaemias will be the most deadly, though, is enormously difficult. The symptoms are complex and don’t always tell you enough about the prognosis.

So what Gerstung’s team did was sequence the genomes of 1,500 people’s cancers to find the DNA mutations driving them, and then see which mutations correlated with which outcomes. There were 5,000 different mutations among the patients, and around 1,000 different combinations, which the team divided up into 11 categories of greater or lesser risk. “It enables clinicians to make much more focused decisions,” Gerstung says.

The influence of the data-driven approach extends much further. Sequencing the genomes of tumours has caused a “mind change” in our approach to cancer in general, says Edd James, a professor of cancer immunology at the University of Southampton. “We’re now much more appreciative that a cancer isn’t just a mass of copied cells.”

A single cancer may contain dozens of different kinds of cell, each with different combinations of DNA mutations and each vulnerable to different drugs. So sequencing allows clinicians to better target drugs at the patients – and tumours – upon which they will work. “Before, people were treated as members of populations: ‘X per cent of people given this treatment will do well,’” says James. “But with this information, you can understand whether [individually] they’re going to get the benefit.”

As well as spotting differences, gene sequencing has revealed unexpected similarities between cancers too. Historically, says James, we’ve defined cancers by their anatomical site: as lung cancers, liver cancers, head-and-neck cancers and so on. “But using next-generation sequencing, you can see that there are cancers in different sites that share more in common with each other than with cancers in the same site. It’s made us realise that some drugs that work for, say, breast cancer might work on others,” he says.

Gerstung backs this up: “From a genetic perspective, there’s substantial overlap between cancers from different anatomical sites. One even finds BRCA1 [a gene heavily involved in breast cancer] in some prostate cancers.”

This is going to become increasingly important. The US Food and Drug Administration has recently licensed a cancer drug – pembrolizumab – for use in any cancer that shows signs of mismatch-repair deficiency, a form of DNA repair error. This is the beginning of drugs being licensed on the basis of a cancer’s genetics rather than location.

And it’s all because of the constant, gushing flow of data.

“We got so good at producing data,” says Nicole Wheeler, a data scientist at the Sanger Institute who looks at the genomes of pathogenic bacteria, “that we ended up with too much of it.” McVean agrees. “In Moore’s Law, the computing power you have doubles every 18 months,” he says. “The growth of biomedical data capture – through sequencing genomes, but also through medical imaging or digital pathology – is much faster than that. We’re super-Moore’s-Law-ing in biomedical data.”

It became completely impossible, in the early years of this century, for biological scientists to check their data themselves. And this meant that biologists had to recruit, or become, data scientists.

“We reached a bottleneck a few years ago,” says Anne Corcoran. “We had lots of data, but we didn’t know what to do with it. So algorithms had to be invented on the fly, to deal with the data and maximise it,” she continues. “When you’re looking at single genes, or a few, you can do it manually, but when you’re looking at the expression of 20,000 genes, you can’t even do the statistics by yourself.”

Biologists – many of whom grew up, as Corcoran did, working on benches with glassware, not desks and laptops – have had to learn to use these algorithms. “I think senior scientists are often intimidated by it,” she says, “and more reliant on their junior colleagues than they probably should be, or would like to admit that they are.”

She’s evolved a “working knowledge” of how these algorithms function, but admits that “it’s a slightly vulnerable period, where the people at the top don’t have the skills to check the work of the people beneath them”.

Wolf Reik, one of Corcoran’s colleagues at the Babraham Institute, who runs a research team looking at epigenetics, agrees. Older scientists have a completely different mindset, he says. “It’s quite funny – my staff in lab meetings think in terms of what the genome as a whole does. But I think about single genes and generalise from them – that’s how I learned to think.”

It’s important for people in his position, he says, to understand junior scientists’ work, and “most importantly develop an intuition about how to use the tools – because ultimately I put my name to the work”.

The younger scientists, on the other hand, have grown up with data. Some of them have come from that background – Gerstung did a physics undergraduate degree – although that’s true of some group leaders as well, such as McVean. But others who came through a more biological route have ended up talking in terms of coding. “I did biology as an undergrad, that’s my domain knowledge,” says Na Cai, a postdoctoral researcher at the Sanger Institute who studies how genotypes relate to various human traits.

“Now I’m doing statistical analysis every day. It’s been like learning another language, or several,” she says. “I had to switch my brain from thinking in terms of biochemical pathways and flowcharts to a more structured kind of thinking in terms of code.”

The senior scientists she works with have all been “quite good at keeping up with the latest developments,” she says. “They might not be able to write the code, but they understand what the analysis does.”

Wheeler, a colleague of Cai’s, also came through the biology route and ended up coding. “I don’t have a traditional software-engineering background,” she says. “I learned to code on the side, during my PhD. [My coding] isn’t the most efficient or glamorous, but it’s about seeing what you have to do computationally and making it happen.”

In response to these needs, undergraduate degrees have been changing in the last few years. Newcastle University, for instance, now has a bioinformatics module in its biology undergraduate course, and Reading’s final-year research projects involve computational biology, although the earlier optional computing modules have a low take-up, so students in their final year are learning the skills last-minute. Imperial College London, which already has bioinformatics courses, is planning to add programming for first- and second-years. “I think there’s a recognition that biology involves more data than we used to have,” says Wheeler, “so people need to have the skills to process it.”

But the change is slow, and sometimes opposed by students, not all of whom got into biology to code. “I’d say some undergrad courses are catching up,” says Corcoran. “But in general they have not, as exemplified by the proliferation of post-degree Master’s courses teaching these skills.”

The change is necessary, though. Even the most wet-lab-oriented scientists interviewed said they spend less than 50 per cent of their time doing experiments; some said it was as little as 10 per cent or even, in Cai’s case, none at all since she has become a full-time bioinformatician.

The shift towards being data-driven, says Wheeler, can be seen as a move from science that’s hypothesis-testing to one that’s hypothesis-generating. One scientist, who preferred not to put their name to the concern, worried that it had reduced the creativity in science, but according to Wheeler that’s not the case. “It’s moved the creativity around,” she says. “In some ways there’s more room for creativity. You can really try out some crazy ideas at relatively low cost.”

This has other advantages. “You can become attached to hypotheses,” says Matt Bawn, a bioinformatician at the Earlham Institute, a computational biology research centre in Norfolk, UK. “It’s better to be a disinterested observer with no preconceptions, to look at the blank canvas and let the picture emerge.”

But the greatest benefit is that data-driven studies are throwing up fascinating new findings all the time, in complex areas that were previously impossible to study.

Stefan Schoenfelder, another researcher at the Babraham Institute, studies the 3D shapes of chromosomes and how they affect gene expression. When the Human Genome Project was completed, it was discovered that there were far fewer genes than previously expected – about 24,000, roughly a quarter of what scientists thought was the minimum. The rest of the DNA didn’t code for proteins at all.

What has since been realised is that part of what those non-coding areas do is regulate the expression of the genes: they turn them on in some cells, off in others. And part of how they do that is by folding themselves into different shapes in different cells.

Chromosomes are usually depicted as X-shaped. But that’s only when a cell is dividing. The rest of the time, the two metres of DNA inside almost every cell is coiled up in a complex tangle. So a length of DNA can be located a vast distance away from a gene on the chromosome but still be able to regulate it because in practice the two have close physical contact, says Schoenfelder. “That’s why it’s important to study this in 3D context: if you just look at the sequences and assume they will regulate the gene next door, that’s often incorrect.”

On top of this, genomes fold very differently, Schoenfelder says. “The same genome in a T cell will have a different conformation to in a liver cell or in a brain cell, and that’s linked to different genes being expressed and the cells acquiring different functions.”

Working out the 3D shape in each context is incredibly difficult. It involves sequencing cell types and seeing how they differ from other cell types, as well as which bits of DNA are interacting in that context. But the DNA first has to be treated using a complex technique known as cross-linking and ligation in order to allow the sequencing to see which bits are near each other. If two distant points are found together, it might be that they have been folded that way in order for one to affect the other. But – much more often – it’s just the product of random jiggling.

Finding the real correlations among the noise requires looking at billions of data points and seeing which links keep coming up slightly more often than others. It’s then that the algorithms really come into play. Once you know which bits of the chromosome are regularly in contact with which other bits, you can use other algorithms to build 3D models based on those points of contact.

“This whole field is only about 15 years old,” says Schoenfelder. Before that, he says, “I didn’t think of the genome’s shape at all, I just thought of it as a ball of spaghetti crushed into the nucleus. I thought it was just a logistical problem, stuffing it into a nucleus that’s maybe 5 microns across.

“What’s blown me away is the fine level of regulation that exists, despite the extreme compaction, that still allows for this fine-tuning.” The 3D shapes of chromosomes, and which regulatory elements interact with which genes on that shape, will be a large part of the story of how the 200 cell types in the human body arise.

Meanwhile, McVean says that genomic research has forced clinicians to reclassify the disease multiple sclerosis entirely. “We’ve found more than 250 bits of the genome which light up in terms of risk for the disease,” he says. “That’s let us make quite strong statements about the risk for the individual. But it’s also allowed us to see overlaps with diseases like rheumatoid arthritis: some of the genes that raise your risk of MS decrease your risk of arthritis.

“So we’ve learned it’s an autoimmune disease, even though it presents as a neurodegenerative disease,” says McVean. “There are four or five companies with new therapeutic programmes coming out of this.”

And Wolf Reik at the Babraham Institute has a thrilling, almost science-fiction story to tell. His work is in the field of epigenetics, looking at how the chemical environment of a cell affects the expression of genes; he sequences RNA, the messenger molecule that allows DNA to be read and proteins made, to see how it differs from cell to cell. His group is especially interested in ageing.

Five years ago, it was discovered – and Reik’s work has since confirmed – that there is an ageing clock in all our cells. It’s called DNA methylation. There are four letters in the DNA alphabet: C (cytosine), A (adenine), G (guanine) and T (thymine). As we get older, more and more of the Cs on our DNA gain a little chemical marker called a methyl group. To read this clock, the work is simple – just counting the methyl groups up – but, again, the sheer number of data points returned is so enormous that they absolutely have to be counted by algorithm.

“Reading that clock, we can predict your age, and my age, to within three years,” says Reik. “Which is surprisingly accurate: the most accurate biomarker of ageing that we have.”

All of which is very interesting, of course: it’s “either a readout of an underlying ageing process, or our programmed life expectancy”. But Reik says the implication is that we could interrupt it: “I’m sure there will be drugs and small molecules that can slow this ageing clock down.”

It may be too much to hope that big data will help us all live for ever. But every scientist I spoke to agreed that the rise of algorithm-led, data-intensive genomic research has transformed the life sciences. It has left senior scientists sometimes unsure what their junior colleagues are doing, and left modern research centres with too much laboratory and not enough space for a laptop. The pace of change can be “disorienting”, says Schoenfelder.

“Life is a lot more complex now,” he says. “The skill set I had when I did my PhD, only 13 years ago, is absolutely not sufficient to keep up with today’s science.” But this change has brought an optimism back into genomic research. When the Human Genome Project neared completion, people were excited, believing that many diseases would fall quickly as their genetic components were revealed. But most of them turned out to be complex, polygenic, impossible to understand by looking at single genes. Now, though, it is possible to look at those diseases through the power of next-generation sequencing and tools that can sift the data it provides.

“Now when I run an experiment, I get 100 million, 200 million data points back,” says Schoenfelder. “I didn’t think that was possible in my lifetime, but it’s happened over the course of a few years. We can address questions that were completely off-limits 10 years ago. It’s been an extraordinary revolution.”

Wellcome, the publisher of Mosaic, founded the Wellcome Sanger Institute in 1993 and has funded it ever since. The Sanger Institute celebrates its 25th anniversary in October 2018.

Gil McVean currently receives funding from Wellcome through an Investigator Award. Wolf Reik participates in a Sanger Institute resource collaboration that is funded by Wellcome and receives funding from Wellcome through an Investigator Award.

This article first appeared on Mosaic and is republished here under a Creative Commons licence (excluding images). Images © Babraham Institute and Shutterstock.