The study of structural variation — large-scale changes in DNA that can, in some cases, refashion entire chromosomes — in the genomic era has lagged behind that of sequence variation. But there’s a growing appreciation of how important structural variants are to human biology and disease. What makes these variants more challenging to study, and what is being done to overcome those challenges?
Pinning the tail on the structural variants
Since DNA sequencers first started pumping out As, Ts, Cs, and Gs, much (some might argue most) of the effort in genetics and genomics has gone toward interpreting sequence variations: changes in which one or a few bases within a given gene or non-coding region are lost, duplicated, or replaced.
But a whole other world of variations is out there waiting to find its niche in the genomic era: structural variations, meaning duplications, deletions, inversions, and other changes involving larger DNA segments (often defined as those longer than 50 to 100 bases). Such variations are common in our genetic blueprints, accounting, by some estimates, for roughly 25% of variation in the human genome.
Technologies for finding and studying structural variants have evolved dramatically over the last sixty years. Each new technology has allowed scientists to see finer changes in chromosome structure:
Karyotyping (systematized in the 1950s) and its offshoots can reveal changes down to about three megabases long; anything smaller is effectively invisible.
Starting in the 1990s, array-based methods (e.g., comparative genomic hybridization, or aCGH) brought out variations down to tens of kilobases.
Methods developed starting in the 2000s and grounded in next-generation sequencing can theoretically provide a view that is both deeper (a resolution of tens of base pairs) and more comprehensive (revealing the full scope of structural variability across entire genomes).
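To make the resolution ladder above concrete, here is a minimal Python sketch that asks which technologies could, in principle, see a variant of a given size. The numeric floors are rough readings of the list above: "tens of kilobases" is assumed to mean about 10 kb and "tens of base pairs" about 10 bp. The names `DETECTION_FLOOR_BP` and `methods_able_to_detect` are illustrative only.

```python
# Approximate lower size limits (in base pairs), read off the list above.
# These are orders of magnitude, not instrument specifications.
DETECTION_FLOOR_BP = {
    "karyotyping": 3_000_000,          # about three megabases
    "array-based (aCGH)": 10_000,      # "tens of kilobases"; 10 kb is an assumed floor
    "next-generation sequencing": 10,  # "tens of base pairs"; 10 bp is an assumed floor
}


def methods_able_to_detect(variant_size_bp: int) -> list[str]:
    """Return the methods whose approximate floor is at or below the variant size."""
    return [
        method
        for method, floor in DETECTION_FLOOR_BP.items()
        if variant_size_bp >= floor
    ]


if __name__ == "__main__":
    for size in (5_000_000, 50_000, 500):
        print(f"{size:>9,} bp -> {methods_able_to_detect(size)}")
```

Under these assumed floors, a 500-base deletion is visible only to sequencing-based methods, which is why so much of the finer structural variation went unseen before the 2000s.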
Note the word “theoretically.” While sequencing has revolutionized how scientists study the relationships between genetics, biology, and traits and disease at the nucleotide level, researchers who study structural variations have struggled to reap the same benefits and make similar connections. The challenges lie at the intersection of technology and computation, and are inspiring a generation of scientists at the Broad and elsewhere to engineer computational methods and tools for squeezing structural insights out of sequence data in new and creative ways.
Structural variations as drivers of disease and more
It is no secret that structural variations can have a major impact on human biology and disease. Take schizophrenia, for instance. In 2016, two studies led by researchers in the Broad’s Stanley Center for Psychiatric Research — one related to the immune factor C4 by institute members Steven McCarroll and Beth Stevens, the other on copy number variants (CNVs, large scale DNA duplications and deletions) by institute member Ben Neale and more than 270 collaborators in the Psychiatric Genomics Consortium — opened new perspectives on how structural variants can fuel schizophrenia risk.
Structural variants can also fuel many forms of cancer. One of the first successful targeted cancer drugs, imatinib, countered the product of a translocation called BCR-ABL in chronic myelogenous leukemia. Extra copies of the MYCN gene portend a poor outcome for children with a neural tumor called neuroblastoma. Xiaoyang Zhang, Peter Choi, Joshua Francis, and Matthew Meyerson of Broad’s Cancer Program found recently that several cancers, including forms of lung and endometrial cancer, carry duplicates of the promoter for MYC, a powerful oncogene.
“Cancer cells can have thousands and thousands of structural rearrangements compared to normal cells,” Meyerson said. “We don’t yet know all of the important forms of variation, why they’re there, why they matter, or which ones matter.”
Not all structural variation is bad, however. Linda Boettger and McCarroll, for example, recently associated deletions in the gene for haptoglobin with reduced blood cholesterol levels.
“In and of themselves, these haptoglobin deletions aren’t going to determine a person’s cardiovascular risk,” McCarroll noted, “but these are some of the larger common variant effects that we know of for cholesterol levels.”
There’s also evolution to consider. Case in point: Natural selection seems to be favoring a large inversion in chromosome 17 in European populations.
“I think understanding the roles and mechanisms of structural rearrangements could be extremely important to understanding evolutionary biology,” Meyerson mused.
Why so hard?
What is it about structural variants that makes them so challenging to study using DNA sequence data?
“The challenge goes back to mapping and how sequence data is generated,” said Neale. Sequencing by synthesis, the predominant sequencing method, shears a genome into short pieces (each hundreds of base pairs long) that are read individually; sophisticated algorithms then assemble the reads and align them back to a reference genome.
But if a long DNA segment is deleted, inserted, or moved, the mapping algorithms struggle to match what the sequencer records with what the reference genome says should be present, especially if the variation is complex (an inversion surrounded by a duplication and a deletion, for instance).
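A toy Python sketch can illustrate the mapping problem just described. It builds a random reference, deletes a segment to create a "sample," cuts the sample into short overlapping reads, and looks each read up in the reference by exact match. Exact matching is a deliberate simplification (real aligners tolerate mismatches and can clip reads), and every name and number here is hypothetical.

```python
import random


def shear_into_reads(sequence: str, read_len: int, step: int) -> list[tuple[int, str]]:
    """Cut a sequence into overlapping fixed-length reads; return (start, read) pairs."""
    return [
        (i, sequence[i:i + read_len])
        for i in range(0, len(sequence) - read_len + 1, step)
    ]


def unmapped_read_starts(reference: str, sample: str, read_len: int, step: int) -> list[int]:
    """Return sample start positions of reads with no exact match in the reference."""
    return [
        start
        for start, read in shear_into_reads(sample, read_len, step)
        if reference.find(read) == -1
    ]


def make_example(seed: int = 0) -> tuple[str, str, int]:
    """Build a random 300-base reference and a sample lacking reference[100:180]."""
    rng = random.Random(seed)
    reference = "".join(rng.choice("ACGT") for _ in range(300))
    junction = 100  # position in the sample where the two flanks are joined
    sample = reference[:junction] + reference[180:]
    return reference, sample, junction


if __name__ == "__main__":
    reference, sample, junction = make_example()
    starts = unmapped_read_starts(reference, sample, read_len=30, step=5)
    print("reads that fail to map start at:", starts)
    print("all of them span the deletion junction:",
          all(s < junction < s + 30 for s in starts))
```

Reads drawn entirely from one side of the deletion still match the reference; only reads that straddle the junction can fail, because the sequence they carry does not exist as one contiguous stretch in the reference.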
Variants’ neighbors also frequently compound the problem. Repetitive DNA structures such as segmental duplications (long stretches of DNA more than 90% identical in sequence) flank many of the large variants known to contribute to human disease, further befuddling the algorithms.
“When you try to map structural variations from sequence, you need to think about read depth, the amount of sequence data you have for a given genetic position; the implied insert size between two reads’ ends; and split reads, where you have reads that overlap a region that’s been deleted or inserted,” Neale explained. “Integrating all of that together is challenging, and in the end it’s really all about determining what an individual’s sequence truly is in that region.”
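The three signals Neale lists can be sketched in a few lines of Python. This is not how any particular tool works; it is a minimal illustration that assumes read pairs have already been aligned and summarized. The `ReadPair` type, the expected insert size of 400 bases, the tolerance of 100, and the depth-ratio cutoff of 0.6 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReadPair:
    left_start: int      # reference coordinate where the left mate begins
    right_end: int       # reference coordinate where the right mate ends
    split: bool = False  # True if one mate aligned in two separate pieces


def implied_insert_size(pair: ReadPair) -> int:
    """Distance on the reference spanned by the two mates."""
    return pair.right_end - pair.left_start


def discordant_pairs(pairs, expected: int, tolerance: int):
    """Pairs whose implied insert size departs from the library's expected size."""
    return [p for p in pairs if abs(implied_insert_size(p) - expected) > tolerance]


def deletion_evidence(pairs, depth_in_region, depth_flanking,
                      expected_insert=400, tolerance=100, max_depth_ratio=0.6):
    """Summarize the three signals for a candidate deletion (thresholds are assumed)."""
    ratio = mean(depth_in_region) / mean(depth_flanking)
    return {
        "read_depth_drop": ratio < max_depth_ratio,
        "discordant_pairs": len(discordant_pairs(pairs, expected_insert, tolerance)),
        "split_reads": sum(p.split for p in pairs),
    }


if __name__ == "__main__":
    pairs = [
        ReadPair(1000, 1400),              # normal spacing
        ReadPair(1050, 1460),              # normal spacing
        ReadPair(1900, 3350),              # mates land far apart on the reference
        ReadPair(1950, 3400, split=True),  # far apart, and one mate crosses the breakpoint
    ]
    print(deletion_evidence(pairs, depth_in_region=[14, 16, 15],
                            depth_flanking=[30, 29, 31]))
```

In this made-up example, depth inside the region is about half that of the flanks, two pairs span far more reference sequence than the library should allow, and one read is split. Each signal alone is weak evidence; the difficulty Neale describes lies in weighing them together.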
In addition, as the genetics field morphed into the genomics field, computational scientists focused their algorithm and tool development efforts primarily on nucleotide variants. Researchers have four sequencing-based approaches at their disposal for highlighting structural features within genomes or exomes (read-pair, read-depth, split-read, and de novo assembly analyses), but each approach, and each algorithm based on it, has its own biases and weaknesses. None gives a comprehensive, unified view of everything that has gone awry in a sample.
“Structural variation got left behind a little bit,” Mike Talkowski, a Broad associate member who studies structural variations in autism and other neurodevelopmental and germline disorders, said of efforts to build genome and exome analysis tools. “Right now, it’s not possible to take a genome sequence, put it through a filter, and get that genome’s structure.”
The tools that have been developed, he adds, chiefly deal with CNVs. That leaves a lot of variation out.
“CNVs account for most structural variation in the genome,” Talkowski continued. “In a recent scan of variations greater than five kilobases, about 80% of structural variants were what would be considered canonical CNVs.
“The balanced variations,” he added, “the huge inversions and the big exchanges of chromosomes, and the complex variations that make up the other 20%, have been largely ignored.”
Back to the toolbench
All this is not to say that there are no tools available for picking structural variations out of sequence data. Far from it. Just from the Broad, for instance, two tools built on the GATK platform, GenomeSTRIP (developed by McCarroll and Bob Handsaker) and XHMM, probe for CNVs in germline genome and exome data, respectively. GISTIC, built by the Broad Cancer Program and included in its FireCloud cloud-based cancer data and analysis suite (as well as the program’s Integrative Genomics Viewer and GenePattern portals), seeks CNVs in tumor exomes. (Talkowski’s lab also recently revealed a new visualization tool called CNView.) Looking beyond the Broad community, dozens of genome- and/or exome-friendly algorithms and toolkits have been published.
At the moment, though, there is no do-it-all, standardized set of tools for routinely turning sequence data into structural insights. But maybe not for much longer. GATK4 (currently in alpha) will come with CNV and structural variant detection tools baked in. And Talkowski — along with McCarroll, Meyerson, and several other Broad and MIT researchers and software engineers — is part of a working group trying to make structural analysis simpler by developing a comprehensive suite of computational detection tools.
“We want to build and distribute tools that allow the analysis of structural variation to become just as routine as analysis of sequence variation,” he said. “We want to take leading tools like GenomeSTRIP and other algorithms, integrate them in a box, and make it available in a portal like GATK.”
“I’m enthusiastic about the potential for moving from alignment- to assembly-based methods,” Meyerson added, noting a tool called Snowman being developed in Broad associate member Rameen Beroukhim’s group. “They let you look at complex rearrangements without assuming a reference genome.”
Improved computation is only one part of the solution, though. “There’s a general desire for a better set of algorithms for finding structural events, but it is in a sense a moving target, because as we get longer read sequencing, all of these problems get a lot easier to solve,” Neale said. “So the question is, what will win first, the biological technology or the computational technology?”
“Longer reads will help,” McCarroll said, “but it’s also good to meet the technology where it is. Current sequencing methodology has generated data on tens of thousands of people. I’m excited at the prospect of extracting more knowledge from the data we have.”
Zhang X et al. Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers. Nature Genetics. Online December 14, 2015. DOI: 10.1038/ng.3470
Boettger LM et al. Recurring exon deletions in the haptoglobin (HP) gene associate with lower blood cholesterol levels. Nature Genetics. Online February 22, 2016. DOI: 10.1038/ng.3510
Boettger LM et al. Structural haplotypes and recent evolution of the human 17q21.31 region. Nature Genetics. Online July 1, 2012. DOI: 10.1038/ng.2334
Redin C et al. The genomic landscape of balanced cytogenetic abnormalities associated with human congenital anomalies. Nature Genetics. Online November 14, 2016. DOI: 10.1038/ng.3720
Chiang C et al. Complex reorganization and predominant non-homologous repair following chromosomal breakage in karyotypically balanced germline rearrangements and transgenic integration. Nature Genetics. Online March 4, 2012. DOI: 10.1038/ng.2202
Handsaker RE et al. Large multi-allelic copy number variations in humans. Nature Genetics. Online January 26, 2015. DOI: 10.1038/ng.3200
Collins RL et al. CNView: a visualization and annotation tool for copy number variation from whole-genome sequencing. Posted to bioRxiv April 20, 2016. DOI: 10.1101/049536
Q&A with Matthew Meyerson and Rameen Beroukhim: Beyond base pairs
Steve McCarroll: Structural variation and disease in the human genome