By aphilippakis
I’m writing today to convey an important announcement: the Broad’s Data Sciences Platform (DSP) has decided to make our Genome Analysis Toolkit (GATK) software package open source. In fact, we’ve decided that all of the software that DSP develops will be open source, so that we can better enable progress in biomedical research.
Making data and software available has been a key part of the Broad’s DNA since the days of the Human Genome Project, when we and other members of the international consortium made all the genomic sequences generated immediately and widely available. We’ve also open sourced the vast majority of software packages developed by our central software group.
But several years ago, we tried an experiment to see if we might better serve the community by adopting a slightly different model for GATK. As the GATK package grew in popularity, we faced increasing demands for user support, far beyond what we could provide to the many groups seeking to install and maintain GATK on their local servers. So, while continuing to make the source code visible to users, we decided to try licensing the package to an outside vendor that would provide support. We then heard that users preferred working directly with us rather than with a third-party entity, so we tried that approach next.
After carefully evaluating the experiment, we’ve concluded it’s time to make a change. There are three reasons.
First, we’ve found that supporting a technical software package like GATK requires more technical expertise than outside vendors can typically provide; our own GATK Forum is, by far, the most heavily used support resource.
Second, licensing GATK has limited the community’s ability to modify and share the code. We know that the best ideas come from a community working together.
Third, our concern about supporting hundreds of local installations of GATK has largely evaporated. Deploying GATK is now much easier than it was in the past, when groups had to work hard to install and maintain a running GATK pipeline. This is due to (i) our Cromwell workflow engine, which makes it easier to deploy best-practices pipelines, and (ii) the growing availability of software-as-a-service solutions on public clouds.
We’ve therefore decided to move GATK to open source, as we have done for most of our software in the past. Not only that, we’ve committed to making all existing and future software products written by the DSP open source, all to accelerate progress in biomedical research.
In reaching this decision, we’ve sought input from many academic colleagues and commercial users — including those with whom we’ve been collaborating on the GATK. They were uniformly supportive.
In the coming years, it will be increasingly important for the biomedical community to maximize access to software tools to accelerate progress. In particular, we’ll need software platforms that allow users to easily store, access, analyze, and visualize vast amounts of information from diverse sources, including sequencing data (from germline, cancer, and single cells), medical records, imaging data, and more, while respecting the privacy and wishes of the patients who donated their information. It’s important that this be done in an open source way, to ensure that the community can move ahead as quickly, effectively, and collaboratively as possible.