Tagged with #build



Created 2015-07-06 18:30:35 | Updated 2015-07-07 15:32:56 | Tags: queue gatk build


TL;DR: mvn -Ddisable.shadepackage verify


Background

In addition to Queue's GATK-wrapper codegen, relatively slow Scala compilation, etc., the Maven scripts still carry a lot of legacy compatibility from our Ant days. Our mvn verify behaves more like running ant, and builds everything needed to bundle the GATK.

As of GATK 3.4, the default build for the "protected" code generates jar files that contain every class needed for running, one for the GATK and one for Queue. The Maven shade plugin produces these, and each is called the "package jar". There is also a way to generate a jar file that contains only META-INF/MANIFEST.MF pointers to the dependency jar files, instead of zipping/shading them up. Each of these is the "executable jar"; they are always generated, as that takes seconds rather than minutes.


Instructions for fast compilation

While developing and recompiling Queue, disable the shaded jar with -Ddisable.shadepackage. Then run java -jar target/executable/Queue.jar ... If you need to transfer this jar to another machine or directory, you can't copy (or rsync) just the jar; you'll need the entire executable directory, because the jar's manifest only points to the dependency jars.

# Total expected time, on a local disk, with Queue:
#   ~5.0 min from clean
#   ~1.5 min per recompile
mvn -Ddisable.shadepackage verify

# always available
java -jar target/executable/Queue.jar --help

# not found when shade disabled
java -jar target/package/Queue.jar --help

If you are only developing for the GATK, skip Queue by also adding -P\!queue.

mvn -Ddisable.shadepackage -P\!queue verify

# always available
java -jar target/executable/GenomeAnalysisTK.jar --help

# not found when queue profile disabled
java -jar target/executable/Queue.jar --help
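
Because the executable jar only points at its dependency jars, moving it means moving the whole directory. A minimal sketch of that copy step follows. The helper name copy_executable_dir is hypothetical and not part of the GATK build, and it assumes only that everything the jar needs lives under the directory you pass in.

```shell
# Hypothetical helper: copy the whole executable directory, not just one jar.
# $1: source directory (e.g. target/executable)
# $2: destination directory (created if missing)
copy_executable_dir() {
  src=$1
  dest=$2
  if [ ! -d "$src" ]; then
    echo "not a directory: $src" >&2
    return 1
  fi
  mkdir -p "$dest" || return 1
  # "$src/." copies the directory's contents, including any subdirectories.
  cp -R "$src/." "$dest/"
}
```

For example, copy_executable_dir target/executable /some/other/dir should leave a Queue.jar in /some/other/dir alongside the dependency jars that its manifest refers to.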

Created 2014-05-16 16:31:03 | Updated 2014-09-29 17:57:56 | Tags: maven build sting

Overview

The GATK 3.2 source code uses new java package names, directory paths, and executable jars. Post GATK 3.2, any patches submitted via pull requests should also include classes moved to the appropriate artifact.

Note that this document includes references to the private module, which is part of our internal development codebase but is not available to the general public.

Summary

A long term ideal of the GATK is to separate out reusable parts and eventually make them available as compiled libraries via centralized binary repositories. Ahead of publishing, a number of steps must be completed. One of the larger steps has been completed for GATK 3.2, in which all references to Sting in the code base were rebranded to GATK.

Currently implemented changes include:

  • Java/Scala package names changed from org.broadinstitute.sting to org.broadinstitute.gatk
  • Renamed Maven artifacts including new directories

As of May 16, 2014, remaining TODOs ahead of publishing to central include:

  • Uploading all transitive GATK dependencies to central repositories
  • Separating a bit more of the intertwined utility, engine, and tool classes

Now that the new package names and Maven artifacts are available, any pull request should include ensuring that updated classes are also moved into the correct GATK Maven artifact. While there are a significant number of classes, cleaning up as we go along will allow the larger task to be completed in a distributed fashion.

The full lists of new Maven artifacts and renamed packages are below under [Renamed Artifact Directories]. For those developers in the middle of a git rebase around commits before and after 3.2, here is an abridged mapping of renamed directories for those trying to locate files:

Old Maven Artifact          New Maven Artifact
public/sting-root           public/gatk-root
public/sting-utils          public/gatk-utils
public/gatk-framework       public/gatk-tools-public
public/queue-framework      public/gatk-queue
protected/gatk-protected    protected/gatk-tools-protected
private/gatk-private        private/gatk-tools-private
private/queue-private       private/gatk-queue-private
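
When tracking down old paths by hand, the abridged mapping above can be applied mechanically. The sketch below is an illustration only. map_artifact_dir is a hypothetical helper, it rewrites just the leading artifact directory, and it does not touch the package directories (org/broadinstitute/sting/...) or the relocated qscripts.

```shell
# Hypothetical helper: rewrite the leading pre-3.2 artifact directory of a
# path to its 3.2 name. Paths that match no rule are printed unchanged.
# $1: a path relative to the top of the git clone
map_artifact_dir() {
  printf '%s\n' "$1" | sed \
    -e 's#^public/sting-root/#public/gatk-root/#' \
    -e 's#^public/sting-utils/#public/gatk-utils/#' \
    -e 's#^public/gatk-framework/#public/gatk-tools-public/#' \
    -e 's#^public/queue-framework/#public/gatk-queue/#' \
    -e 's#^protected/gatk-protected/#protected/gatk-tools-protected/#' \
    -e 's#^private/gatk-private/#private/gatk-tools-private/#' \
    -e 's#^private/queue-private/#private/gatk-queue-private/#'
}
```

For example, map_artifact_dir public/sting-utils/pom.xml prints public/gatk-utils/pom.xml.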

QScripts are no longer located with the Queue engine, and instead are now located with the GATK wrappers implemented as Queue extensions. See [Separated Queue Extensions] for more info.

Changes

Separating the GATK Engine and Tools

Starting with GATK 3.2, separate Maven utility artifacts exist to separate reusable portions of the GATK engine apart from tool specific implementations. The biggest impact this will have on developers is the separation of the walkers packages.

In GATK versions <= 3.1 there was one package for both the base classes and the implementations of walkers:

  • org.broadinstitute.sting.gatk.walkers

In GATK versions >= 3.2 there are two packages. The first contains the base interfaces, annotations, etc. The second is for the concrete tools implemented as walkers:

  • org.broadinstitute.gatk.engine.walkers
    • Ex: ReadWalker, LocusWalker, @PartitionBy, @Requires, etc.
  • org.broadinstitute.gatk.tools.walkers
    • Ex: PrintReads, VariantEval, IndelRealigner, HaplotypeCaller, etc.

Renamed Binary Packages

Previously, depending on how the source code was compiled, the executable gatk-package-3.1.jar and queue-package-3.1.jar (aka GenomeAnalysisTK.jar and Queue.jar) contained various mixes of public/protected/private code. For example, depending on whether the private directory was present when the source code was compiled, the same artifact named gatk-package-3.1.jar might or might not contain private code.

Starting with 3.2, there are two versions of the jar created, each with specific file contents.

New Maven Artifact                         Alias in the /target folder      Packaged contents
gatk-package-distribution-3.2.jar          GenomeAnalysisTK.jar             public,protected
gatk-package-internal-3.2.jar              GenomeAnalysisTK-internal.jar    public,protected,private
gatk-queue-package-distribution-3.2.jar    Queue.jar                        public,protected
gatk-queue-package-internal-3.2.jar        Queue-internal.jar               public,protected,private

Separated Queue Extensions

When creating a packaged version of Queue, the GATKExtensionsGenerator builds Queue engine compatible command line wrappers around each GATK walker. Previously, the wrappers were generated during the compilation of the Queue framework. Similar to the binary packages, depending on who built the source code, queue-framework-3.1.jar would contain various mixes of public/protected/private wrappers.

Starting with GATK 3.2, the gatk-queue-3.2.jar only contains code for the Queue engine. Generated and manually created extensions for wrapping any other command line programs are all included in separate artifacts. Due to a current limitation regarding how the generator uses reflection, the generator cannot build wrappers for just private classes without also generating protected and public classes. Thus, three different Maven artifacts are generated, containing different mixes of public, protected and private wrappers.

Extensions Artifact                           Generated wrappers for GATK tools
gatk-queue-extensions-public-3.2.jar          public only
gatk-queue-extensions-distribution-3.2.jar    public,protected
gatk-queue-extensions-internal-3.2.jar        public,protected,private

As for QScripts that used to be located with the framework, they are now located with the generated wrappers.

Old QScripts Artifact Directory             New QScripts Artifact Directory
public/queue-framework/src/main/qscripts    public/gatk-queue-extensions-public/src/main/qscripts
private/queue-private/src/main/qscripts     private/gatk-queue-extensions-internal/src/main/qscripts

Renamed Artifact Directories

The following list shows the mapping of artifact names pre and post GATK 3.2. In addition to the engine changes, the packaging updates and extensions changes above also affected Maven artifact refactoring. The packaging artifacts have split from a single public to protected and private versions, and new queue extensions artifacts have been added as well.

Maven Artifact <= GATK 3.1     Maven Artifact >= GATK 3.2
/pom.xml (sting-aggregator)    /pom.xml (gatk-aggregator)
public/sting-root              public/gatk-root
public/sting-utils             public/gatk-utils
none                           public/gatk-engine
public/gatk-framework          public/gatk-tools-public
public/queue-framework         public/gatk-queue
public/gatk-queue-extgen       public/gatk-queue-extensions-generator
protected/gatk-protected       protected/gatk-tools-protected
private/gatk-private           private/gatk-tools-private
private/queue-private          private/gatk-queue-private
public/gatk-package            protected/gatk-package-distribution
public/queue-package           protected/gatk-queue-package-distribution
none                           private/gatk-package-internal
none                           private/gatk-queue-package-internal
none                           public/gatk-queue-extensions-public
none                           protected/gatk-queue-extensions-distribution
none                           private/gatk-queue-extensions-internal

A note regarding the aggregator:

The aggregator is the pom.xml in the top directory level of the GATK source code. When someone clones the GATK source code and runs mvn in the top level directory, the aggregator is the pom.xml that is executed.

The root is a pom.xml that contains all common Maven configuration. There are a couple dependent pom.xml files that inherit configuration from the root, but are NOT aggregated during normal source compilation.

As of GATK 3.2, these un-aggregated child artifacts are VectorPairHMM and picard-maven. They should not be built by default each time mvn is run on the GATK source code.

For more clarification on Maven Inheritance vs. Aggregation, see the Maven introduction to the pom.

Renamed Java/Scala Package Names

In GATK 3.2, except for classes with Sting in the name, all file names are still the same. To locate migrated files under the new java package names, developers should use either IntelliJ IDEA Navigation or /bin/find to locate the same file they used previously.
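
Since file names are mostly unchanged, a lookup by name is usually enough to find where a class went. The following is a small sketch of the find approach. find_source_file is a hypothetical helper, and skipping directories named target is an assumption made here to keep Maven build output out of the results.

```shell
# Hypothetical helper: locate a source file by name, ignoring build output.
# $1: directory to search (e.g. the top of the git clone)
# $2: file name to look for (e.g. GenotypingEngine.java)
find_source_file() {
  # Prune any directory named "target", then print matching regular files.
  find "$1" -type d -name target -prune -o -type f -name "$2" -print
}
```

For example, find_source_file . GenotypingEngine.java prints every copy of that file under the current directory, one path per line.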

The biggest change most developers will face is the new package names for GATK classes. Code entanglement does not permit simply moving the classes into the correct Maven artifacts, as a small number of lines of code must be edited inside a large number of files. So, after the renaming, only a very small number of classes were moved out of the incorrect Maven artifacts, as examples.

As of May 16, 2014, the migrated GATK package distribution is as follows. This list includes only main classes. The table excludes all tests, renamed files such as StingException, certain private Queue wrappers, and qscripts renamed to end in *.scala.

Scope      Type      <= 3.1 Artifact  <= 3.1 Package  >= GATK 3.2 Artifact            >= 3.2 GATK Package  Files
public     java      gatk-framework   o.b.s           gatk-utils                      o.b.g                4
public     java      gatk-framework   o.b.s.gatk      gatk-engine                     o.b.g.engine         2
public     java      gatk-framework   o.b.s           gatk-tools-public               o.b.g                202
public     java      gatk-framework   o.b.s           gatk-tools-public               o.b.g.utils          49
public     java      gatk-framework   o.b.s           gatk-tools-public               o.b.g.engine         34
public     java      gatk-framework   o.b.s.gatk      gatk-tools-public               o.b.g.engine         244
public     java      gatk-framework   o.b.s.gatk      gatk-tools-public               o.b.g.tools          134
public     java      gatk-framework   o.b.s.gatk      gatk-tools-public               o.b.g.tools.walkers  2
protected  java      gatk-protected   o.b.s           gatk-tools-protected            o.b.g                44
protected  java      gatk-protected   o.b.s.gatk      gatk-tools-protected            o.b.g.engine         1
protected  java      gatk-protected   o.b.s.gatk      gatk-tools-protected            o.b.g.tools          209
private    java      gatk-private     o.b.s           gatk-tools-private              o.b.g                23
private    java      gatk-private     o.b.s           gatk-tools-private              o.b.g.utils          7
private    java      gatk-private     o.b.s.gatk      gatk-tools-private              o.b.g.engine         5
private    java      gatk-private     o.b.s.gatk      gatk-tools-private              o.b.g.tools          133
public     java      queue-framework  o.b.s           gatk-queue                      o.b.g                2
public     scala     queue-framework  o.b.s           gatk-queue                      o.b.g                72
public     scala     queue-framework  o.b.s           gatk-queue-extensions-public    o.b.g                31
public     qscripts  queue-framework  o.b.s           gatk-queue-extensions-public    o.b.g                12
private    scala     queue-private    o.b.s           gatk-queue-private              o.b.g                2
private    qscripts  queue-private    o.b.s           gatk-queue-extensions-internal  o.b.g                118

During all future code modifications and pull requests, classes should be refactored into the correct artifacts and packages as follows.

All non-engine tools should be in the tools artifacts, with appropriate sub-package names.

Scope      Type      Artifact                        Package(s)
public     java      gatk-utils                      o.b.g.utils
public     java      gatk-engine                     o.b.g.engine
public     java      gatk-tools-public               o.b.g.tools.walkers
public     java      gatk-tools-public               o.b.g.tools.*
protected  java      gatk-tools-protected            o.b.g.tools.walkers
protected  java      gatk-tools-protected            o.b.g.tools.*
private    java      gatk-tools-private              o.b.g.tools.walkers
private    java      gatk-tools-private              o.b.g.tools.*
public     java      gatk-queue                      o.b.g.queue
public     scala     gatk-queue                      o.b.g.queue
public     scala     gatk-queue-extensions-public    o.b.g.queue.extensions
public     qscripts  gatk-queue-extensions-public    o.b.g.queue.qscripts
private    scala     gatk-queue-private              o.b.g.queue
private    qscripts  gatk-queue-extensions-internal  o.b.g.queue.qscripts

Renamed Classes

The following class names were updated to replace Sting with GATK.

Old Sting class                 New GATK class
ArtificialStingSAMFileWriter    ArtificialGATKSAMFileWriter
ReviewedStingException          ReviewedGATKException
StingException                  GATKException
StingSAMFileWriter              GATKSAMFileWriter
StingSAMIterator                GATKSAMIterator
StingSAMIteratorAdapter         GATKSAMIteratorAdapter
StingSAMRecordIterator          GATKSAMRecordIterator
StingTextReporter               GATKTextReporter
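
For code written against the old names, the table above can be applied as a text filter. The sketch below is one possible way to do that, not a tool shipped with the GATK. rename_sting_classes is a hypothetical helper, and it does plain substring replacement, so a class of your own whose name merely contains one of the old names would be rewritten too.

```shell
# Hypothetical helper: read source text on stdin and write it to stdout with
# the eight renamed class names from the table above replaced.
rename_sting_classes() {
  sed \
    -e 's/ArtificialStingSAMFileWriter/ArtificialGATKSAMFileWriter/g' \
    -e 's/ReviewedStingException/ReviewedGATKException/g' \
    -e 's/StingSAMIteratorAdapter/GATKSAMIteratorAdapter/g' \
    -e 's/StingSAMRecordIterator/GATKSAMRecordIterator/g' \
    -e 's/StingSAMFileWriter/GATKSAMFileWriter/g' \
    -e 's/StingSAMIterator/GATKSAMIterator/g' \
    -e 's/StingTextReporter/GATKTextReporter/g' \
    -e 's/StingException/GATKException/g'
}
```

A typical use would be rename_sting_classes < Old.java > New.java, followed by reviewing the diff before committing.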

Common Git/Maven Issues

Renamed files

The 3.2 renaming patch is actually split into two commits. The first commit renames the files without making any content changes, while the second changes the contents of the files without changing any file paths.

When dealing with renamed files, it is best to work with a clean directory during rebasing. It will be easier for you to track files that you may not have added to git.

After running a git rebase or merge, you may first run into problems with files that you renamed and that were also moved during the GATK 3.2 package renaming. As a general rule, the renaming only changes directory names. The exceptions to this rule are classes such as StingException, which was renamed to GATKException; these are listed under [Renamed Classes]. The workflow for resolving these merge issues is to find the list of your renamed files, put your content in the correct location, then register the changes with git.

To obtain the list of renamed directories and files:

  1. Use git status to get a list of affected files
  2. Find the common old directory and file name under "both deleted"
  3. Find your new file name under "added by them" (yes, you are "them")
  4. Find the new directory under "added by us"

Then, to resolve the issue for each file:

  1. Move your copy of your renamed file to the new directory
  2. git rm the old paths as appropriate
  3. git add the new path
  4. Repeat for other files until git status shows "all conflicts fixed"

Upon first rebasing you will see a lot of text. At this moment, you can ignore most of it, and use git status instead.

For the purposes of illustration, while running git rebase it is perfectly normal to see something similar to:

$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: <<< Your first commit message here >>>
Using index info to reconstruct a base tree...
A   protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
A   protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java
<<<Other files that you renamed.>>>
warning: squelched 12 whitespace errors
warning: 34 lines add whitespace errors.
Falling back to patching base and 3-way merge...
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java"->"protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java" in branch "HEAD" rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java"->"protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java" in "<<< Your first commit message here >>>"
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java"->"protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java" in branch "HEAD" rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java"->"protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java" in "<<< Your first commit message here >>>"
Failed to merge in the changes.
Patch failed at 0001 Example conflict.
The copy of the patch that failed is found in:
   /Users/zzuser/src/gsa-unstable/.git/rebase-apply/patch

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".

$

While everything you need to resolve the issue is technically in the message above, it may be much easier to track what's going on using git status.

$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
  (fix conflicts and then run "git rebase --continue")
  (use "git rebase --skip" to skip this patch)
  (use "git rebase --abort" to check out the original branch)

Unmerged paths:
  (use "git reset HEAD <file>..." to unstage)
  (use "git add/rm <file>..." as appropriate to mark resolution)

    added by them:      protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
    both deleted:       protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
    added by them:      protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java
    both deleted:       protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java
    added by us:        protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java
    added by us:        protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java

Untracked files:
  (use "git add <file>..." to include in what will be committed)

<<< possible untracked files if your working directory is not clean>>>

no changes added to commit (use "git add" and/or "git commit -a")
$ 

Let's look at the main java file as an example. If you are having issues figuring out the new directory and new file name, they are all listed in the output.

Path in the common ancestor branch:
 |      old source directory       |                     old package name                     |   old file name     |
  protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java

Path in the new master branch before merge:
 |           new source directory             |                 new package name                    |   old file name     |
  protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java

Path in your branch before merge:
 |      old source directory       |                     old package name                     |           new file name            |
  protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

Path in your branch post merge:
 |           new source directory             |                 new package name                    |           new file name            |
  protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java    

After identifying the new paths for use post merge, use the following workflow for each file:

  1. Move or copy your version of the renamed file to the new directory
  2. git rm the three old file paths: common ancestor, old directory with new file name, and new directory with old file name
  3. git add the new file name in the new directory
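
The post-merge path is simply the new directory (from the "added by us" entry) combined with your new file name (from the "added by them" entry). A minimal sketch of that derivation follows; post_merge_path is a hypothetical helper, not a git command.

```shell
# Hypothetical helper: print the path a renamed file should have after the merge.
# $1: the "added by us" path   (new directory, old file name)
# $2: the "added by them" path (old directory, new file name)
post_merge_path() {
  printf '%s/%s\n' "$(dirname "$1")" "$(basename "$2")"
}
```

With the two GenotypingEngine.java entries from the git status output above, this prints the "Path in your branch post merge" shown earlier. That is the path to move your file to and git add, after which the three old paths can be removed with git rm.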

After you process all files correctly, the output of git status should show "all conflicts fixed" and list all your files as renamed.

$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
  (all conflicts fixed: run "git rebase --continue")

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    renamed:    protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java -> protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
    renamed:    protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java -> protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java

Untracked files:
  (use "git add <file>..." to include in what will be committed)

<<< possible untracked files if your working directory is not clean>>>

$

Continue your rebase, handling other merges as normal.

$ git rebase --continue

Fixing imports

Because all the package names are different in 3.2, while rebasing you may run into conflicts due to imports you also changed. Use your favorite editor to fix the imports within the files. Then try recompiling, and repeat as necessary until your code works.
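
Before recompiling, it can save a round trip to list the files that still mention the old package prefix. This is a rough sketch under the assumption that any remaining org.broadinstitute.sting reference in a .java or .scala file is something you want to review. list_sting_references is a hypothetical helper.

```shell
# Hypothetical helper: list .java and .scala files that still mention the
# old org.broadinstitute.sting package prefix (imports, package lines,
# or fully qualified names).
# $1: directory to scan
list_sting_references() {
  find "$1" -type f \( -name '*.java' -o -name '*.scala' \) \
    -exec grep -l 'org\.broadinstitute\.sting' {} +
}
```

The helper only reports files. The correct replacement still depends on where each class ended up (for example engine versus tools packages), so the edits themselves are left to you.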

While editing the files with conflicts with a basic text editor may work, IntelliJ IDEA also offers a special merge tool that may help via the menu:

VCS > Git > Resolve Conflicts...

For each file, click on the "Merge" button in the first dialog. Use the various buttons in the Conflict Resolution Tool to automatically accept any changes that are not in conflict. Then find and edit any remaining conflicts that require further manual intervention.

Once you begin editing the import statements in the three way merge tool, another IntelliJ IDEA 13.1 feature that may speed up repairing blocks of import statements is Multiple Selections. Find a block of import lines that need the same changes. Hold down the option key as you drag your cursor vertically down the edit point on each import line. Then begin typing or deleting text from the multiple lines.

Switching branches

Even after a successful merge, you may still run into stale GATK code or links from modifications before and after the 3.2 package renaming. To significantly reduce these chances, run mvn clean before and then again after switching branches.

If this doesn't work, run mvn clean && git status, looking for any directories that shouldn't be in the current branch. It is possible that some files were not correctly moved, including classes or test resources. Find the files still in the old directories via a command such as find public/gatk-framework -type f. Then move them to the correct new directories and commit them into git.
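
The same check can be run over every old directory from the abridged mapping in one pass. The sketch below assumes you are on a 3.2 or later branch, where none of these directories should contain files after mvn clean. list_leftover_files is a hypothetical helper.

```shell
# Hypothetical helper: list files still sitting in pre-3.2 artifact directories.
# $1: top of the git clone
list_leftover_files() {
  for old in public/sting-root public/sting-utils public/gatk-framework \
             public/queue-framework protected/gatk-protected \
             private/gatk-private private/queue-private; do
    # Only search old directories that actually exist in this checkout.
    if [ -d "$1/$old" ]; then
      find "$1/$old" -type f
    fi
  done
}
```

An empty result from list_leftover_files . suggests nothing was left behind in the old locations.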

Slow Builds with Queue and Private

Due to the [Renamed Binary Packages], the separate artifacts including and excluding private code are now packaged during the Maven package build lifecycle.

When building packages, to significantly speed up the default packaging time, if you only require the GATK tools, run mvn verify -P\!queue.

Alternatively, if you do not require building private source, then disable private compiling via mvn verify -P\!private.

The two may be combined as well via: mvn verify -P\!queue,\!private.

The exclamation mark is a special character in interactive shells such as bash and tcsh, where it triggers history expansion, so it must be escaped, in the above case with a backslash. Single quotes may also be used: mvn verify -P'!queue,!private'.

Alternatively, developers with access to private may often want to disable packaging the protected distributions. In this case, use the gsadev profile. This may be done via mvn verify -Pgsadev or, excluding Queue, mvn verify -Pgsadev,\!queue.
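
In wrapper scripts, the quoting can be handled once by building the profile argument programmatically. This is only a sketch. disable_profiles_arg is a hypothetical helper, and it knows nothing about which profile names the GATK pom files actually define.

```shell
# Hypothetical helper: build a Maven -P argument that disables each named profile.
# Prints an empty line when called without arguments.
disable_profiles_arg() {
  bang='!'   # single quotes keep the shell from interpreting the character
  arg=
  for profile in "$@"; do
    if [ -z "$arg" ]; then
      arg="-P${bang}${profile}"
    else
      arg="${arg},${bang}${profile}"
    fi
  done
  printf '%s\n' "$arg"
}
```

For example, mvn verify $(disable_profiles_arg queue private) is a way to run the combined command above without typing any escapes.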

Stale symlinks

Users see errors from Maven when an unclean git repo is updated. Because BaseTest.java currently hardcodes relative paths to "public/testdata", Maven creates symbolic links throughout the source tree so that the various tests in different modules can find the relative path "public/testdata".

However, our Maven support has evolved from 2.8, to 3.0, to now the 3.2 renaming, and each step has changed the symbolic link's target directory. Whenever a stale symbolic link to an old testdata directory remains in the user's folder, Maven refuses to remove the link, because it doesn't know why the link points to the wrong folder (answer: the link is from an old git checkout) and treats it as a bug in the build.

If you don't have a stale / unclean Maven repo when updating git via merge/rebase/checkout, you will never see this issue.

The script that can remove the stale symlinks, public/src/main/scripts/shell/delete_maven_links.sh, should run automatically during a mvn test-compile or mvn verify.
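
To inspect the situation by hand, you can list symbolic links whose targets no longer exist. The sketch below is not what delete_maven_links.sh does (its contents are not shown here). list_dangling_links is a hypothetical helper, and it will miss a stale link whose old target directory still happens to exist.

```shell
# Hypothetical helper: list symbolic links under a directory whose target
# is missing.
# $1: directory to scan (e.g. the top of the git clone)
list_dangling_links() {
  # "test -e" follows the link, so it fails exactly when the target is gone.
  find "$1" -type l ! -exec test -e {} \; -print
}
```

Review the output of list_dangling_links . before removing anything, since not every dangling link is necessarily one created by the build.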


Created 2014-04-04 18:53:24 | Updated 2015-07-28 15:53:10 | Tags: intellij maven build

Overview

Since GATK 3.0, we use Apache Maven (instead of Ant) as our build system, and IntelliJ as our IDE (Integrated Development Environment). This document describes how to get set up to use Maven as well as how to create an IntelliJ project around our Maven project structure.

Before you start

  • Ensure that you have git clones of our repositories on your machine. See this document for details on obtaining the GATK source code from our Git repos.

Setting up Maven

  1. Check whether you can run mvn --version on your machine. If you can't, install Maven from here.

  2. Ensure that the JAVA_HOME environment variable is properly set. If it's not, add the appropriate line to your shell's startup file:

    for tcsh:

    setenv JAVA_HOME `/usr/libexec/java_home`

    for bash:

    export JAVA_HOME=`/usr/libexec/java_home`

Note that the commands above use backticks, not single quotes.
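
A quick way to confirm the variable is usable is to check that it points at a directory containing bin/java. The following is a minimal sketch. check_java_home is a hypothetical helper, and the bin/java layout is the usual JDK convention rather than something specific to the GATK.

```shell
# Hypothetical helper: succeed only if JAVA_HOME is set and contains an
# executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable found at $JAVA_HOME/bin/java" >&2
    return 1
  fi
  return 0
}
```

Running check_java_home && mvn --version in a new shell verifies both setup steps at once.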

Basic Maven usage

  1. To compile everything, type:

    mvn verify
  2. To compile the GATK but not Queue (much faster!), the command is:

    mvn verify -P\!queue

    Note that the ! needs to be escaped with a backslash to avoid interpretation by the shell.

  3. To obtain a clean working directory, type:

    mvn clean
  4. If you're used to using ant to compile the GATK, you should be able to feed your old ant commands to the ant-bridge.sh script in the root directory. For example:

    ./ant-bridge.sh test -Dsingle=MyTestClass

Setting up IntelliJ

  1. Run mvn test-compile in your git clone's root directory.

  2. Open IntelliJ

  3. File -> import project, select your git clone directory, then click "ok"

  4. On the next screen, select "import project from external model", then "maven", then click "next"

  5. Click "next" on the next screen without changing any defaults -- in particular:

    • DON'T check "Import maven projects automatically"
    • DON'T check "Create module groups for multi-module maven projects"
  6. On the "Select Profiles" screen, make sure private and protected ARE checked, then click "next".

  7. On the next screen, the "gatk-aggregator" project should already be checked for you -- if not, then check it. Click "next".

  8. Select the 1.7 SDK, then click "next".

  9. Select an appropriate project name (can be anything), then click "next" (or "finish", depending on your version of IntelliJ).

  10. Click "Finish" to create the new IntelliJ project.

  11. That's it! Due to Maven magic, everything else will be set up for you automatically, including modules, libraries, Scala facets, etc.

  12. You will see a popup "Maven projects need to be imported" on every IntelliJ startup. You should click import unless you're working on the actual pom files that make up the build system.

Created 2013-11-05 22:10:00 | Updated 2014-02-17 21:29:11 | Tags: private maven build broadies

Overview


We're replacing Ant with Maven. To build, run mvn verify.

Background

In the early days of the Genome Analysis Toolkit (GATK), the code base separated the GATK genomics engine from the core java utilities, encompassed in a wider project called Sting. During this time, the build tool of choice was the relatively flexible Java build tool Apache Ant, run via the command ant.

As our code base expanded to more and more packages, groups both internal and external to GSA and the Broad have expressed interest in using portions of Sting/GATK as modules in larger projects. Unfortunately, over time many parts of the GATK and Sting intermingled, producing the current situation where developers find it easier to copy the monolithic GATK, or individual java files, instead of using the tools as libraries.

The goal of this first stage is to split the parts of the monolithic Sting/GATK into easily recognizable sub artifacts. The tool used to accomplish this task is Apache Maven, also known as Maven, and run via the command mvn. Maven convention encourages developers to separate code, and accompanying resources, into a hierarchical structure of reusable artifacts. Maven attempts to avoid build configuration, preferring source repositories to lay out code in a conventional structure. When needed, a Maven configuration file called pom.xml specifies each artifact's build configuration, which one may think of as similar to an Ant build.xml.

The actual migration consisted of zero changes to the contents of existing Java source files, easing git merges and rebasing. The Java files from public, protected, and private have all moved into Maven conventional child artifacts, with each artifact containing a separate pom.xml.

Examples

Obtaining the GATK with Maven support

Clone the repository:

git clone ssh://git@github.com/broadinstitute/gsa-unstable.git
cd gsa-unstable

Building GATK and Queue

Clone the repository:

git clone ssh://git@github.com/broadinstitute/gsa-unstable.git
cd gsa-unstable

If running on a Broad server, add maven to your environment via the dotkit:

reuse Maven-3.0.3

Build all of Sting, including packaged versions of the GATK and Queue:

mvn verify

The packaged, executable jar files will be output to:

public/gatk-package/target/gatk-package-2.8-SNAPSHOT.jar
public/queue-package/target/queue-package-2.8-SNAPSHOT.jar

Find equivalent maven commands for existing ant targets:

./ant-bridge.sh <target> <properties>

Example output:

$ ./ant-bridge.sh fasttest -Dsingle=GATKKeyUnitTest
Equivalent maven command
mvn verify -Dsting.committests.skipped=false -pl private/gatk-private -am -Dresource.bundle.skip=true -Dit.test=disabled -Dtest=GATKKeyUnitTest
$

Running the GATK and Queue

To run the GATK, or copy the compiled jar, find the packaged jar under public/gatk-package/target

public/gatk-package/target/gatk-package-2.8-SNAPSHOT.jar

To run Queue, the jar is under the similarly named public/queue-package/target

public/queue-package/target/queue-package-2.8-SNAPSHOT.jar

NOTE: Unlike builds with Ant, you cannot execute the jar file built by the gatk-framework module. This is because Maven does not include dependent artifacts in the target folder with the assembled framework jar. Instead, use the packaged jars, listed above, that contain all the classes and resources needed to run the GATK or Queue.

Excluding Queue

NOTE: If you make changes to sting-utils, gatk-framework, or any other dependencies and disable queue, you may accidentally end up breaking the full repository build without knowing.

The Queue build accounts for most of the Sting project build time. To exclude Queue from your build, run maven with either -P\!queue (shown here already shell escaped) or -Ddisable.queue. Currently the latter property also disables the maven queue profile. This allows one other, semi-permanent way to exclude Queue from builds of the Sting repository: configure your local Maven settings to always pass the property -Ddisable.queue by adding and activating a custom profile in your local ~/.m2/settings.xml

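```
$ cat ~/.m2/settings.xml
<settings>
  <profiles>
    <profile>
      <id>disable.queue</id>
      <properties>
        <disable.queue>true</disable.queue>
      </properties>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>disable.queue</activeProfile>
  </activeProfiles>
</settings>
$
```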

Using the GATK framework as a module

Currently the GATK artifacts are not available via any centralized repository. To build code using the GATK you must still have a checkout of the GATK source code, and install the artifacts to your local mvn repository (by default ~/.m2/repository). The installation copies the artifacts to your local repo such that it may be used by your external project. The checkout of the local repo provides several artifacts under public/repo that will be required for your project.

After updating to the latest version of the Sting source code, install the Sting artifacts via:

mvn install
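To confirm that the install landed where an external project will look for it, it can help to know how Maven lays out the local repository: the dots in the group id become directories, followed by the artifact id, the version, and a jar named artifactId-version.jar. A minimal sketch follows. The function name local_artifact_path and the M2_REPO variable are hypothetical conveniences, not something Maven itself reads.

```shell
# Hypothetical helper: print where a locally installed jar should live.
# $1: group id, $2: artifact id, $3: version
# M2_REPO is an assumed override for the default ~/.m2/repository location.
local_artifact_path() {
    group_dir=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s/%s-%s.jar\n' \
        "${M2_REPO:-$HOME/.m2/repository}" "$group_dir" "$2" "$3" "$2" "$3"
}
```

For example, ls -l "$(local_artifact_path org.broadinstitute.sting gatk-framework 2.8-SNAPSHOT)" should list the framework jar once mvn install has finished.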

After the GATK has been installed locally, in your own source repository, include the artifact gatk-framework as a library.

In Apache Maven add this dependency:

```
<dependency>
    <groupId>org.broadinstitute.sting</groupId>
    <artifactId>gatk-framework</artifactId>
    <version>2.8-SNAPSHOT</version>
</dependency>
```

For Apache Ivy, you may need to specify ~/.m2/repository as a local repo. Once the local repository has been configured, ivy may find the dependency via:

<dependency org="org.broadinstitute.sting" name="gatk-framework" rev="2.8-SNAPSHOT" />

If you decide to also use Maven to build your project, your source code should go under the conventional directory src/main/java. The pom.xml contains any special configuration for your project. For an example pom.xml and maven conventional project structure, see:

public/external-example
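For orientation, here is a rough sketch of what scaffolding such a project by hand might look like. It is not a copy of public/external-example, which may configure more (for instance, the artifacts under public/repo mentioned earlier). The function name scaffold_project and the com.example coordinates are hypothetical; only the gatk-framework dependency comes from this document.

```shell
# Hypothetical helper: create a maven conventional skeleton in directory $1.
scaffold_project() {
    project_dir="$1"
    # Conventional location for Java sources.
    mkdir -p "$project_dir/src/main/java" || return 1
    # Minimal pom.xml; the project coordinates below are placeholders.
    cat > "$project_dir/pom.xml" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>my-gatk-tool</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.broadinstitute.sting</groupId>
            <artifactId>gatk-framework</artifactId>
            <version>2.8-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>
EOF
}
```

Running scaffold_project my-gatk-tool creates the directory layout and a starting pom.xml. Compare the result against public/external-example before relying on it.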

Moved directories

If you have an old git branch that needs to be merged, you may need to know where to move files in order for your classes to now build with Maven. In general, most directories were moved with minimal or no changes.

| Old directory | New maven directory |
|---|---|
| private/java/src/ | private/gatk-private/src/main/java/ |
| private/R/scripts/ | private/gatk-private/src/main/resources/ |
| private/java/test/ | private/gatk-private/src/test/java/ |
| private/testdata/ | private/gatk-private/src/test/resources/ |
| private/scala/qscript/ | private/queue-private/src/main/qscripts/ |
| private/scala/src/ | private/queue-private/src/main/scala/ |
| private/scala/test/ | private/queue-private/src/test/scala/ |
| protected/java/src/ | protected/gatk-protected/src/main/java/ |
| protected/java/test/ | protected/gatk-protected/src/test/java/ |
| public/java/src/ | public/gatk-framework/src/main/java/ |
| public/java/test/ | public/gatk-framework/src/test/java/ |
| public/testdata/ | public/gatk-framework/src/test/resources/ |
| public/scala/qscript/ | public/queue-framework/src/main/qscripts/ |
| public/scala/src/ | public/queue-framework/src/main/scala/ |
| public/scala/test/ | public/queue-framework/src/test/scala/ |
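When an old branch touches many files, the mapping above can be applied mechanically. The following is a minimal sketch, not a tool shipped with Sting: the function name map_old_path is hypothetical, and it only rewrites the directory prefix, so any file that needed more than a straight move still has to be handled by hand.

```shell
# Hypothetical helper: translate an old ant-era path ($1) to its maven location.
# Prints the new path, or returns 1 if no prefix from the table matches.
map_old_path() {
    while read -r old_dir new_dir; do
        case "$1" in
            "$old_dir"*)
                # Swap the old prefix for the new one, keeping the rest of the path.
                printf '%s%s\n' "$new_dir" "${1#"$old_dir"}"
                return 0
                ;;
        esac
    done <<'EOF'
private/java/src/ private/gatk-private/src/main/java/
private/R/scripts/ private/gatk-private/src/main/resources/
private/java/test/ private/gatk-private/src/test/java/
private/testdata/ private/gatk-private/src/test/resources/
private/scala/qscript/ private/queue-private/src/main/qscripts/
private/scala/src/ private/queue-private/src/main/scala/
private/scala/test/ private/queue-private/src/test/scala/
protected/java/src/ protected/gatk-protected/src/main/java/
protected/java/test/ protected/gatk-protected/src/test/java/
public/java/src/ public/gatk-framework/src/main/java/
public/java/test/ public/gatk-framework/src/test/java/
public/testdata/ public/gatk-framework/src/test/resources/
public/scala/qscript/ public/queue-framework/src/main/qscripts/
public/scala/src/ public/queue-framework/src/main/scala/
public/scala/test/ public/queue-framework/src/test/scala/
EOF
    return 1
}
```

For example, map_old_path public/java/src/org/broadinstitute/sting/utils/MannWhitneyU.java prints the same file's location under public/gatk-framework/src/main/java/.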

Future Directions

Further segregate source code

Currently, the artifacts sting-utils and the gatk-framework contain intertwined code bases. This leads to the current setup where all sting-utils code is actually found in the gatk-framework artifact, including generic utilities that could be used by other software modules. In the future, all elements under org.broadinstitute.sting.gatk will be located in the gatk-framework, while all other packages under org.broadinstitute.sting will be evaluated and then separated under the gatk-framework or sting-utils artifacts.

Publishing artifacts

Tangentially related to segregating sting-utils and the gatk-framework, the current Sting and GATK artifacts are ineligible to be pushed to the Maven Central Repository, due to several other issues:

  • Need to provide trivial workflow for Picard, and possibly SnpEff, to submit to central
  • Missing meta files for the jars:
    • *-sources.jar
    • *-javadoc.jar
    • *.md5
    • *.sha1
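Of the missing files, the two checksums are the simplest to produce by hand. The sketch below is one possible way to do it, assuming the GNU coreutils md5sum and sha1sum commands are available (other platforms name these tools differently) and that each checksum file should contain only the hex digest. The function name write_checksums is hypothetical.

```shell
# Hypothetical helper: write <jar>.md5 and <jar>.sha1 next to the jar in $1.
# Assumes GNU coreutils md5sum and sha1sum are on the PATH.
write_checksums() {
    checksum_jar="$1"
    [ -f "$checksum_jar" ] || return 1
    # Both tools print "<digest>  <file>"; keep only the digest.
    md5sum "$checksum_jar" | cut -d ' ' -f 1 > "$checksum_jar.md5"
    sha1sum "$checksum_jar" | cut -d ' ' -f 1 > "$checksum_jar.sha1"
}
```

A real release would more likely let the Maven build generate these files, along with the sources and javadoc jars.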

NOTE: Artifact jars do NOT need to actually be in Central, and may be available as pom reference only, for example Oracle ojdbc.

In the near term, we could use a private repository based on Artifactory or Nexus (comparison). After more work of adding, cleaning up, or centrally publishing all the dependencies for Sting, we may then publish into the basic Central repo. Or, we could move to a social service like BinTray (think GitHub vs. Git).

Status Updates

February 13, 2014

Maven is now the default in gsa-unstable's master branch. For GATK developers, the git migration is effectively complete. Software engineers are resolving a few remaining issues related to the automated build and testing infrastructure, but the basic workflow for developers should now be up to date.

January 30, 2014

The migration to maven has begun in the gsa-unstable repository on the ks_new_maven_build_system branch.

November 5, 2013

The maven port of the existing ant build resides in the gsa-qc repository.

This is an old branch of Sting/GATK, with the existing files relocated to Maven appropriate locations, pom.xml files added, along with basic resources to assist in artifact generation.


Created 2014-06-13 19:31:06 | Updated | Tags: build genotypegvcfs nightly
Comments (0)

Hi,

I was wondering if it is possible to obtain the source code that accompanies the nightly builds; looking at the protected github repository, it seems that the nightly builds are changing faster than the repository there.

My current issue is that I would like the "AD" annotation for homozygous referent samples coming from the GenotypeGVCFs tool. Reading the forum, I believe that this is currently fixed in the nightly builds. However, I have a couple of customizations (pending review) that I don't think are in the nightly builds just yet, so I would need the source to apply my changes and then build.

I realize that I'm asking for something VERY dangerous and completely unsupported; I'm just too darn impatient to wait for the next release :).

-John Wallace


Created 2013-12-20 15:12:59 | Updated 2013-12-20 15:53:02 | Tags: dependencies error build
Comments (4)

Hi, this took me a while to debug, so I'm posting the solution here. I started by downloading a clean copy of GATK core platform from GitHub. When I first tried building by running ant, I got the compiler errors below. The reason turned out to be that an unrelated jar (gsea2-2.0.12.jar) was on my CLASSPATH (this is from another Broad tool I've been using - Gene Set Enrichment Analysis). gsea2-2.0.12.jar apparently contains outdated versions of apache math and io packages which conflict with the GATK versions. Taking this jar off my CLASSPATH fixed the issue.

-Ben

Ps. the compiler errors were:

gatk.compile.internal.source:
    [javac] Compiling 681 source files to /prog/GATK/gatk_platform_git/build/java/classes
    [javac] /prog/GATK/gatk_platform_git/public/java/src/org/broadinstitute/sting/commandline/ParsingEngine.java:260: error: incompatible types
    [javac]         for (String line: FileUtils.readLines(file))
    [javac]                                              ^
    [javac]   required: String
    [javac]   found:    Object
    [javac] /prog/GATK/gatk_platform_git/public/java/src/org/broadinstitute/sting/utils/MannWhitneyU.java:50: error: no suitable constructor found for NormalDistributionImpl(double,double,double)
    [javac]     private static NormalDistribution APACHE_NORMAL = new NormalDistributionImpl(0.0,1.0,1e-2);
    [javac]                                                       ^
    [javac]     constructor NormalDistributionImpl.NormalDistributionImpl() is not applicable
    [javac]       (actual and formal argument lists differ in length)
    [javac]     constructor NormalDistributionImpl.NormalDistributionImpl(double,double) is not applicable
    [javac]       (actual and formal argument lists differ in length)
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 2 errors

BUILD FAILED
/prog/GATK/gatk_platform_git/build.xml:454: Compile failed; see the compiler error output for details.