Tagged with #refseq
1 documentation article | 0 announcements | 2 forum discussions

Created 2012-08-11 06:48:21 | Updated 2015-05-16 06:49:47 | Tags: inputs refseq

Comments (10)

1. About the RefSeq Format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of ReqSeq; we've chosen to use the table dump available from the UCSC genome table browser.

3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options, here are the changes that you'll need to make:

clade:    Mammal
genome:   Human
assembly: ''choose the appropriate assembly for the reference you're using''
group:    Genes abd Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene
region:   ''choose the genome option''

Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted, and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order. This can be done with a combination of grep, sort, and a script called sortByRef.pl that is available here.

4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq

Using the filename from above.


The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)

All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

No articles to display.

Created 2015-04-24 10:04:20 | Updated 2015-04-24 10:11:27 | Tags: depthofcoverage genenamesinterval refseq depthcoverage calculatecoverageovergenes

Comments (2)

Hello , I ve been trying to write a script for calculating coverage per gene,unsuccessfully(!) ,and I found now that is nicely done by GATK ! I would very much need to use this calculation of depthOfCoverage for each gene but I cannot find the geneList needed in the format explained here. I have a RefSeq gene list downloaded from UCSC table which contains RefSeq name ,cds_start & end and "chr" information. Is this acceptable? I want to do it for exons falling inside the genes, which I have downloaded from : ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/ (phase3), so this would be my Intervals List. It contains only chr info and start-end. I have also calculated for my bam files the bedtools-genomecov with option of "bedgraph" ,so I wrote a script to calculate mean coverage for each exon whose reads fall onto. Is the calculation of DepthOfCoverage done in the same principle ? Moreover I cannot find in UCSC a table which combines RefSeq name with used gene Name. Is it combined in this genesList you provide from GATK? Can you guide me where I could find the exact url for /humgen/.../geneList.txt or if mine could work ,and if exons table is ok with only these 3 columns ? I m a registered member, as for writing in the forum. Is there any extra procedure needed to access your database ? Thanks in advance !

Created 2012-10-22 17:31:20 | Updated 2012-10-22 18:22:28 | Tags: depthofcoverage refseq

Comments (3)


I am learning to use the DepthofCoverage function to obtain the gene coverage information for a collection of bacterial contigs that were mapped with metagenomic reads. The original post introducing this function is here: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have#latest

In the post, you mentioned the gene list, as follow:

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,

I have three inquiries:

  1. Can you please provide headers to the values in each column?
  2. I am working with bacterial genomic contigs, can you please specify what basic information is needed for a gene list (e.g., name of contig, name of gene, location of gene in the contig, from... to ..., etc.)?

Thanks so much!