hORFeome V8.1 Library

This site provides annotation for the human ORFs in the hORFeome V8.1 Libraries.

These libraries are introduced in the 2011 Nature Methods paper: "A public genome-scale lentiviral expression library of human ORFs." (DOI: 10.1038/nmeth.1638).  Supplementary Table 4, containing a summarized view of the contents of the library, is available in two formats (pdf, xls).  For protocols detailing the use of this library, see the Protocols section found below.

If you want clones from these libraries:

Please DO NOT request clones from this site. This site is intended only to provide clone annotation information for users of these libraries. For information on how to obtain materials, please refer to orfeomecollaboration.org.

Protocols:

For additional protocols not found at our TRC Library Database, please refer to the online Methods section of the paper.

hORFeome V8.1 Library Clone Annotations:

The link below provides the latest information on the composition of the hORFeome V8.1 Entry and Expression Libraries produced at the Broad Institute in collaboration with the Dana-Farber Cancer Institute Center for Cancer Systems Biology (CCSB). This annotation will be updated and revised periodically:

Download ORF Clone Annotations (txt.gz: 6MB download, 37MB uncompressed, updated 6/6/2012)

Some background and general comments on the annotation of these Open Reading Frame libraries:

The major challenge in annotating these ORFs is that a substantial minority of these ORFs (and the MGC cDNAs from which they were made) do not perfectly match one unique NCBI RefSeq transcript.  When the correspondence between a RefSeq and ORF sequence is imperfect, questions arise about the proper mapping of the ORF to an active transcript, particularly when the sequence similarity is lower or when there is more than one RefSeq transcript that is similar to the ORF.

Where ORF and RefSeq sequences match imperfectly, the question arises of how one should employ these ORFs and interpret their effects. The answer is not clear, and we leave it to you, the user, to decide. Here, we attempt to assist library users by providing information about both perfect and partial matches between library ORFs and RefSeq. We will continue to update this annotation as the annotation and underlying sequences in NCBI RefSeq evolve.

In the accompanying file:

  1. Each row in the file represents a distinct ORF in the library collections. The Entry and the Expression vector clones for a given ORF are provided in the same row (identifiers provided in columns A and B, respectively). The sequencing data was collected on the Entry clones.
  2. Three categories of annotation are provided in this file:
    1. Empirical sequence data (when available, columns C, D). We have full assembled sequences for 14,524 out of the 16,172 clones.
    2. Mapping of the ORF sequences to the NCBI RefSeq database (columns E-J).
    3. Mapping of the ORF sequences to the sequence annotation provided by the source of the original materials (usually MGC) (columns K-N).
  3. Of the clones with full sequence data, most (85%) exhibit a perfect match to the sequence provided by the original cDNA source (usually MGC). For the remaining 15%, the sequence differences range from minor and perhaps insignificant (e.g. one or a few synonymous changes) to major (e.g. multiple non-synonymous mutations, a frame shift, or a major truncation). These differences are described in the "SEQ DIFFERENCES v PRIOR SEQ" column.
  4. The MGC collection of cDNAs, and therefore also this new ORF library which largely originates from MGC clones, does not line up perfectly with the NCBI RefSeq annotation of the transcriptome.
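
The column layout described in the list above can be sketched in Python. This is only a minimal sketch under stated assumptions: it assumes the download is a gzip-compressed, tab-delimited text file with a single header row, and that the spreadsheet letters (A through N) correspond to field positions in each row. The names `COLUMN_GROUPS`, `split_row`, and `read_annotation` are hypothetical helpers introduced for illustration and are not part of any library tooling.

```python
import csv
import gzip

# Column groups as described in the list above (spreadsheet letters).
# The group names are hypothetical labels chosen for this sketch.
COLUMN_GROUPS = {
    "clone_ids": ("A", "B"),           # Entry and Expression clone identifiers
    "empirical_sequence": ("C", "D"),  # empirical sequence data, when available
    "refseq_mapping": ("E", "J"),      # mapping to NCBI RefSeq
    "source_mapping": ("K", "N"),      # mapping to original source annotation
}


def letter_to_index(letter):
    """Convert a single spreadsheet column letter (A-Z) to a 0-based index."""
    return ord(letter.upper()) - ord("A")


def split_row(row):
    """Slice one row (a list of fields) into the column groups above."""
    groups = {}
    for name, (first, last) in COLUMN_GROUPS.items():
        groups[name] = row[letter_to_index(first):letter_to_index(last) + 1]
    return groups


def read_annotation(path_or_file, has_header=True):
    """Yield one dict of column groups per ORF row.

    Assumption: the file is gzip-compressed and tab-delimited.
    """
    with gzip.open(path_or_file, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header line
        for row in reader:
            if row:  # ignore completely empty lines
                yield split_row(row)
```

Reading the file as a stream keeps memory use low for the 37MB uncompressed file. If the real file uses a different delimiter or lacks a header row, the `delimiter` and `has_header` settings would need to change accordingly.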

The majority of the ORFs in these libraries align nearly perfectly to the full open reading frame of at least one, and usually exactly one, RefSeq transcript (70% of fully sequenced clones have a 99% match to RefSeq and are > 99% full length). As indicated above, this still leaves a significant minority of clones in this library (and their parent MGC cDNAs) that do not have a full-length, high-homology match to a RefSeq open reading frame, and it is currently an open question how to use ORFs that have no clear match to RefSeq. About half of the ORFs in this category have one or more clear but partial matches to RefSeq (at least a 90% match over at least 50% of the length of the clone or intended target CDS), and these are noted in the "Lower Match" columns of the data file. Some of these will still be quite good (e.g. 98% homologous full-length matches), while others may be too truncated or frame-shifted, etc.
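
The thresholds quoted above can be expressed as a small classification rule. The sketch below is illustrative only: the function name `classify_match` and its two inputs (percent match to RefSeq and percent of full length covered) are hypothetical, and it assumes that "a 99% match" means at least 99%. It is not a description of how the annotation file itself was produced.

```python
def classify_match(percent_match, percent_full_length):
    """Bucket a clone by its best RefSeq alignment.

    Thresholds come from the paragraph above; the inputs and labels are
    assumptions made for this sketch.
    """
    # Near-perfect: 99% match to RefSeq and > 99% full length.
    if percent_match >= 99.0 and percent_full_length > 99.0:
        return "near-perfect"
    # Lower match: at least a 90% match over at least 50% of the length.
    if percent_match >= 90.0 and percent_full_length >= 50.0:
        return "lower match"
    return "no clear match"


# Hypothetical example values, not taken from the annotation file.
examples = [(99.5, 99.8), (98.0, 100.0), (95.0, 40.0)]
labels = [classify_match(match, length) for match, length in examples]
```

Under these assumptions, a 98% homologous full-length match lands in the "lower match" bucket, which is consistent with the remark that some lower matches are still quite good.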

In summary, the knowledge of the human transcriptome at this moment is far from perfect. Some ORFs have no close alignments with RefSeq, but may represent active species that are actually present in human cells or at least a functionally equivalent form, while others are probably artifacts of the process of creating the library. Only more comprehensive and accurate annotation of the human transcriptome will tell which are which. 

Change Log
2011-06-24: Initial version
2011-07-28: Fixed 67 empirical sequences which had been truncated to 4000 bases
2011-09-30: Removed 49 ORFs which had failed virus production due to vector sequence errors; updated transcriptome alignments
2012-06-06: Updated transcriptome alignments