Non-redundant TF motif matches genome-wide

Overview

The redundancy and overlap between various motif model databases complicate downstream analysis and interpretation. We computed the pairwise similarity for >2,000 motif models determined for both human and mouse TFs and clustered them into 286 distinct motif clusters. Within each cluster we aligned motifs to each other to generate an archetypal consensus motif.

Motif clustering workflow

We scanned the entire genome using each motif model (all 2,179 models) and then repositioned the coordinates of each genomic match according to each motif's position relative to its archetypal model. After adjusting the coordinates of each motif match, we removed all duplicates (same position and motif cluster), retaining a single match per cluster at any given position in the genome.
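The repositioning and de-duplication steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `ALIGNMENT` table, its `offset` and `width` fields, the `EXAMPLE_MODEL_B` entry, and the function names are all hypothetical. The `RREB1_MA0073.1` entry is chosen so that the example row of the full-scan format (chr1:10003-10023) maps onto the example row of the archetype-scan format (chr1:10005-10022). The sketch ignores models that align to their archetype in reverse-complement orientation.

```python
from collections import defaultdict

# Hypothetical alignment table. "offset" is the archetype column at which the
# model's first column sits (negative if the model overhangs the archetype);
# "width" is the archetype width. Values are illustrative assumptions.
ALIGNMENT = {
    "RREB1_MA0073.1": {"cluster": "KLF/SP/2", "offset": -2, "width": 17},
    "EXAMPLE_MODEL_B": {"cluster": "KLF/SP/2", "offset": 0, "width": 17},
}


def reposition(contig, start, end, model, score, strand):
    """Map a model match (0-based, half-open) onto archetype coordinates."""
    aln = ALIGNMENT[model]
    if strand == "+":
        new_start = start - aln["offset"]
        new_end = new_start + aln["width"]
    else:
        # On the minus strand the model's first column lies at the genomic end.
        new_end = end + aln["offset"]
        new_start = new_end - aln["width"]
    return (contig, new_start, new_end, aln["cluster"], score, strand, model)


def collapse(matches):
    """Keep one record per (contig, start, end, cluster): the best-scoring
    match, plus the number of distinct models that hit that position."""
    groups = defaultdict(list)
    for match in matches:
        groups[match[:4]].append(match)
    collapsed = []
    for key in sorted(groups):
        group = groups[key]
        best = max(group, key=lambda m: m[4])
        num_models = len({m[6] for m in group})
        collapsed.append(best + (num_models,))
    return collapsed


repositioned = [
    reposition("chr1", 10003, 10023, "RREB1_MA0073.1", 3.7536, "+"),
    reposition("chr1", 10005, 10022, "EXAMPLE_MODEL_B", 2.0, "+"),
]
for record in collapse(repositioned):
    print("\t".join(str(field) for field in record))
```

Both toy matches land on the same archetype interval, so they collapse into a single record whose last two fields name the best-scoring model and the number of contributing models.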

Browsershot of motif UCSC track

Some key advantages of this approach:

  • Allows sharing of information about TFBSs across many models. Any match to an individual model is a match to the cluster.
  • No information is lost about individual motif model matches. The individual motif matches are fully represented in each cluster archetype match.
  • Agnostically identifies errors and mislabeling within motif databases

Some notable caveats/shortcomings that should be considered:

  • No optimization on the bit-score/p-value threshold for each motif model
  • Cluster tree is cut at an arbitrary height (chosen by the look and feel of the clusters)
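To illustrate the second caveat, the sketch below shows how the choice of cut height changes the number of clusters. It is only an illustration: the motif names and distances are made up, and it uses single linkage (where cutting the tree at a height is equivalent to joining every pair of items whose distance is at or below that height), which is not necessarily the linkage used for this resource.

```python
def cut_single_linkage(names, distance, height):
    """Return the clusters obtained by cutting a single-linkage tree at
    `height`, computed with union-find over all pairs within that distance."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance[frozenset((a, b))] <= height:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical motifs and pairwise distances (0 = identical).
NAMES = ["m1", "m2", "m3", "m4"]
DISTANCE = {frozenset(pair): 0.8 for pair in
            [("m1", "m3"), ("m1", "m4"), ("m2", "m3"), ("m2", "m4")]}
DISTANCE[frozenset(("m1", "m2"))] = 0.1
DISTANCE[frozenset(("m3", "m4"))] = 0.2

for height in (0.05, 0.5, 0.9):
    clusters = cut_single_linkage(NAMES, DISTANCE, height)
    print(height, len(clusters), clusters)
```

The same four toy motifs yield four, two, or one cluster depending on the height, which is why the cut point is worth inspecting when interpreting the 286 clusters.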

Included motif databases (v1.0)

JASPAR 2018: Models derived from published and experimentally defined transcription factor binding sites for eukaryotes. The “CORE vertebrates” non-redundant PFMs were included.

Taipale 2013 HT-SELEX: Models derived from high-throughput SELEX experiments using bacterially expressed human and mouse DNA-binding domains. See Table S3 from publication.

HOCOMOCO version 11: Models derived from ChIP-seq data (680 human + 452 mouse TFs)

Cluster visualization

Visit this page to view the 286 clustered motif models (v1.0).

Browser tracks

To visualize the motif matches in the genome, I have created a track hub for the UCSC Genome Browser. You can load this track by navigating on any UCSC browser instance to “My Data → Track Hubs” and then pasting in the following URL:

https://resources.altius.org/~jvierstra/projects/motif-clustering/releases/v1.0/hub.txt

Alternatively, you can click here to automatically load this trackhub at the Genome Browser hosted at UCSC.

Downloads

v1.0 (2020-4-14)

Clustering of 2179 motif models (3 databases above)

  • Motif clustering annotations (excel file)
  • Genome-wide motif scans

    Genome build    Full scans                Archetype scans
    Human (hg38)    bed.gz + tabix, bigBed    bed.gz + tabix, bigBed
    Mouse (mm10)    bed.gz + tabix, bigBed    bed.gz + tabix, bigBed

    Note: These files are massive (>40 GB). We have provided a tabix index alongside each file to facilitate remote access using tabix:

    [jvierstra@test0 $] tabix https://resources.altius.org/~jvierstra/projects/motif-clustering/releases/v1.0/hg38.archetype_motifs.v1.0.bed.gz chr19:45,001,882-45,002,279
    chr19	45001871	45001883	ZNF306	4.4101	-	ZNF306_C2H2_1	1
    chr19	45001891	45001907	TBX/1	7.8289	+	BRAC_HUMAN.H11MO.0.A	1
    chr19	45001896	45001916	GC-tract	8.4712	+	ZN467_HUMAN.H11MO.0.C	1
    chr19	45001897	45001914	KLF/SP/2	13.2278	+	PATZ1_HUMAN.H11MO.0.C	5
    chr19	45001898	45001908	HD/12	8.7818	+	PBX3_MA1114.1	1
    ...
    

File format descriptions

Full scans

  #  Column       Example               Description
  1  contig       chr1                  Chromosome
  2  start        10003                 Start position (0-based)
  3  stop         10023                 End position
  4  motif        RREB1_MA0073.1        Motif model name
  5  match_score  3.75359698976         MOODS match score
  6  strand       +                     Matching strand
  7  seq          CCCTAACCCTAACCCTAACC  Genomic sequence of match
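A quick way to sanity-check the coordinate convention is to compare the interval length with the reported sequence. The sketch below assumes the file is tab-delimited and uses the example values from the table above; the function name is illustrative.

```python
def parse_full_scan(line):
    """Split one full-scan record into typed fields (assumes tab-delimited)."""
    contig, start, stop, motif, score, strand, seq = line.rstrip("\n").split("\t")
    return {
        "contig": contig,
        "start": int(start),
        "stop": int(stop),
        "motif": motif,
        "match_score": float(score),
        "strand": strand,
        "seq": seq,
    }


EXAMPLE = "chr1\t10003\t10023\tRREB1_MA0073.1\t3.75359698976\t+\tCCCTAACCCTAACCCTAACC"
row = parse_full_scan(EXAMPLE)

# With a 0-based start and an exclusive end, stop - start is the match length.
print(row["stop"] - row["start"], len(row["seq"]))
```

For the example row both numbers are 20, which is consistent with BED-style half-open intervals.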

Archetype scans

  #  Column         Example         Description
  1  contig         chr1            Chromosome
  2  start          10005           Start position (0-based)
  3  stop           10022           End position
  4  motif_cluster  KLF/SP/2        Motif cluster name
  5  match_score    3.7536          MOODS match score for best cluster match
  6  strand         +               Matching strand
  7  best_model     RREB1_MA0073.1  Best matching motif model from cluster
  8  num_models     1               Number of motif models from the cluster with a match
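As a usage sketch, archetype-scan records such as those returned by the tabix query in the Downloads section can be parsed into named fields and filtered. The record type and function names below are illustrative, and the input is assumed to be tab-delimited; the example lines are copied from the tabix output shown earlier.

```python
from collections import namedtuple

ArchetypeMatch = namedtuple(
    "ArchetypeMatch",
    "contig start stop motif_cluster match_score strand best_model num_models",
)


def parse_archetype_line(line):
    """Convert one archetype-scan record into a typed ArchetypeMatch."""
    f = line.rstrip("\n").split("\t")
    return ArchetypeMatch(f[0], int(f[1]), int(f[2]), f[3], float(f[4]),
                          f[5], f[6], int(f[7]))


EXAMPLE_LINES = [
    "chr19\t45001871\t45001883\tZNF306\t4.4101\t-\tZNF306_C2H2_1\t1",
    "chr19\t45001891\t45001907\tTBX/1\t7.8289\t+\tBRAC_HUMAN.H11MO.0.A\t1",
    "chr19\t45001896\t45001916\tGC-tract\t8.4712\t+\tZN467_HUMAN.H11MO.0.C\t1",
    "chr19\t45001897\t45001914\tKLF/SP/2\t13.2278\t+\tPATZ1_HUMAN.H11MO.0.C\t5",
    "chr19\t45001898\t45001908\tHD/12\t8.7818\t+\tPBX3_MA1114.1\t1",
]

matches = [parse_archetype_line(line) for line in EXAMPLE_LINES]

# Archetype matches supported by more than one model in the cluster.
multi_model = [m for m in matches if m.num_models > 1]
for m in multi_model:
    print(m.motif_cluster, m.best_model, m.num_models)
```

The same parser can be applied line by line to `sys.stdin` when the output of `tabix` is piped into a script.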

Code

Software and scripts are available at GitHub to reproduce the motif clustering.

Notes

Motif scans were performed with MOODS:

python2 moods_dna.py  \
	--sep ";" -s h38.fa --p-value 0.0001 \
	--lo-bg 2.977e-01 2.023e-01 2.023e-01 2.977e-01 \
	-m ${PFM_FILE} -o ${OUTFILE}
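For intuition about what the background frequencies in the command above do, here is a minimal sketch of log-odds scoring against that background. It is not MOODS itself: the count matrix is a toy example, and the A, C, G, T ordering of the background values, the pseudocount, and the natural-log base are assumptions made for illustration.

```python
import math

# Background frequencies from the command above, assumed to be in A, C, G, T order.
BACKGROUND = dict(zip("ACGT", (0.2977, 0.2023, 0.2023, 0.2977)))


def log_odds(counts, background=BACKGROUND, pseudocount=0.01):
    """Convert per-position base counts into a log-odds matrix.
    The pseudocount value and the natural-log base are assumptions."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount
        matrix.append({
            base: math.log(
                (column[base] + pseudocount * background[base]) / total
                / background[base]
            )
            for base in "ACGT"
        })
    return matrix


def score(matrix, seq):
    """Sum the log-odds of each base of `seq` at its position."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must equal motif width")
    return sum(column[base] for column, base in zip(matrix, seq))


# Toy three-column count matrix (hypothetical, not from any database).
TOY_COUNTS = [
    {"A": 18, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 19, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 19, "T": 0},
]
TOY_MATRIX = log_odds(TOY_COUNTS)

print(score(TOY_MATRIX, "ACG"), score(TOY_MATRIX, "TTT"))
```

A sequence matching the motif scores above zero and a poor match scores below it. The `--p-value` threshold then keeps only matches whose score is unlikely under the background model.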

Citation

If you use this resource in your research, please kindly cite:

Vierstra, J., Lazar, J., Sandstrom, R. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).

Additional resources

Admittedly, I am far from the first person to cluster motif models. In any case, this resource is/was largely intended to aid in visualizing and “surfing” the genome browser and to accompany other annotations (e.g., ChIP-seq, DGF, etc.).

Below I have compiled a list of publications, websites and databases pertaining to the curation of motif models which may be of use. If you find something useful or if I have missed something, please contact me and I will include it here.

Clustering:

Motif curation: