Non-redundant TF motif matches genome-wide

Last updated: May 15, 2020

Overview

The redundancy and overlap between various motif model databases complicate downstream analysis and interpretation. We computed the pairwise similarity for >2,000 motif models determined for both human and mouse TFs and clustered them into 286 distinct motif clusters. Within each cluster we aligned motifs to each other to generate an archetypal consensus motif.

We scanned the entire genome using each motif model (all 2,179 models), and then repositioned the coordinates of each genomic match according to each motifs relative position to its archetypal model. After adjusting the coordinates of each motif match, we remove all duplicates (same position and motif cluster), retaining a single match at any position in the genome.

Some key advantages of the this approach:

Allow sharing of information about TFBS from many models. Any match to an individual model is a match to the cluster
No information is lost about individual motif model matches. The individual motif matches are fully represented in each cluster archetype match
Agnostically identifies errors and mislabeling within motif databases

Some notable caveats/shortcomings that should be considered:

No optimization on the bit-score/p-value threshold for each motif model
Cluster tree is cut at an abitrary height (chosen by the look and feel of the clusters)

Included motif databases (v1.0)

JASPAR 2018: Models derived from published and experimentally defined transcription factor binding sites for eukaryotes. The “CORE vertabrates” non-redundant PFMs were included.

Motif models [MEME-format]
Motif weblogos [EPS-format]

Taipale 2013 HT-SELEX: Models derived from high-throughput SELEX experiments using bacterially expressed human and mouse DNA-binding domains. See Table S3 from publication.

Motif models [MEME-format]
Motif weblogos [EPS-format]

HOCOMOCO version 11: Models derived from ChIP-seq data (680 human + 452 mouse TFs)

Motif models [Human: MEME-format, Mouse: MEME-format]
Motif weblogos: [EPS-format]

Cluster visualization

Visit this page to view the 286 clustered motif models (v1.0).

Browser tracks

To visualize the motif matches in the genome, I have created a trackhub for the UCSC Genome Browser. You can load this track by navigating on any UCSC browser instance to “Data → Trackhubs” and then copying and pasting the following URL:

https://resources.altius.org/~jvierstra/projects/motif-clustering/releases/v1.0/hub.txt

Alternatively, you can click here to automatically load this trackhub at the Genome Browser hosted at UCSC.

Downloads

v1.0 (2020-4-14)

Clustering of 2179 motif models (3 databases above)

Motif clustering annotations (excel file)

Genome-wide motif scans

Genome build	Full scans	Archetype scans
Human (hg38)	bed.gz + tabix, bigBed	bed.gz + tabix, bigBed
Mouse (mm10)	bed.gz + tabix, bigBed	bed.gz + tabix, bigBed

Note: Theses files are massive (>~40Gb). We have provided a TABIX-index alongside to facilitate remote access using tabix:

[jvierstra@test0 $] tabix https://resources.altius.org/~jvierstra/projects/motif-clustering/releases/v1.0/hg38.archetype_motifs.v1.0.bed.gz chr19:45,001,882-45,002,279
chr19	45001871	45001883	ZNF306	4.4101	-	ZNF306_C2H2_1	1
chr19	45001891	45001907	TBX/1	7.8289	+	BRAC_HUMAN.H11MO.0.A	1
chr19	45001896	45001916	GC-tract	8.4712	+	ZN467_HUMAN.H11MO.0.C	1
chr19	45001897	45001914	KLF/SP/2	13.2278	+	PATZ1_HUMAN.H11MO.0.C	5
chr19	45001898	45001908	HD/12	8.7818	+	PBX3_MA1114.1	1
...

File format descriptions

Full scans

	Column	Example	Description
1	`contig`	chr1	Chromosome
2	`start`	10003	Start position (0-based)
3	`stop`	10023	End position
4	`motif`	RREB1_MA0073.1	Motif model name
5	`match_score`	3.75359698976	MOODS match score
6	`strand`	+	Matching strand
7	`seq`	CCCTAACCCTAACCCTAACC	Genomic sequence of match

Archetype scans

	Column	Example	Description
1	`contig`	chr1	Chromosome
2	`start`	10005	Start position (0-based)
3	`stop`	10022	End position
4	`motif_cluster`	KLF/SP/2	Motif cluster name
5	`match_score`	3.7536	MOODS match score for best cluster match
6	`strand`	+	Matching strand
7	`best_model`	RREB1_MA0073.1	Best matching motif model from cluster
8	`num_models`	1	Number of motif model from cluster with a match

Code

Software and scripts are available at GitHub to reproduce the motif clustering.

Notes

Motifs scans were performed with MOODS

python2 moods_dna.py  \
	--sep ";" -s h38.fa --p-value 0.0001 \
	--lo-bg 2.977e-01 2.023e-01 2.023e-01 2.977e-01 \
	-m ${PFM_FILE} -o ${OUTFILE}

Citation

If you use this resource in your research, please kindly cite:

Vierstra2020 Vierstra, J., Lazar, J., Sandstrom, R. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020)

Additional resources

Admittedly, I am far from the first person to cluster motif models. In any case, this resource is/was largely intended to aid in visualizing and “surfing” the genome browser and to accompany other annotations (e.g., ChIP-seq, DGF, etc.).

Below I have compiled a list of publications, websites and databases pertaining to the curation of motif models which may of be of use. If you find something useful or if I have missed something, please contact me and I will include it here.

Clustering:

RSAT: Clustering methodology used as part of JASPAR project (publication link)
TOMTOM: Software tools used here for clustering (publication link)

Motif curation:

JASPAR
HOCOMOCO
CisBP (Tim Hughes’ group)
Lambert et al.: “The human transcription factors” (Supplementary Table 1)
UniProbe (Martha Bulyk)