PWMs, PSAMs, FSAMs, and pHMMs

The affinity models supported by the ADB fall into two major categories:

Fixed-Length Affinity Models:

Fixed-length models include Position Weight Matrices (PWMs), Position Specific Affinity Matrices (PSAMs), and Feature Specific Affinity Models (FSAMs, also called dinucleotide PSAMs). PWMs model occupancy, that is, the probability of binding, while PSAMs and FSAMs model affinity. For more information about occupancy vs. affinity, click here. PWMs and PSAMs assume that each position in the binding site contributes independently, while FSAMs model dependencies between nearest-neighbor positions.

Advantages: Simple, easy to understand, few parameters to train, and rapid scoring of long sequences.

Disadvantages: Cannot model variable-length motifs that arise from variable-length spacers or multiple modes of binding.
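As a rough illustration of why fixed-length models score long sequences quickly, the sketch below slides a small PSAM along a sequence and multiplies the per-position relative affinities in each window. The matrix values, the function names (`window_affinity`, `scan`), and the example sequence are all hypothetical and are not taken from the ADB.

```python
# Hypothetical 3-position PSAM. Each entry holds the relative affinity of
# A, C, G, T at one position (best base = 1.0). The values are made up.
PSAM = [
    {"A": 1.0, "C": 0.1, "G": 0.2, "T": 0.05},
    {"A": 0.3, "C": 1.0, "G": 0.1, "T": 0.1},
    {"A": 0.05, "C": 0.2, "G": 1.0, "T": 0.4},
]


def window_affinity(psam, site):
    """Relative affinity of one site: the product of per-position weights.

    Multiplying the weights is exactly the positional-independence assumption.
    """
    affinity = 1.0
    for weights, base in zip(psam, site):
        affinity *= weights[base]
    return affinity


def scan(psam, sequence):
    """Score every window of the model's width along the sequence."""
    width = len(psam)
    return [
        window_affinity(psam, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]


scores = scan(PSAM, "TTACGAA")
# Index of the highest-affinity window.
best = max(range(len(scores)), key=scores.__getitem__)
```

Each window costs one multiplication per position, so the scan is linear in sequence length. The fixed window width is also the limitation noted above: a site with an extra base between two positions cannot be aligned to the matrix.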

Variable-Length Affinity Models:

Variable-length motifs are modeled in the ADB using profile Hidden Markov Models (pHMMs). pHMMs contain many additional parameters on top of a PWM or PSAM. These additional parameters model tolerated insertions and deletions (indels) between positions in the consensus binding motif, as well as spacers of different lengths between two half-sites. By default, pHMMs are completely probabilistic. However, the ADB also supports generalized, affinity-based pHMMs (also called Boltzmann Chains) that include a free protein concentration parameter in order to model binding saturation. If enough binding data are available and the correct pHMM topology is used, pHMMs can have greater predictive accuracy than PWMs/PSAMs when modeling variable-length motifs.

Advantages: Model tolerated nucleotide insertions and deletions within a binding site, as well as variable-length spacers between half-sites.

Disadvantages: These more complex models contain many more parameters, which require more binding data to train properly. Finding the optimal pHMM also requires training many models with different HMM topologies. Scoring sequences requires dynamic programming methods, which are slower than the scoring used for fixed-length models.
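The sketch below is a deliberately simplified stand-in for a pHMM, not the ADB's implementation: it handles only a variable-length spacer between two half-sites by summing over the tolerated spacer lengths, and it ignores indels within the half-sites. The half-site matrices, the spacer weights, and the `occupancy` function (a standard single-site saturation curve, with concentration expressed relative to the best site's dissociation constant) are assumptions made for illustration.

```python
# Hypothetical half-site models (relative affinities; values are made up).
LEFT = [
    {"A": 1.0, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 1.0, "G": 0.1, "T": 0.1},
]
RIGHT = [
    {"A": 0.1, "C": 0.1, "G": 1.0, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 1.0},
]
# Hypothetical weight for each tolerated spacer length (in bases).
SPACER_WEIGHT = {0: 0.2, 1: 1.0, 2: 0.5}


def half_site_affinity(model, site):
    """Product of per-position relative affinities for one half-site."""
    affinity = 1.0
    for weights, base in zip(model, site):
        affinity *= weights[base]
    return affinity


def gapped_affinity(sequence, start):
    """Sum the relative affinity over all tolerated spacer lengths.

    The left half-site begins at `start`; the right half-site follows
    after a spacer of 0, 1, or 2 bases.
    """
    left_end = start + len(LEFT)
    if left_end > len(sequence):
        return 0.0
    left = half_site_affinity(LEFT, sequence[start:left_end])
    total = 0.0
    for spacer, weight in SPACER_WEIGHT.items():
        right_start = left_end + spacer
        right_end = right_start + len(RIGHT)
        if right_end > len(sequence):
            continue  # this spacer length runs off the end of the sequence
        right = half_site_affinity(RIGHT, sequence[right_start:right_end])
        total += left * weight * right
    return total


def occupancy(relative_affinity, concentration):
    """Saturating occupancy for a single site (assumed form: x / (1 + x))."""
    x = concentration * relative_affinity
    return x / (1.0 + x)


site_affinity = gapped_affinity("ACTGT", 0)
half_saturated = occupancy(1.0, 1.0)
```

Even this reduced model needs one extra parameter per spacer length and an inner loop over alignments, which hints at why full pHMMs need more training data and slower scoring. The `occupancy` function shows the role of a concentration parameter: predicted binding levels off as concentration rises instead of growing without bound.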

With so many different affinity model types and different sources of binding data, it can be difficult to choose the best affinity model for a given protein. The overall goal of the ADB is to help researchers use the most accurate affinity model with the optimal number of parameters, given both the quality of the binding data and the inherent binding properties of the protein or micro-RNA. Tools scheduled for the next release will include cross-validation features to calculate and compare the accuracies of the different models. In addition, the In Vivo Comparison Browser Prototype demonstrates a new tool planned for the next release that will allow scientists to test the accuracy of the affinity models by comparing estimated affinities with in vivo occupancy measurements.