Last update: September 2021

The downloadable resources of the BIAPSS database contain raw data calculated with well-established tools, together with the results of our deep statistical analysis, which describe the overall characteristics of LLPS sequences.

Statistics on LLPS Set

To get a broad overview of residue-resolution biophysical characteristics for all reported LLPS protein sequences.

CHARGE DECORATION
charge decoration patterns, charge decoration parameters (FCR, OCS, SCD)

AA COMPOSITION
amino acid composition, amino acid rich regions, diversity of amino acids and counts

DISORDERED FRAGMENTS
detailed statistical analysis of disordered fragments
(all physicochemical characteristics)

LOCAL FRAGMENTS
statistical analysis of 20AA-long fragments
(LLPS, LLPS disorder, SwissProt as reference)

Single Sequence Analysis

To analyze individual protein sequences for inferring phase-separating affinities and for identifying "sticky" fragments.

CHEMICAL PROPERTIES
charge decoration patterns, polarity, hydrophobicity, aromaticity, π-bonding

SEQUENCE CONSERVATION
multiple sequence alignment (MSA), evolutionary conservation, consensus profile

SEQUENCE-BASED PREDICTIONS
secondary structure (SS3/SS8), solvent accessibility (SA), disordered (IDR) & low-complexity (LCR) regions

SHORT MOTIFS & REPEATS
tandem repeats detected along the LLPS sequence
and various SLiMs: ELMs, LARKS, ZIPPER, GARs, PTMs

As of September 2021, the BIAPSS database contains comprehensive analyses of 501 LLPS-related protein sequences, collected as a joined superset of manually curated cases deposited in the primary LLPS databases: PhaSePro (121 unique entries), LLPSDB (198 unique entries), and the reviewed subset of PhaSepDB.v1 (352 unique entries).

The biapss.csv file is an easy-to-parse CSV format database containing a summary of key characteristics for each of the analyzed sequences. The following tags in the header are provided:
num - index of alphabetically sorted sequences,
uniprot - UniProt ID, a unique identifier assigned to each sequence,
name - the common name of the protein,
organism - organism for a given sequence,
gene - the gene encoding the sequence,
length - length of the amino acid sequence,
llps - regions in the protein sequence that drive the phase separation (current source: PhaSePro),
disorder - a fraction of residues that lacks an ordered structure (consensus from predictions),
hydroph - a fraction of hydrophobic residues in the sequence [G, A, L, I, V, P, F],
pi - a fraction of π-bond containing systems [R, N, D, Q, E, G*],
aromatic - a fraction of aromatic residues in the sequence [F, W, Y, H],
polar - a fraction of polar residues in the sequence [C, M, S, T, Y, Q, N],
charge - a fraction of charged residues in the sequence [D, E, K, R, H],
ch_neg - a fraction of negatively charged residues in the sequence [D, E],
ch_pos - a fraction of positively charged residues in the sequence [K, R, H],
secstruct - a fraction of residues that present an ordered structure (consensus from predictions),
ssH - a fraction of residues within helical secondary structure elements (consensus from predictions),
ssE - a fraction of residues within extended secondary structure elements (consensus),
buried - a fraction of buried residues (based on consensus from solvent accessibility predictions),
exposed - a fraction of exposed residues (based on consensus from solvent accessibility predictions),
lcr - a fraction of residues located within low-complexity regions (consensus from predictions),
ch_SCD - charge decoration parameter, sequence charge decoration - SCD,
ch_OCS - charge decoration parameter, overall charge symmetry - OCS,
ch_FCR - charge decoration parameter, a fraction of charged residues,
PSP - propensity for phase separation, the score of PSpredictor.
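As a rough illustration of how the summary table might be consumed, the Python sketch below parses a biapss.csv-style table with the standard csv module and selects entries with a high disorder fraction. The two sample rows, their identifiers and values, the reduced set of columns, the comma delimiter, and the 0.5 threshold are all assumptions made for the example and are not taken from the database.

```python
import csv
import io

# Hypothetical two-row excerpt; the real file carries all columns listed above.
SAMPLE = """num,uniprot,name,length,disorder,charge
1,X00001,Example protein A,120,0.75,0.30
2,X00002,Example protein B,300,0.10,0.22
"""

NUMERIC_COLUMNS = ("length", "disorder", "charge")

def load_rows(handle):
    """Read rows as dicts and convert the selected columns to float."""
    rows = []
    for row in csv.DictReader(handle):
        for key in NUMERIC_COLUMNS:
            row[key] = float(row[key])
        rows.append(row)
    return rows

def highly_disordered(rows, threshold=0.5):
    """Return UniProt IDs whose disorder fraction reaches the threshold."""
    return [r["uniprot"] for r in rows if r["disorder"] >= threshold]

rows = load_rows(io.StringIO(SAMPLE))
print(highly_disordered(rows))
```

For a downloaded file, the io.StringIO object would be replaced by an opened file handle, for example open("biapss.csv", newline="").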

The seq.csv is an easy-to-parse CSV format file containing the amino acid sequences of entries stored in the BIAPSS database. The following tags in the header are provided:
uniprot - UniProt ID, a unique identifier assigned to each sequence,
length - length of the amino acid sequence,
seq - amino acid sequence
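The residue groups listed for biapss.csv can be applied directly to a sequence taken from seq.csv. The following Python sketch computes several of those fractions from a raw sequence string. It assumes the fractions are simple residue counts divided by the sequence length, which may differ from the exact BIAPSS procedure. The pi group is left out because the meaning of the G* qualifier is not specified here, and the example peptide is invented.

```python
# Residue groups copied from the biapss.csv column descriptions.
RESIDUE_SETS = {
    "hydroph": set("GALIVPF"),
    "aromatic": set("FWYH"),
    "polar": set("CMSTYQN"),
    "charge": set("DEKRH"),
    "ch_neg": set("DE"),
    "ch_pos": set("KRH"),
}

def residue_fractions(seq):
    """Return the fraction of residues falling into each group."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {
        name: sum(aa in members for aa in seq) / len(seq)
        for name, members in RESIDUE_SETS.items()
    }

example_seq = "GRGDSPYKEF"  # invented 10-residue peptide
result = residue_fractions(example_seq)
print(result)
```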

The db_links.csv is an easy-to-parse CSV format file containing the cross-references for entries stored in the BIAPSS database. The single hyphen in the particular field means that there is no corresponding entry in the selected database. The following tags in the header are provided:
num - index of alphabetically sorted sequences,
pid - a unique BIAPSS identifier of alphabetically sorted sequences,
uniprot - UniProt ID, a unique identifier assigned to each sequence,
llpsdb - list of hyphen-separated identifiers for a given LLPS sequence that links to the corresponding entries in the LLPSDB,
phasepro - an identifier for a given LLPS sequence that links to the corresponding entry in the PhaSePro,
phasepdb - an identifier for a given LLPS sequence that links to the corresponding entry in the PhaSepDB,
drllps - an identifier for a given LLPS sequence that links to the corresponding entry in the drLLPS,
cellatlas - an identifier for a given LLPS sequence that links to the corresponding entry in the Cell Atlas section in the Human Protein Atlas database,
structure - list of underscore-separated identifiers for a given LLPS sequence that links to the corresponding entries in the structure databases: Swiss-Model Repository, ModBase, PDBe,
disprot - an identifier for a given LLPS sequence that links to the corresponding entry in the DisProt,
image - list of pipe-separated identifiers for a given LLPS sequence that links to the corresponding entries in the structure databases: Human Protein Atlas (the microscopic cell image) and COMPARTMENTS (the subcellular location).
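The multi-valued fields use three different separators, and a lone hyphen marks a missing cross-reference. A minimal Python sketch of how such fields might be split is shown below. The sample identifiers are invented, and the code assumes that LLPSDB identifiers never contain a hyphen themselves, which should be checked against the real data.

```python
def split_field(value, sep):
    """Split a multi-valued field; a single hyphen means 'no entry'."""
    value = value.strip()
    if value in ("", "-"):
        return []
    return value.split(sep)

def parse_links(row):
    """Turn the separator-encoded fields of one db_links.csv row into lists."""
    return {
        "llpsdb": split_field(row["llpsdb"], "-"),        # hyphen-separated
        "structure": split_field(row["structure"], "_"),  # underscore-separated
        "image": split_field(row["image"], "|"),          # pipe-separated
    }

# Hypothetical row with placeholder identifiers.
sample_row = {"llpsdb": "p001-p002", "structure": "-", "image": "imgA|imgB"}
links = parse_links(sample_row)
print(links)
```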

Statistics on the LLPS set deliver a broad overview of general biophysical characteristics and regularities common to known phase-separating protein sequences. Among the files available for download you may find:
overall_summary.csv contains the most general characteristics of the LLPS sequence set described by the average from parameters such as length of an entire sequence; length of disordered fragment; disorder fraction; secondary structure content; fraction of buried and exposed residues; fraction of polar, hydrophobic, aromatic, charged (total, positively, negatively) residues; FCR, OCS, and SCD charge patterns parameters; amino acid composition for LLPS sequences and LLPS disordered fragments.
composition.csv and diso_frags_composition.csv contain detailed amino acid composition for individual LLPS sequences and extracted disordered fragments, respectively. Comparison of these fractions with observations for other protein groups (globular, transmembrane) or organism kingdoms (viruses, archaea, bacteria, eukaryotes) is easily accessible thanks to the attached reference_fractions.csv file containing the reference data resulting from the works: Kozlowski (Nucleic Acids Res. 45(D1), 2017), Gromiha et al. (Comput Biol Chem. 29(2), 2005), Basile et al. (PLoS Comput Biol. 15(7), 2019), and BIAPSS-related statistics.
biapss.csv contains a summary of key characteristics for each of the analyzed sequences (for details, see LLPS deposits tab).

The results of deep statistical analysis of short 20 amino acid continuous fragments derived from LLPS sequences as well as from longer disordered LLPS fragments compared with fragments of the same length derived from reference SwissProt sequences are stored in files as follows:
frags_charge_distribution.csv contains the histogram of the number of charged residues (total, negative, positive charge) per 20 amino acid fragment
frags_diversity.csv contains the histogram of the number of diverse residues observed within 20 amino acid fragments (amino acid compositional diversity)
frags_richest_counts.csv contains the histogram of counts of the richest amino acid residue observed within 20 amino acid fragments (amino acid compositional uniformity)
frags_first_second_richest.csv contains the detailed per amino acid histograms of counts of the richest amino acid residue observed within 20 amino acid fragments and the correlation between the two most frequent amino acids in the 20 amino acid fragment.
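The per-fragment quantities behind these histograms can be sketched in Python as follows. The example cuts a sequence into 20-residue fragments and records, for each one, the charged-residue counts, the number of distinct residue types, and the most frequent residue. Whether BIAPSS uses overlapping windows is not stated here, so the sliding step of 1 is an assumption, as are the function names and the toy sequence. The charged groups follow the ch_neg and ch_pos definitions [D, E] and [K, R, H].

```python
from collections import Counter

NEGATIVE = set("DE")
POSITIVE = set("KRH")

def fragment_stats(seq, size=20, step=1):
    """Describe every fragment of the given size (sliding window, assumed)."""
    stats = []
    for start in range(0, len(seq) - size + 1, step):
        counts = Counter(seq[start:start + size])
        neg = sum(counts[aa] for aa in NEGATIVE)
        pos = sum(counts[aa] for aa in POSITIVE)
        richest_aa, richest_n = counts.most_common(1)[0]
        stats.append({
            "start": start + 1,          # 1-based position of the fragment
            "negative": neg,
            "positive": pos,
            "total": neg + pos,
            "diversity": len(counts),    # number of distinct residue types
            "richest": richest_aa,
            "richest_count": richest_n,
        })
    return stats

def histogram(stats, key):
    """Count how many fragments share each value of the chosen quantity."""
    return Counter(s[key] for s in stats)

toy_seq = "G" * 10 + "DEKR" * 3  # invented 22-residue sequence
stats = fragment_stats(toy_seq)
print(histogram(stats, "total"))
```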

Short sequential and structural motifs detected in the sequences of BIAPSS entries are divided into:
1. ELM_counts.csv contains the detected Eukaryotic Linear Motifs (ELM), where the following tags in the header are provided:
elm - an identifier of ELM class and instance
counts - the total number of detected motif
n_seq - the number of entries in which the motif was detected
2. known_counts.csv contains the other known short stretches of the protein sequence (e.g., LARKS, steric zipper, phosphosites, GAR regions), where the following tags in the header are provided:
motif - an identifier of SLiM class and instance
counts - the total number of detected motif
n_seq - the number of entries in which the motif was detected
uniprot - a list of UniProt identifiers of entries in which the motif was detected.

The vast majority of sequences identified as LLPS-dependent (466 out of 501) show the presence of structurally disordered fragments (1154 fragments longer than 20 amino acids). In total, we detected 1787 disordered fragments for which statistical sequence analysis provided:
diso_frags_composition.csv contains detailed amino acid composition for extracted LLPS disordered fragments
diso_biapss.csv contains the overall tendency of chemical properties within LLPS disordered fragments (charge decoration patterns, polarity, hydrophobicity, aromaticity); sequence-based predictions of secondary structure, solvent accessibility, structural disorder; detected SLiMs and tandem repeats.
reference_fractions.csv contains the reference amino acid composition for various groups of organisms or protein subsets resulting from the works:
-- Kozlowski (Nucleic Acids Res. 45(D1), 2017),
-- Gromiha et al. (Comput Biol Chem. 29(2), 2005),
-- Basile et al. (PLoS Comput Biol. 15(7), 2019),
-- and BIAPSS-related statistics (Badaczewska-Dawid et al., Integrative assembling of phase separation signals derived from sequence-based analysis, in preparation, 2021).

Statistics of a single LLPS sequence allow inferring the phase-separating affinities and analyzing the biophysical properties of the selected sequence against the background of the known LLPS sequence set. Among the files available for download, you may find:
biapss.csv contains a summary of key characteristics for each of the analyzed sequences [data per sequence] (for details of the content, see LLPS deposits section)
seq.csv contains the amino acid sequences of entries stored in BIAPSS database [data per sequence] (for details of the content, see LLPS deposits section)
patterns [dir] contains detailed biochemical characteristics for each amino acid position along the analyzed sequence [data per position at sequence] (for details of the content, see CHEMICAL PATTERNS section or to learn more, go to Docs subpage)
alignment [dir] contains evolutionary conservation parameters for each amino acid position along the analyzed sequence [data per position in the sequence] (for details of the content, see SEQUENCE CONSERVATION section or to learn more, go to Docs subpage)
sec_struct [dir] contains the predicted secondary structure assignment for each amino acid position along the analyzed sequence [data per position in the sequence] (for details of the content, see SECONDARY STRUCTURE section or to learn more, go to Docs subpage)
solvent_acc [dir] contains the predicted solvent accessibility assignment for each amino acid position along the analyzed sequence [data per position in the sequence] (for details of the content, see SOLVENT ACCESSIBILITY section or to learn more, go to Docs subpage)
disorder [dir] contains the predicted disorder assignment for each amino acid position along the analyzed sequence [data per position in the sequence] (for details of the content, see STRUCTURAL DISORDER section or to learn more, go to Docs subpage)
contacts [dir] contains the predicted contacts between amino acid pairs within the analyzed sequence [data per position in the sequence] (for details of the content, see CONTACT MAP section or to learn more, go to Docs subpage)
motifs [dir] contains the detected SLiMs and tandem repeats along the analyzed sequence [data per sequence] (for details of the content, see SHORT MOTIFS section or to learn more, go to Docs subpage).

The patterns [dir] contains multiple files of type UniProtID.csv for each LLPS sequence with known UniProt ID where the properties (evolutionary conservation, secondary structure, solvent accessibility, structural disorder) were predicted against one of the sequence databases: SP - SwissProt, 50 - UniRef50, 90 - UniRef90.
The following tags in the header are provided:
num - index of residue in the sequence,
seq - amino acid type at a given position in the sequence,
polar - ternary assignment of polar character at a given position: 0 - other AA [D, E, K, R, H], 1 - non-polar AA [A, G, I, L, V, F, P, W], 2 - polar AA [C, M, S, T, Y, Q, N],
charge - ternary assignment of charged character at a given position: 0 - no charge, 1 - negative [D, E], 2 - positive [K, R, H],
aromatic - ternary assignment of π-interactions at a given position: 0 - no effect, 1 - other π-bond [R, N, Q, D, E, G*], 2 - aromatic AA [F, W, Y, H],
hydropathy - ternary assignment of hydropathicity at a given position: 1 - hydrophobic AA [G, A, L, I, V, P, F], 2 - hydrophilic [R, N, D, Q, E, H, S, T], 3 - amphipathic [W, Y, M, C],
KD - [float] assignment of hydropathy in Kyte-Doolittle scale
Hbond - quaternary assignment of donor/acceptor properties for the formation of Hydrogen bonds: 0 - no effect, 1 - donor [R, W, C], 2 - acceptor [D, E], 3 - donor & acceptor [N, Q, H, S, T, Y],
secstruct - a consensus of predicted secondary structure at a given position: H - helical, E - extended, C - coil,
solventacc - a consensus of predicted solvent accessibility at a given position: B - buried, E - exposed, M - medium
disorder - a consensus of predicted structural disorder at a given position: 0 - ordered, 1 - disordered,
lcr - the assignment of the predicted low-complexity at a given position: 0 - high-complexity, 1-4 - the number of tools used that detected a low-complexity region at this position,
logo_p - amino acid type and position in the MSA profile,
logo_n - conservation strength, the measure of evolutionary conservation (how strongly the position is preserved by evolution) provided on a discrete scale from 0 - poorly conserved, to 5 - highly conserved (for details see CHEMICAL PATTERNS section in Docs subpage),
logo_d - amino acid diversity, the measure of evolutionary conservation (how many different AA types occur at this position) provided on a discrete scale from 0 - highly conserved (single AA), to 5 - poorly conserved (for details see CHEMICAL PATTERNS section in Docs subpage),
logo_a - character/attribute, the measure of evolutionary conservation (the chemical nature of the most common amino acids at a given position in the MSA) provided as simple 1-letter tags: A - aromatic, H - hydrophobic, P - polar, B - π-bond, C - charge, 0 - proline/glycine (for details see CHEMICAL PATTERNS section in Docs subpage),
logo_s - the consensus of the secondary structure derived for a given position based on the known PDBs for the sequences in the MSA,
logo_l - the most common alternative amino acids at a given position in the MSA (taken from MSA),
domain - the Pfam domain aligned with a given position (region) in the query sequence, or the sequence database used for preparing the MSA.
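The integer codes in the polar and charge columns follow directly from the residue groups given above. As a hedged sketch, the Python snippet below reproduces those two encodings for an arbitrary sequence. The function name and the handling of non-standard residues (None for polar, 0 for charge) are assumptions for illustration, not documented BIAPSS behavior.

```python
# Code tables built from the group definitions of the polar and charge tags.
POLAR_CODE = {
    **{aa: 0 for aa in "DEKRH"},      # 0 - other (charged) AA
    **{aa: 1 for aa in "AGILVFPW"},   # 1 - non-polar AA
    **{aa: 2 for aa in "CMSTYQN"},    # 2 - polar AA
}
CHARGE_CODE = {
    **{aa: 1 for aa in "DE"},         # 1 - negative
    **{aa: 2 for aa in "KRH"},        # 2 - positive
}

def encode_positions(seq):
    """Return one dict per residue with the num, seq, polar and charge tags."""
    rows = []
    for num, aa in enumerate(seq.upper(), start=1):
        rows.append({
            "num": num,
            "seq": aa,
            "polar": POLAR_CODE.get(aa),       # None for non-standard residues
            "charge": CHARGE_CODE.get(aa, 0),  # 0 - no charge
        })
    return rows

encoded = encode_positions("DASK")  # invented 4-residue example
for row in encoded:
    print(row)
```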

The alignment [dir] contains 2 sub-folders: [FULL] and [PFAM] for the MSA covering the full-length query sequence, and regions with detected Pfam domains, respectively. Each directory contains 2 CSV files:
1) MSA data:
UniprotID_DB-full_from_to_6.45.csv in [FULL], where the from-to range defines the length of the sequence.
UniprotID_domainID_from_to_6.45.csv in [PFAM], where the from-to range defines the covered fragment of the sequence.
2) derived conservation context:
UniprotID_DB_seed_msa.csv in [FULL], where the multiple sequence alignment (MSA) was calculated against the sequence databases: USP - SwissProt, U50 - UniRef50, U90 - UniRef90.
domainID_seed_msa.csv in [PFAM], where the alignment data come from the Pfam seed-MSA for a given domain.

The following tags in the header are provided in the UniprotID_DB-full_from_to_6.45.csv and UniprotID_domainID_from_to_6.45.csv files:
num - ordinary index of residue,
n_seq - index of residue in the sequence,
n_profile - index of residue in the MSA consensus profile,
n_msa - index of residue in the MSA,
seq - amino acid in the sequence,
profile - consensus amino acid,
pp_cons - posterior probability, the confidence of alignment provided in scale [0:0-5%, 1:5-15%, ... , 9:85-95%, *:>95%]
ref_anot - reference annotation, "x" when position is aligned
sec_struct - secondary structure taken from the corresponding known PDBs
logo_t - total height of the letters at a given position (in the hmmlogo output of conservation LOGO-type)
logo_n - strength, normalized conservation level in the discrete scale [0: not conserved, ... , 5: highly conserved]
logo_d - amino acid diversity at a given position in the MSA, provided in the discrete scale [0: 1AA, 1: 2AA, 3: 4-5AA, 4: 6-10AA, 5: 11-20AA]
logo_a - the physicochemical character of most common amino acids at a given position
logo_m - the highest amino acid at a given position in the LOGO-type conservation, syntax: height_amino_acid
logo_l - the list of amino acids above background threshold in the LOGO-type conservation
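The pp_cons symbols encode posterior-probability bins, so a small decoder can be convenient when filtering well-aligned positions. The Python sketch below maps a symbol to its percentage range according to the scale given for pp_cons. Returning None for any other character (for example a gap symbol) is an assumption, as is the helper name.

```python
def pp_range(symbol):
    """Return (low, high) percentage bounds for one pp_cons symbol."""
    if symbol == "*":
        return (95, 100)
    if len(symbol) == 1 and symbol in "0123456789":
        digit = int(symbol)
        # Bins are centred on 10 * digit, except the first one (0-5%).
        return (max(0, 10 * digit - 5), 10 * digit + 5)
    return None  # assumed: anything else marks an unaligned or gap position

def well_aligned(pp_string, min_low=85):
    """Return 1-based positions whose lower probability bound reaches min_low."""
    positions = []
    for index, symbol in enumerate(pp_string, start=1):
        bounds = pp_range(symbol)
        if bounds is not None and bounds[0] >= min_low:
            positions.append(index)
    return positions

print(well_aligned("3579*."))  # invented pp_cons string
```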

The following tags in the header are provided in the UniprotID_DB_seed_msa.csv and domainID_seed_msa.csv files:
num - index of residue in the sequence,
uniprot - UniProt ID, a unique identifier assigned to each sequence,
range - the range of residues from the sequences used in MSA,
evalue - the measure of statistical significance,
score - HMMER3.3 sequence scores,
bias - a correction term for biased sequence composition,
mpp - mean posterior probability of the MSA
seq - amino acid sequence; row from the MSA,
ss - secondary structure assignment taken from known PDBs
pp - posterior probability assignment
pdb - PDB IDs for known structures
name - the common name of the protein,
gene - the gene encoded by the sequence,
organism - organism for a given sequence,

The sec_struct [dir] contains a single file of type UniProtID.ss.csv for each LLPS sequence with known UniProt ID.
In the file, you can find the sequence-based prediction of secondary structure for each position in the sequence provided in SS3 (H, E, C) or SS8 (H, G, I, E, B, T, S, C/L) notation.
The following tags in the .csv header are provided:
num - index of residue in the sequence,
seq - amino acid sequence,
consensus - secondary structure in SS3 (H, E, C) notation derived as a consensus of sequence-based predictions by using PSIPRED, RAPTOR-X, PORTER-5, SPIDER-3, and FESS.
psipred3 - PSIPRED sequence-based prediction of secondary structure in SS3 notation,
raptor3 - RAPTOR-X sequence-based prediction of secondary structure in SS3 notation,
porter3 - PORTER-5 sequence-based prediction of secondary structure in SS3 notation,
spider3 - SPIDER-3 sequence-based prediction of secondary structure in SS3 notation,
fess3 - FESS sequence-based prediction of secondary structure in SS3 notation,
raptor8 - RAPTOR-X sequence-based prediction of secondary structure in SS8 notation,
porter8 - PORTER-5 sequence-based prediction of secondary structure in SS8 notation.
For each tool, a separate folder is available, where the files in PSIPRED format (UniProtID.ss3) provide the detailed probabilities of each type of secondary structure element at a given position in the sequence.
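Summary values such as helical or extended content can be derived from the per-position consensus column. The Python sketch below counts the share of H, E, and C states in an SS3 string. The example string is invented, and it is only an assumption that the ssH and ssE fractions in biapss.csv are computed in this simple way.

```python
from collections import Counter

def ss3_fractions(consensus):
    """Return the fraction of H, E and C states in an SS3 string."""
    if not consensus:
        raise ValueError("empty secondary structure string")
    counts = Counter(consensus)
    return {state: counts[state] / len(consensus) for state in "HEC"}

example_ss = "CCHHHHHHCCEEEECCCCCC"  # invented 20-residue assignment
ss_frac = ss3_fractions(example_ss)
print(ss_frac)
```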

The solvent_acc [dir] contains a single file of type UniProtID.sa.csv for each LLPS sequence with known UniProt ID.
In the file, you can find the sequence-based prediction of solvent accessibility for each position in the sequence provided in 3-letter notation (B, E, M). The following tags in the .csv header are provided:
num - index of residue in the sequence,
seq - amino acid sequence,
consensus - solvent accessibility in 3-letter notation (B - buried, E - exposed, M - medium) derived as a consensus of sequence-based predictions by using RAPTOR-X, PaleAle5, SPOT-1D.
raptor - RAPTOR-X sequence-based prediction of solvent accessibility in 3-letter notation,
paleale - PaleAle5 sequence-based prediction of solvent accessibility in 3-letter notation,
spot - SPOT-1D sequence-based prediction of solvent accessibility in 3-letter notation.
For each tool, a separate folder is available, where the files in column format (UniProtID.acc) provide the detailed probabilities of each type of solvent accessibility at a given position in the sequence.

The disorder [dir] contains two files of types UniProtID.binary.csv and UniProtID.diso.csv for each LLPS sequence with known UniProt ID.
In the files, you can find the sequence-based prediction of structural disorder for each position in the sequence provided in binary (0, 1) notation or detailed probabilities of the disorder.
For both types of files the following tags in the header are provided:
num - index of residue in the sequence,
seq - amino acid sequence,
consensus - structural disorder in binary or probability notation derived as a consensus of sequence-based predictions by using IUPred2A (glob), DISOPRED3, VSL2, RAPTOR-X, SPOT-Disorder,
iupredG - sequence-based prediction of structural disorder obtained using IUPred2A (glob),
anchorG - sequence-based prediction of protein binding regions in disordered fragments obtained using ANCHOR (IUPred2A),
iupredS - sequence-based prediction of structural disorder obtained using IUPred2A (short),
anchorS - sequence-based prediction of protein binding regions in disordered fragments obtained using ANCHOR (IUPred2A),
iupredL - sequence-based prediction of structural disorder obtained using IUPred2A (long),
anchorL - sequence-based prediction of protein binding regions in disordered fragments obtained using ANCHOR (IUPred2A),
disopred2 - sequence-based prediction of structural disorder obtained using DISOPRED2,
disopred3 - sequence-based prediction of structural disorder obtained using DISOPRED3,
vsl2 - sequence-based prediction of structural disorder obtained using VSL2,
raptor - sequence-based prediction of structural disorder obtained using RAPTOR-X,
spot - sequence-based prediction of structural disorder obtained using SPOT-Disorder,
Pfit - sequence-based prediction of structural disorder obtained using PONDR-FIT,
Pvlxt - sequence-based prediction of structural disorder obtained using PONDR-VLXT.
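Disordered fragments can be recovered from the binary consensus column as runs of consecutive 1 values. The Python sketch below extracts such runs and optionally keeps only those above a minimum length. The function name, the toy input, and the use of min_len=21 to mean "longer than 20 amino acids" are illustrative assumptions, and the actual BIAPSS fragment definition may include additional rules.

```python
def disordered_fragments(binary, min_len=1):
    """Return (start, end) 1-based inclusive ranges of consecutive 1 values."""
    fragments = []
    start = None
    for index, flag in enumerate(binary, start=1):
        if flag == 1 and start is None:
            start = index                         # a new run begins
        elif flag != 1 and start is not None:
            fragments.append((start, index - 1))  # the run just ended
            start = None
    if start is not None:                         # run reaching the C-terminus
        fragments.append((start, len(binary)))
    return [(s, e) for s, e in fragments if e - s + 1 >= min_len]

# Invented consensus: 3 ordered, 25 disordered, 4 ordered, 5 disordered.
toy_consensus = [0] * 3 + [1] * 25 + [0] * 4 + [1] * 5
all_frags = disordered_fragments(toy_consensus)
long_frags = disordered_fragments(toy_consensus, min_len=21)
print(all_frags, long_frags)
```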

contacts [dir] contains a single file of type UniProtID.tool.csv for each LLPS sequence with known UniProt ID, where one of the tools (RAPTOR-X, RESPRE, SPOT-Contact) was used against the sequence databases: USP - SwissProt, U50 - UniRef50, U90 - UniRef90.
In the single file, you can find the sequence-based prediction of contacts between amino acid pairs within the analyzed sequence provided as the probability of contact.
The following tags in the header are provided:
id1 - index of residue position in the MSA-based sequence profile for the first residue in the contact,
res1 - 1-letter amino acid code and index of residue position in the query LLPS sequence for the first residue in the contact,
id2 - index of residue position in the MSA-based sequence profile for the second residue in the contact,
res2 - 1-letter amino acid code and index of residue position in the query LLPS sequence for the second residue in the contact,
value - the probability of contact.
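A typical use of the contact files is to keep only confident, non-local contacts. The Python sketch below filters rows by contact probability and by sequence separation. The exact layout of the res1 and res2 fields is not specified here, so the form "M1" (one letter followed by the position) is an assumption, as are the 0.5 probability cutoff, the minimum separation of 6 residues, and the sample rows.

```python
import re

# Assumed token layout: 1-letter amino acid code followed by the position.
RES_TOKEN = re.compile(r"^([A-Za-z])(\d+)$")

def parse_res(token):
    """Split a residue token such as 'M1' into ('M', 1)."""
    match = RES_TOKEN.match(token.strip())
    if match is None:
        raise ValueError(f"unexpected residue token: {token!r}")
    return match.group(1).upper(), int(match.group(2))

def confident_contacts(rows, min_prob=0.5, min_sep=6):
    """Keep contacts above the probability cutoff and sequence separation."""
    kept = []
    for row in rows:
        aa1, pos1 = parse_res(row["res1"])
        aa2, pos2 = parse_res(row["res2"])
        prob = float(row["value"])
        if prob >= min_prob and abs(pos2 - pos1) >= min_sep:
            kept.append((aa1, pos1, aa2, pos2, prob))
    return kept

# Hypothetical rows, as they might come out of csv.DictReader.
sample_rows = [
    {"id1": "1", "res1": "M1", "id2": "10", "res2": "K10", "value": "0.82"},
    {"id1": "1", "res1": "M1", "id2": "3", "res2": "A3", "value": "0.95"},
    {"id1": "5", "res1": "G5", "id2": "40", "res2": "F40", "value": "0.20"},
]
contacts = confident_contacts(sample_rows)
print(contacts)
```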

Short sequential and structural motifs detected in the BIAPSS entries can be found in the [motifs] directory.
1. UniprotID.csv contains the list of detected motifs for a given LLPS sequence:
In the files the following tags in the header are provided:
motif - the category of the detected motif
counts - number of detected instances within a given category
instances - the list of instances, syntax: sequence of detected motif + from-to range defining the location within the sequence.

The categories of sequential and structural motifs are defined and described in the Documentation.

For the convenience of users interested in investigating a particular biological system, it is possible to download a data set for a selected LLPS sequence.
Among the files available for download you may find:
UniProtID.fasta contains the amino acid sequence of the entry in FASTA format
UniProtID_patterns.csv contains detailed biochemical characteristics for each amino acid position along the analyzed sequence [data per position at sequence] (for details of the content see CHEMICAL PATTERNS section or to learn more, go to Docs subpage)
UniProtID_msa.csv contains evolutionary conservation parameters for each amino acid position along the analyzed sequence [data per position at sequence] (for details of the content see SEQUENCE CONSERVATION section or to learn more, go to Docs subpage)
UniProtID_ss.csv contains the predicted secondary structure assignment for each amino acid position along the analyzed sequence [data per position at sequence] (for details of the content see SECONDARY STRUCTURE section or to learn more, go to Docs subpage)
UniProtID_sa.csv contains the predicted solvent accessibility assignment for each amino acid position along the analyzed sequence [data per position at sequence] (for details of the content see SOLVENT ACCESSIBILITY section or to learn more, go to Docs subpage)
UniProtID_diso.csv contains the predicted disorder assignment for each amino acid position along the analyzed sequence [data per position at sequence] (for details of the content see STRUCTURAL DISORDER section or to learn more, go to Docs subpage)
UniProtID_cm.csv contains the predicted contacts between amino acid pairs within the analyzed sequence [data per position at sequence] (for details of the content see CONTACT MAP section or to learn more, go to Docs subpage)
UniProtID_motifs.csv contains the detected SLiMs (ELM, LARKS, GARs, steric zipper, phosphosites) along the analyzed sequence [data per sequence] (for details of the content see SHORT MOTIFS section or to learn more, go to Docs subpage).