HMMTOP(1)                  HMMTOP User's Guide                  HMMTOP(1)

NAME
       HMMTOP - Prediction of transmembrane helices and topology for
       transmembrane proteins using a hidden Markov model

COMMAND
       hmmtop -if=name [-of=name] [-sf=format] [-lf=name] [-pi=mode]
              [-ps=size] [-is=point] [-in=num] [-pp] [-pc] [-pl]
              [-loc=b-e-sp] [-locf=name] [-noit] [-h] [-sh]

DESCRIPTION
       hmmtop predicts the membrane topology of integral membrane
       proteins using a hidden Markov model. The program can interpret
       multiple sequences in two different ways. In mpred mode, a
       prediction is provided for the first sequence only, and the
       further sequences are interpreted as homologues of the first
       one. The homologous sequences need not be aligned. In spred
       mode, hmmtop evaluates the input sequences one by one, providing
       an independent prediction for each of them using only single
       sequence information.

       hmmtop can read from the standard input. It supports three input
       sequence formats (FASTA, NBRF/PIR, SWISSPROT) and offers various
       output formats. The option names are case sensitive, but their
       values are case insensitive (for example, -sf=pir is the same as
       -sf=PIR, but -SF=PIR is not accepted). The option name, the
       equal sign and the value have to be written as one word (for
       example, -sf=pir is accepted, but -sf= pir, -sf =pir and
       -sf = pir are not interpreted).

       The architecture of the hidden Markov model has to be defined in
       a file (called hmmtop.arch by default), and the HMMTOP_ARCH
       environment variable has to point to this architecture file. The
       program uses a pseudo count method to speed up optimization. The
       data of the pseudo count vector are given in the hmmtop.psv
       file, and the HMMTOP_PSV environment variable has to point to
       this file.

INPUT OPTIONS
       -if=name, --input_file=name
              name of the input sequence file. If name is -- then the
              program reads from the standard input.

       -sf=format, --sequence_format=format
              format of the sequence(s). format may be FAS for FASTA
              format (default), PIR for NBRF/PIR format or SWP for
              SWISSPROT format.

       -pi=mode, --process_inputfile=mode
              treat sequences in the input file as single or homologous
              sequences. mode may be spred or mpred. With spred, a
              prediction is made for each sequence in the input file
              (default). With mpred, a prediction is made only for the
              first sequence in the input file, and the remaining
              sequences are treated as helpers.

       -ps=size, --pseudo_size=size
              size of the pseudo count vector. size may be from 0 (no
              pseudo count vector used) to 10000 (default=10000).

       -is=point, --iteration_start=point
              starting point of the iteration(s). point may be pseudo
              or random. With pseudo, the iteration starts from the
              pseudo count vector (default). With random, the iteration
              starts from random values.

       -in=num, --iteration_number=num
              num is the number of iterations (used only if the -is
              option is set to random).

       -loc=b-e-sp, --locate=b-e-sp
              locates or locks a given sequence piece in a given
              structural part. The sequence piece is given by the
              numbers b-e, where b is the begin position and e is the
              end position. sp is the structural part and may be i, I,
              o, O or H for inside tail, inside loop, outside tail,
              outside loop and helix parts, respectively. The e
              position may be the character E, meaning the C-terminal
              end of the sequence.

       -locf=file_name, --loc_file=file_name
              file for multiple locates. If the input sequence file
              contains multiple sequences and the -pi=spred option is
              given, then for each sequence the program reads the
              locates from the given file_name. The locates have to be
              listed line by line, one line for each sequence, and the
              syntax is the same as in the -loc=b-e-sp option (see
              above).

       -noit, --noiteration
              makes a prediction without any iteration (optimization),
              i.e. the parameters of the hidden Markov model are set to
              the pseudo count vector, and no iteration is performed to
              maximize the probability for the given model. This makes
              the program behave similarly to MEMSAT and TMHMM, i.e. it
              is faster but less reliable.
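
The b-e-sp syntax described under -loc can be checked before hmmtop is called. The following Python sketch is not part of HMMTOP; the names Locate, parse_locate and locate_option are hypothetical helpers introduced here for illustration. It assumes that positions are 1-based, that b must not exceed e, and that only an uppercase E is accepted for the C-terminal end; the manual does not state any of these explicitly.

```python
import re
from typing import NamedTuple, Optional

# Structural part codes as listed under -loc in this manual.
STRUCTURAL_PARTS = {
    "i": "inside tail",
    "I": "inside loop",
    "o": "outside tail",
    "O": "outside loop",
    "H": "helix",
}

# b-e-sp: begin position, end position (or E), structural part code.
_LOCATE_RE = re.compile(r"([0-9]+)-([0-9]+|E)-([iIoOH])")


class Locate(NamedTuple):
    begin: int
    end: Optional[int]  # None stands for E, the C-terminal end
    part: str


def parse_locate(spec: str) -> Locate:
    """Parse a b-e-sp string such as '123-242-I' or '10-E-o'."""
    match = _LOCATE_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a valid b-e-sp locate: {spec!r}")
    begin = int(match.group(1))
    end = None if match.group(2) == "E" else int(match.group(2))
    # Assumption: positions are 1-based and begin must not exceed end.
    if begin < 1 or (end is not None and end < begin):
        raise ValueError(f"invalid range in locate: {spec!r}")
    return Locate(begin, end, match.group(3))


def locate_option(spec: str) -> str:
    """Validate spec and return the -loc option written as one word."""
    parse_locate(spec)
    return f"-loc={spec}"
```

For example, parse_locate("123-242-I") yields begin 123, end 242 and part I (inside loop), while a string containing spaces or an unknown part code raises ValueError. The same check could be applied to each line of a file passed with -locf.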

OUTPUT OPTIONS
       In the output, hmmtop prints one line per predicted sequence,
       beginning with the string '>HP: '. The line contains the number
       of amino acids and the name of the sequence, followed by the
       localization of the N-terminal amino acid (IN or OUT), the
       number of predicted transmembrane helices, and the begin and end
       positions of each transmembrane helix. Additional output can be
       generated by the following options.

       -of=name, --output_file=name
              name of the output file. If this option is omitted or
              name is -- then the program writes to the standard
              output.

       -pp, --print_probabilities
              Print the optimized probabilities.

       -pc, --print_pseudocount
              Print the pseudo count vector used.

       -pl, --print_longprediction
              Print the prediction in a long format. The input sequence
              and the predicted localization of each amino acid are
              printed.

MISCELLANEOUS OPTIONS
       -h, --help
              Print a long help message.

       -sh, --short_help
              Print a short help message.

       -lf=name, --log_file=name
              name of the log file for debugging purposes.

ENVIRONMENT
       HMMTOP_ARCH
              has to point to the file containing the architecture of
              the model. If not set, the program searches for the
              hmmtop.arch file in the current directory.

       HMMTOP_PSV
              has to point to the file containing the pseudo count
              vector corresponding to the given architecture. If not
              set, the program searches for the hmmtop.psv file in the
              current directory.

EXAMPLES
       hmmtop -if=sequence.fas
              predicts the topology of each sequence in the
              sequence.fas file using only single sequence information.
              Sequences are in FASTA format.

       hmmtop -if=sequence.pir -sf=PIR -pi=mpred
              predicts the topology of the first sequence in the
              sequence.pir file. If the file contains more than one
              sequence, the remaining sequences are used as helper
              sequences.

       hmmtop -if=sprot36.dat -sf=SWP
              predicts the topology of each sequence in the sprot36.dat
              file, using only single sequence information. Sequences
              are in SWISSPROT format (for example, the full SWISSPROT
              database).

       hmmtop -if=sequence.pir -sf=PIR -pi=spred -loc=123-242-I
              predicts the topology of each sequence given in the
              sequence.pir file in PIR format, with the condition that
              the sequence piece between positions 123 and 242 is
              intracellular. If the sequence.pir file contains multiple
              sequences, each sequence is handled with this condition.

       hmmtop -if=sequence.fas -sf=fas -pi=spred -locf=sequence.loc
              predicts the topology of each sequence given in the
              sequence.fas file in FASTA format, with the conditions
              given in the sequence.loc file. The file has to contain
              the locates in the syntax b-e-sp (see above), line by
              line for each sequence.

BUGS
       Please report bugs to tusi@enzim.hu, after carefully reading
       through all the documentation. In the bug report, please include
       the input file, the output file, the log file (use the -lf=name
       option) and the operating system.

FILES
       hmmtop.arch
              The architecture file of the hidden Markov model.

       hmmtop.psv
              Data for calculating the pseudo count vector used by the
              optimization.

REFERENCES
       G.E. Tusnady and I. Simon (1998) Principles Governing Amino Acid
       Composition of Integral Membrane Proteins: Applications to
       topology prediction. J. Mol. Biol. 283, 489-506

       http://www.enzim.hu/hmmtop

COPYRIGHT
       Gabor E. Tusnady, 2000, 2001

HMMTOP 2.0                      April 2001