Arpith Chacko Jacob - since 1982...


  SIMD Smith-Waterman search for x86 - Beta release

August 01, 2004  

Technology: Intel MMX/SSE/SSE2
Algorithm: Diagonal, Horizontal
Precision: Byte, Word

Copyright (C) 2003 Arpith Chacko Jacob


This package includes a parallel version of the Smith-Waterman algorithm that uses the multimedia extension instructions (MMX, SSE, SSE2) on Intel platforms. See [1] for complete details. Two approaches are used: diagonal [2] and horizontal [3].
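
For orientation, the scalar recurrence that the SIMD versions parallelize can be sketched as below. This is a plain-C reference, not code from the package: the function name sw_score, the simple match/mismatch scoring (used here in place of a similarity matrix), and the gap model (a gap of length k costs gap_init + k * gap_ext) are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

static int max2(int x, int y) { return x > y ? x : y; }

/* Hypothetical scalar reference: best local alignment score of a vs b.
   Assumes a gap of length k costs gap_init + k * gap_ext.
   Returns -1 if memory cannot be allocated. */
int sw_score(const char *a, const char *b,
             int match, int mismatch, int gap_init, int gap_ext)
{
    size_t n = strlen(a), m = strlen(b);
    int *H = calloc(m + 1, sizeof *H); /* scores of the previous row */
    int *F = calloc(m + 1, sizeof *F); /* best score ending in a vertical gap */
    int best = 0;

    if (H == NULL || F == NULL) { free(H); free(F); return -1; }

    for (size_t i = 1; i <= n; i++) {
        int diag = 0; /* H[i-1][j-1] */
        int left = 0; /* H[i][j-1]   */
        int E = 0;    /* best score ending in a horizontal gap */
        for (size_t j = 1; j <= m; j++) {
            int up = H[j]; /* H[i-1][j], not yet overwritten */
            int h = diag + (a[i - 1] == b[j - 1] ? match : -mismatch);
            E = max2(E - gap_ext, left - gap_init - gap_ext);
            F[j] = max2(F[j] - gap_ext, up - gap_init - gap_ext);
            h = max2(max2(h, 0), max2(E, F[j])); /* local: never below 0 */
            diag = up;
            H[j] = h;
            left = h;
            best = max2(best, h);
        }
    }
    free(H);
    free(F);
    return best;
}
```

With match 2, mismatch 1, gap_init 3 and gap_ext 1, two identical 4-letter sequences score 8, and aligning "ACGTACGT" against "ACGTTACGT" opens a single one-letter gap (16 - 4 = 12).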

The source is written in C and assembly and may be compiled with the GNU C compiler and the GNU assembler. Implementations are available for MMX (Intel Pentium with MMX, Intel Pentium II), SSE (Intel Pentium III) and SSE2 (Intel Pentium 4). A processor-detection routine, which should run on most Intel 486 machines and above, selects the best available implementation at run time according to the technology supported by the processor.
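
The selection logic described above can be sketched as follows. This is not the package's detection routine: it substitutes the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports for hand-written detection code, and the names simd_impl, select_impl, detect_impl and impl_name are hypothetical.

```c
/* Hypothetical labels for the three implementations named in the text. */
typedef enum { IMPL_NONE, IMPL_MMX, IMPL_SSE, IMPL_SSE2 } simd_impl;

/* Pick the most capable implementation from three feature flags. */
simd_impl select_impl(int has_mmx, int has_sse, int has_sse2)
{
    if (has_sse2) return IMPL_SSE2;
    if (has_sse)  return IMPL_SSE;
    if (has_mmx)  return IMPL_MMX;
    return IMPL_NONE;
}

/* Query the running CPU. Relies on GCC/Clang builtins on x86 targets;
   on any other compiler or target it reports no SIMD support. */
simd_impl detect_impl(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return select_impl(__builtin_cpu_supports("mmx"),
                       __builtin_cpu_supports("sse"),
                       __builtin_cpu_supports("sse2"));
#else
    return IMPL_NONE;
#endif
}

/* Map a label to the technology keyword used on the command line. */
const char *impl_name(simd_impl impl)
{
    switch (impl) {
    case IMPL_SSE2: return "sse2";
    case IMPL_SSE:  return "sse";
    case IMPL_MMX:  return "mmx";
    default:        return "none";
    }
}
```

Keeping select_impl separate from the hardware query makes the preference order (SSE2, then SSE, then MMX) easy to check without access to older processors.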

Byte (maximum possible score 255) and word (maximum possible score 65535) precision implementations are available. The actual maximum score for each precision depends on the similarity matrix. Negative values are eliminated from the SW comparison matrix by biasing the similarity matrix, which increases the usable precision because no bit has to be reserved for the sign. The bias value (BIAS) is the magnitude of the most negative value in the similarity matrix and is added to each element of the matrix. Hence the maximum scores for the byte and word precisions are 255 - BIAS and 65535 - BIAS respectively.
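
The biasing step can be illustrated with a short sketch. It takes BIAS to be the magnitude of the most negative matrix entry, so that every biased entry is non-negative. The function names are hypothetical, and byte_step assumes saturating unsigned arithmetic in the style of [3]; the package's own inner loop may differ.

```c
#include <stddef.h>

/* Bias = magnitude of the most negative entry (0 if none is negative). */
int matrix_bias(const signed char *m, size_t n)
{
    int min = 0;
    for (size_t i = 0; i < n; i++)
        if (m[i] < min)
            min = m[i];
    return -min;
}

/* Write the biased, non-negative matrix into out. */
void bias_matrix(const signed char *m, unsigned char *out, size_t n, int bias)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(m[i] + bias);
}

/* Largest usable score: 255 - bias (byte) or 65535 - bias (word). */
int max_score(int bias, int word)
{
    return (word ? 65535 : 255) - bias;
}

/* One byte-precision update along the diagonal, ASSUMING saturating
   unsigned arithmetic: add the biased score, then subtract the bias. */
unsigned char byte_step(unsigned char h_diag, unsigned char biased, int bias)
{
    int t = h_diag + biased;
    if (t > 255) t = 255; /* saturating add */
    t -= bias;
    if (t < 0) t = 0;     /* saturating subtract clamps at zero */
    return (unsigned char)t;
}
```

For a matrix with entries {5, -4, 0, -2} the bias is 4, the biased entries are {9, 0, 4, 2}, and the byte-precision ceiling is 251: the intermediate sum saturates at 255 before the bias is subtracted.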

The horizontal MMX and SSE approaches perform better than the corresponding diagonal approaches for standard gap penalties, and hence should be preferred on Pentium III machines or lower. The diagonal SSE2 approach performs better than the horizontal SSE2 approach for standard gap penalties, and hence should be used on Pentium 4 machines or higher. For extremely large gap penalties (of the order of 40 + 2k, see [1]), however, the horizontal approach performs far better.


  • The package does not include a 3DNow! implementation for AMD machines. However, SSE2 is supported on AMD processors.
  • The MMX implementations perform poorly due to the lack of appropriate instructions in the MMX instruction set.
  • The MMX and SSE implementations on Intel workstations use the MMX registers, which are aliased onto the registers of the floating-point unit (FPU). Hence floating-point operations and MMX/SSE instructions cannot be mixed without first clearing the MMX state. The 'emms' instruction is used for this purpose, but it can incur a large time penalty.
    In this package floating-point instructions are used only for benchmarking in DEBUG mode (with the flag -DDEBUG in the Makefile), so the 'emms' instruction is included only in DEBUG mode and not in the normal compilation mode. 'emms' is not required for the SSE2 implementations.
  • Elements in the similarity matrix are limited to signed char.
  • The gap open and gap extend penalties are limited to unsigned char.
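
The two range limits above can be checked before any values are stored. The sketch below is an illustration only: parse_gap_penalty and fits_signed_char are hypothetical helpers, not functions from the package, and accepting 0 as a penalty is an assumption.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a gap penalty from a command-line string.
   Returns 0 and stores the value on success; returns -1 if the string is
   not a decimal integer or does not fit in an unsigned char. */
int parse_gap_penalty(const char *s, unsigned char *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return -1; /* overflow, empty string, or trailing characters */
    if (v < 0 || v > UCHAR_MAX)
        return -1; /* outside the unsigned char range */
    *out = (unsigned char)v;
    return 0;
}

/* Nonzero if a similarity-matrix entry fits in a signed char. */
int fits_signed_char(long v)
{
    return v >= SCHAR_MIN && v <= SCHAR_MAX;
}
```

With these helpers, "12" and "255" are accepted as gap penalties, while "256", "-1" and "4x" are rejected.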

To compile:


To execute:

sw_simd_x86 h|d mmx|sse|sse2|auto b|w|bw a|n database query sim_matrix gap_init gap_ext

h           Horizontal approach
d           Diagonal approach
mmx         MMX Technology
sse         SSE Technology
sse2        SSE2 Technology
auto        Automatically detect best technology available
b           Byte precision for scores
w           Word precision for scores
bw          Byte and Word precision for scores
a           Protein search
n           DNA search
database    FASTA database file containing multiple
            sequences in FASTA format
query       FASTA query file containing a single
            sequence in FASTA format
sim_matrix  Similarity matrix
gap_init    Gap initialization penalty (positive integer)
gap_ext     Gap extension penalty (positive integer)


./sw_simd_x86 h auto b a db_aa.fasta query_aa.fasta blosum50.mat 12 2
./sw_simd_x86 d sse2 b n db.fasta query.fasta pam47.mat 10 5


[1] Arpith Chacko Jacob, "Whole Genome Comparison using Commodity Workstations," B.E. thesis, 2003.

[2] Andrzej Wozniak, "Using video-oriented instructions to speed up sequence comparison," Computer Applications in the Biosciences, 13(2):145-150, 1997.

[3] Torbjorn Rognes and Erling Seeberg, "Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors," Bioinformatics, 16(8):699-706, 2000.