igrep

A fast CUDA implementation of the agrep algorithm for approximate nucleotide sequence matching.

The input to igrep is twofold:

  • A genome to search. A total of 26 assembled genomes have been collected from ftp://ftp.ncbi.nih.gov/genomes. Their sizes range from 0.19 Gnt to 3.50 Gnt, totaling 44 Gnt.
  • A set of queries. A query consists of a pattern over the alphabet A, C, G, T, N, followed by an edit distance. N is a wildcard that matches any of A, C, G, or T in the genome. The pattern length must be between 1 and 64. The edit distance must be between 0 and 9 and must not exceed the pattern length. Substitutions, insertions, and deletions each cost one unit of edit distance. Each job processes up to 10,000 queries.
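
The query rules above can be illustrated with a small CPU sketch in Python. It is not igrep's CUDA kernel: it assumes the bit-parallel recurrence generally associated with agrep (Wu and Manber), and the names validate_query, find_matches, and the limit parameter are hypothetical, introduced only for this example. Reporting 0-based ending positions and treating any genome character other than A, C, G, T as a mismatch are also assumptions.

```python
# Illustrative sketch only; names and error messages are hypothetical,
# not taken from igrep.

ALPHABET = "ACGTN"


def validate_query(pattern, k):
    """Raise ValueError if the query violates the constraints listed above."""
    if not 1 <= len(pattern) <= 64:
        raise ValueError("pattern length must be between 1 and 64")
    if any(c not in ALPHABET for c in pattern):
        raise ValueError("pattern may contain only A, C, G, T, N")
    if not 0 <= k <= 9:
        raise ValueError("edit distance must be between 0 and 9")
    if k > len(pattern):
        raise ValueError("edit distance must not exceed the pattern length")


def find_matches(genome, pattern, k, limit=1000):
    """Return 0-based ending positions where pattern matches within k edits."""
    validate_query(pattern, k)
    m = len(pattern)

    # mask[c] has bit i set when pattern[i] accepts genome character c.
    # The wildcard N accepts all four nucleotides.
    mask = {c: 0 for c in "ACGT"}
    for i, p in enumerate(pattern):
        for c in "ACGT":
            if p == c or p == "N":
                mask[c] |= 1 << i

    full = (1 << m) - 1      # keep only the m pattern bits
    accept = 1 << (m - 1)    # bit set when the whole pattern has matched

    # state[d], bit i set: pattern[:i+1] matches a suffix of the text read
    # so far with at most d edits. Initially, d leading pattern characters
    # can be deleted at a cost of d edits.
    state = [(1 << d) - 1 for d in range(k + 1)]

    ends = []
    for pos, c in enumerate(genome):
        cm = mask.get(c, 0)  # assumption: unknown characters match nothing
        prev_old = state[0]
        state[0] = ((state[0] << 1) | 1) & cm
        for d in range(1, k + 1):
            old = state[d]
            state[d] = (
                (((old << 1) | 1) & cm)   # match
                | prev_old                # extra character in the genome
                | (prev_old << 1)         # substitution
                | (state[d - 1] << 1)     # pattern character skipped
                | 1
            ) & full
            prev_old = old
        if state[k] & accept:
            ends.append(pos)
            if len(ends) == limit:        # cap on matches per query
                break
    return ends
```

With this sketch, find_matches("GATTACA", "TTA", 0) returns [4] (an exact match ending at index 4), and find_matches("ATG", "ACG", 1) returns [2] (one substitution). Because the pattern is limited to 64 characters, each state fits in a single 64-bit word, which is what makes this kind of recurrence attractive for a GPU implementation.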

The following results are available for download:

  • log.csv: summary of queries and results.
  • pos.csv: ending positions of matches. For each query, up to 1,000 matches will be returned.

Hongjian Li, Bing Ni, Man-Hon Wong, and Kwong-Sak Leung. A Fast CUDA Implementation of Agrep Algorithm for Approximate Nucleotide Sequence Matching. 9th IEEE Symposium on Application Specific Processors (SASP), pp. 74-77, San Diego, United States, 5-6 June 2011. DOI: 10.1109/SASP.2011.5941082