Message Boards Message Boards

Is this an efficient algorithm for BioSequence convolution?

Posted 1 year ago

I'm currently working on analysis of PCR markers in whole chromosome DNA sequences. In this case the markers are SSR primers. The data are digitally rendered in character strings of letters, each representing an amino acid. In the actual laboratory procedures, "primers" (short nucleotide sequences of 12-24 amino acids) are chosen that will hopefully "code" (fluoresce) across a longer target sequence (500 to 3000 in length) for occurrences of a specific "allele" (a very short sequence along one side of a DNA strand). Chemically, the process is an analog equivalent of convolution. For more information see Plant Genotyping, ed. Batley, ISSN 1064-3745.

In my application I am comparing the results of convolution with actual locations of an allele in both target sequences and across entire plant chromosomes which are 16 million to 2 billion in length - depending on species. So computational efficiency is a concern in the long run - literally!

Here is my current algorithm, illustrating a primer sequence that doesn't perform well for the target allele:

targetAllele = "AAAC";
primer = "ATTAAACTCTTCATTCATCAGT";  (* analog kernel *)
primerLength = StringLength[primer];
sequence = 
  "GATGAAAAACACAAAAGTACGAATTAATTTGGTACGTTTCCACCATAATTATTGGTAAAAAGTCGTT\
ACAAACCCTAGTAAACACGATGAAACAAAGAGA";
padding = StringRepeat["X", primerLength];

primerChrs = Characters[primer];
sequenceChrs = Characters[padding <> sequence <> padding];
sequenceChrsLength = Length[sequenceChrs];
convolution = Table[
   {i,
    Total[Table[
      If[primerChrs[[j]] == sequenceChrs[[i + j - 1]], 1, 0],
      {j, 1, primerLength}
      ]]},
   {i, 1, sequenceChrsLength - primerLength}
   ];
compValue = Max[convolution[[;; , 2]]] + 1;
allelePositions = ({# + primerLength, compValue} & /@ 
    StringPosition[sequence, targetAllele][[;; , 1]]);
Print[ListPlot[{convolution, allelePositions}, 
   PlotStyle -> {Black, Red}, AxesStyle -> Directive[Black, 10], 
   ImageSize -> Medium]];

enter image description here

POSTED BY: Richard Frost
2 Replies

I'm not sure what exactly is the goal but here is more compact and likely faster code.

primer = "ATTAAACTCTTCATTCATCAGT"; 
sequence = 
  "GATGAAAAACACAAAAGTACGAATTAATTTGGTACGTTTCCACCATAATTATTGGTAAAAAGTCGTT\
ACAAACCCTAGTAAACACGATGAAACAAAGAGA";
primerChrs = Characters[primer];
sequenceChrs2 = Characters[sequence];
ListCorrelate[primerChrs, sequenceChrs2, {-1, 1}, 0, 
 Boole[SameQ[##]] &]

(* {0, 1, 1, 1, 1, 0, 3, 1, 1, 3, 4, 1, 3, 3, 4, 4, 3, 7, 2, 4, 3, 7, 8, \
7, 9, 5, 5, 7, 7, 6, 6, 11, 7, 3, 6, 8, 9, 6, 6, 6, 6, 4, 8, 10, 6, \
5, 8, 5, 7, 11, 5, 5, 9, 5, 3, 9, 8, 6, 5, 8, 5, 4, 10, 7, 5, 8, 7, \
5, 5, 7, 3, 5, 5, 7, 8, 7, 9, 6, 10, 5, 4, 5, 5, 5, 4, 10, 8, 8, 8, \
6, 4, 4, 6, 2, 5, 8, 4, 9, 7, 8, 4, 4, 5, 2, 4, 2, 4, 6, 3, 2, 3, 4, \
3, 2, 2, 3, 2, 1, 1, 0, 1} *)

It is almost identical to your version, except yours has an extra 0 at the front where you compare primer to a string of all X (probably that string was one too long and the end of the outer loop was one too soon).

POSTED BY: Daniel Lichtblau

That's exactly what I needed, thank you!

POSTED BY: Richard Frost
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract