Message Boards

WOLFRAM COMMUNITY

2430 Views

2 Replies

1 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Is this an efficient algorithm for BioSequence convolution?

Richard Frost

Richard Frost, Frost Concepts

Posted 1 year ago

I'm currently working on analysis of PCR markers in whole chromosome DNA sequences. In this case the markers are SSR primers. The data are digitally rendered in character strings of letters, each representing an amino acid. In the actual laboratory procedures, "primers" (short nucleotide sequences of 12-24 amino acids) are chosen that will hopefully "code" (fluoresce) across a longer target sequence (500 to 3000 in length) for occurrences of a specific "allele" (a very short sequence along one side of a DNA strand). Chemically, the process is an analog equivalent of convolution. For more information see Plant Genotyping, ed. Batley, ISSN 1064-3745. In my application I am comparing the results of convolution with actual locations of an allele in both target sequences and across entire plant chromosomes which are 16 million to 2 billion in length - depending on species. So computational efficiency is a concern in the long run - literally! Here is my current algorithm, illustrating a primer sequence that doesn't perform well for the target allele: targetAllele = "AAAC"; primer = "ATTAAACTCTTCATTCATCAGT"; (* analog kernel *) primerLength = StringLength[primer]; sequence = "GATGAAAAACACAAAAGTACGAATTAATTTGGTACGTTTCCACCATAATTATTGGTAAAAAGTCGTT\ ACAAACCCTAGTAAACACGATGAAACAAAGAGA"; padding = StringRepeat["X", primerLength]; primerChrs = Characters[primer]; sequenceChrs = Characters[padding <> sequence <> padding]; sequenceChrsLength = Length[sequenceChrs]; convolution = Table[ {i, Total[Table[ If[primerChrs[[j]] == sequenceChrs[[i + j - 1]], 1, 0], {j, 1, primerLength} ]]}, {i, 1, sequenceChrsLength - primerLength} ]; compValue = Max[convolution[[;; , 2]]] + 1; allelePositions = ({# + primerLength, compValue} & /@ StringPosition[sequence, targetAllele][[;; , 1]]); Print[ListPlot[{convolution, allelePositions}, PlotStyle -> {Black, Red}, AxesStyle -> Directive[Black, 10], ImageSize -> Medium]];

POSTED BY: Richard Frost

2 Replies

Sort By:

Daniel Lichtblau

Daniel Lichtblau, Wolfram Research

Posted 1 year ago

I'm not sure what exactly is the goal but here is more compact and likely faster code. primer = "ATTAAACTCTTCATTCATCAGT"; sequence = "GATGAAAAACACAAAAGTACGAATTAATTTGGTACGTTTCCACCATAATTATTGGTAAAAAGTCGTT\ ACAAACCCTAGTAAACACGATGAAACAAAGAGA"; primerChrs = Characters[primer]; sequenceChrs2 = Characters[sequence]; ListCorrelate[primerChrs, sequenceChrs2, {-1, 1}, 0, Boole[SameQ[##]] &] (* {0, 1, 1, 1, 1, 0, 3, 1, 1, 3, 4, 1, 3, 3, 4, 4, 3, 7, 2, 4, 3, 7, 8, \ 7, 9, 5, 5, 7, 7, 6, 6, 11, 7, 3, 6, 8, 9, 6, 6, 6, 6, 4, 8, 10, 6, \ 5, 8, 5, 7, 11, 5, 5, 9, 5, 3, 9, 8, 6, 5, 8, 5, 4, 10, 7, 5, 8, 7, \ 5, 5, 7, 3, 5, 5, 7, 8, 7, 9, 6, 10, 5, 4, 5, 5, 5, 4, 10, 8, 8, 8, \ 6, 4, 4, 6, 2, 5, 8, 4, 9, 7, 8, 4, 4, 5, 2, 4, 2, 4, 6, 3, 2, 3, 4, \ 3, 2, 2, 3, 2, 1, 1, 0, 1} *) It is almost identical to your version, except yours has an extra 0 at the front where you compare `primer` to a string of all `X` (probably that string was one too long and the end of the outer loop was one too soon).

I'm not sure what exactly is the goal but here is more compact and likely faster code.

primer = "ATTAAACTCTTCATTCATCAGT"; 
sequence = 
  "GATGAAAAACACAAAAGTACGAATTAATTTGGTACGTTTCCACCATAATTATTGGTAAAAAGTCGTT\
ACAAACCCTAGTAAACACGATGAAACAAAGAGA";
primerChrs = Characters[primer];
sequenceChrs2 = Characters[sequence];
ListCorrelate[primerChrs, sequenceChrs2, {-1, 1}, 0, 
 Boole[SameQ[##]] &]

(* {0, 1, 1, 1, 1, 0, 3, 1, 1, 3, 4, 1, 3, 3, 4, 4, 3, 7, 2, 4, 3, 7, 8, \
7, 9, 5, 5, 7, 7, 6, 6, 11, 7, 3, 6, 8, 9, 6, 6, 6, 6, 4, 8, 10, 6, \
5, 8, 5, 7, 11, 5, 5, 9, 5, 3, 9, 8, 6, 5, 8, 5, 4, 10, 7, 5, 8, 7, \
5, 5, 7, 3, 5, 5, 7, 8, 7, 9, 6, 10, 5, 4, 5, 5, 5, 4, 10, 8, 8, 8, \
6, 4, 4, 6, 2, 5, 8, 4, 9, 7, 8, 4, 4, 5, 2, 4, 2, 4, 6, 3, 2, 3, 4, \
3, 2, 2, 3, 2, 1, 1, 0, 1} *)

It is almost identical to your version, except yours has an extra 0 at the front where you compare primer to a string of all X (probably that string was one too long and the end of the outer loop was one too soon).

POSTED BY: Daniel Lichtblau

Richard Frost

Richard Frost, Frost Concepts

Posted 1 year ago

That's exactly what I needed, thank you!

POSTED BY: Richard Frost

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Group Abstract

Feedback