Group Abstract

Message Boards

WOLFRAM COMMUNITY

1.9K Views

0 Replies

0 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Biological Sciences Wolfram Language Tuning and Debugging

Fast algorithms for generalized discrete 1D correlation?

Richard Frost

Richard Frost, Frost Concepts

Posted 2 years ago

[UPDATE: I managed to create a tractable solution to this computation by telescoping the iterative loops from longest (most iterations) on the exterior down to shortest in the innermost, and applying "FinestGrained" ParallelDo on an overclocked i9-10900KF.] I have a brute-force working code for this (below) but I'm seeking something with faster Mathematica runtimes. Discrete 1D correlation usually involves finding occurrences of a relatively short list of values (a kernel) in a longer sequence (the target) that match or partially match the kernel. In standard 1D correlation the matches are of the same length as the kernel. The built-in Mathematica function for this is ListCorrelate. There are many reasons for utilizing 1D correlation, one of interest here is "noise" in biosequence data. The current generation machines are very good but the overall error rate from a high-quality lab is (in my experience) 3% to 7% per data set (fully sequenced chromosomes of a specimen). So for example, if the kernel is 20 values long, then a 10% error rate would yield 2 missed characters of the kernel on average. As a precaution an investigator might filter out correlations having more than 8 missed values. Another reason for using correlation in biosequences is that biological processes do not always produce exact replicas in the manufacture of DNA. For example, a sequence containing "GTGA" in many cells can occur as "GTAA" in a few cells. One issue that arises with correlation is that we usually don't know if a less than perfect match is a transcription error by biology or machine, or simply coincidental. Generalized 1D correlation extends the process above to patterns whose constituents have been separated by interstitial values. For example, this kernel (called a primer in genomics) GGTGTAGCCCAAGCCCTTAT in the long target sequence with the sequence "ACACAC" inserted GGACACACTGTAGCCCAAGCCCTTAT Insertions like this occur when one strand of DNA "mate" with another". We want to know about these occurrences -- and detect them even if they contain transcription errors (a "few" mismatched values). It is very important though that the letters of the original kernel found in the target sequence occur in the same order. In the implementation below, I have introduced the concept of "width allowance". This parameter is used to search for primer occurrences of increasing widths. In the example code, the primer is width 20 (bp = base pairs, a single letter value). Initially, the width allowance is 0 so only occurrences of width 20 are returned. As the iterations progress, occurrences of length 21, 22, etc. are returned. We don't want to return shorter lengths when a longer length has been specified (e.g. length 20 when we specified 22) because these will be duplicates of what was found in a prior iteration. (* * This is a toy example. * To add realism, use * https://frostconcepts.org/data/chr1F.txt for genomeSequence * with * maxWidthAllowance = 32 * and write output to file, not notebook. * Only 1 sequence is provided here, there are actually 2 x 23 of \ varying length. * Only 1 primer is listed here, there are actually many more of \ varying length. ) correlationSup = 6; maxWidthAllowance = 5; genomeSequence = "ATCATCGGATTAGACCTCAAGCACTTACGGTTCTCGGTTAGTAATTGCGTCCCCTTTTGGTGTAGCC\ CAAGCCCTTATTTTTGCCCAGTAATGCCCAAGCCCATATATGGATTAGACCTCAAGCACTTCTCTGCCCA\ GTAATGCCCAAGCCCATGTGAAAGAAAGAAATTTCACCAAGTACCCTATTACCACC"; genomeSequenceBp = StringLength[genomeSequence]; primer = "GGTGTAGCCCAAGCCCTTAT"; primerBp = StringLength[primer]; pfirst = StringTake[primer, 1]; plast = StringTake[primer, -1]; primerCharsOrder = {{"G", 1}, {"G", 2}, {"T", 3}, {"G", 4}, {"T", 5}, {"A", 6}, {"G", 7}, {"C", 8}, {"C", 9}, {"C", 10}, {"A", 11}, {"A", 12}, {"G", 13}, {"C", 14}, {"C", 15}, {"C", 16}, {"T", 17}, {"T", 18}, {"A", 19}, {"T", 20}}; primerCharsOrderLength = Length[primerCharsOrder]; primerFormsHeader = {"Width Allowance", "Extracted Sequence", "Sequence Location", "Sequence Length", "Correlation Error", "Ordered Parts Positions"}; primerFormsHashSet = CreateDataStructure["OrderedHashSet"]; primerFormsHashSet["Insert", primerFormsHeader]; Do[ primerExtentBp = primerBp + widthAllowanceBp; intraBpStr = ToString[primerBp - 2 + widthAllowanceBp]; genomeRegEx = pfirst <> ".{" <> intraBpStr <> "," <> intraBpStr <> "}" <> plast; subsequences = ( {StringTake[genomeSequence, #], #[[1]]} & /@ StringPosition[genomeSequence, RegularExpression[genomeRegEx]] ); subsequencesLength = Length[subsequences]; Do[ seq = seqdata[[1]]; seqLoc = seqdata[[2]]; partsPosHashSet = CreateDataStructure["OrderedHashSet"]; Do[ partsPosHashSet[ "Insert", #] & /@ ({#[[1]], numberedPart[[1]], numberedPart[[2]]} & /@ StringPosition[seq, numberedPart[[1]]]), {numberedPart, primerCharsOrder} ]; partsPos = SortBy[Normal[partsPosHashSet], {First, Last}]; Clear[partsPosHashSet]; partsPosLength = Length[partsPos]; orderedPartsPosHashSet = CreateDataStructure["OrderedHashSet"]; partnoHashSet = CreateDataStructure["HashSet"]; pos = 1; loc = First[partsPos[[pos]]]; partno = Last[partsPos[[pos]]]; orderedPartsPosHashSet["Insert", partsPos[[pos]]]; partnoHashSet["Insert", partno]; While[(pos < partsPosLength) \[And] (partno <= primerCharsOrderLength), posPrior = pos; locPrior = loc; partnoPrior = partno; pos++; loc = First[partsPos[[pos]]]; partno = Last[partsPos[[pos]]]; While[((loc == locPrior) \[Or] (partno <= partnoPrior) \[Or] (loc < partno)) \[And] (pos < partsPosLength), pos++; loc = First[partsPos[[pos]]]; partno = Last[partsPos[[pos]]]; ]; If[! partnoHashSet["MemberQ", partno], partnoHashSet["Insert", partno]; orderedPartsPosHashSet["Insert", partsPos[[pos]]]; ]; ]; Clear[partnoHashSet]; orderedPartsPos = Normal[orderedPartsPosHashSet]; Clear[orderedPartsPosHashSet]; orderedPartsPosLength = Length[orderedPartsPos]; correlationError = primerCharsOrderLength - orderedPartsPosLength; If[(First[Last[orderedPartsPos]] == primerExtentBp) \[And] (correlationError < correlationSup), primerFormsHashSet["Insert", { widthAllowanceBp, seq, seqLoc, Last[orderedPartsPos][[1]] - orderedPartsPos[[1, 1]] + 1, correlationError, orderedPartsPos }]; ]; {}, {seqdata, subsequences} ]; {}, {widthAllowanceBp, 0, 5} ]; primerForms = Normal[primerFormsHashSet]; Clear[primerFormsHashSet]; primerFormsLength = Length[primerForms] - 1; ( sans header *) Print["primerForms (", primerFormsLength, "):"]; Print[Grid[primerForms, Alignment -> Left, Spacings -> Automatic, Frame -> All, FrameStyle -> Gray]];

[UPDATE: I managed to create a tractable solution to this computation by telescoping the iterative loops from longest (most iterations) on the exterior down to shortest in the innermost, and applying "FinestGrained" ParallelDo on an overclocked i9-10900KF.]

I have a brute-force working code for this (below) but I'm seeking something with faster Mathematica runtimes.

Discrete 1D correlation usually involves finding occurrences of a relatively short list of values (a kernel) in a longer sequence (the target) that match or partially match the kernel. In standard 1D correlation the matches are of the same length as the kernel. The built-in Mathematica function for this is ListCorrelate.

There are many reasons for utilizing 1D correlation, one of interest here is "noise" in biosequence data. The current generation machines are very good but the overall error rate from a high-quality lab is (in my experience) 3% to 7% per data set (fully sequenced chromosomes of a specimen). So for example, if the kernel is 20 values long, then a 10% error rate would yield 2 missed characters of the kernel on average. As a precaution an investigator might filter out correlations having more than 8 missed values.

Another reason for using correlation in biosequences is that biological processes do not always produce exact replicas in the manufacture of DNA. For example, a sequence containing "GTGA" in many cells can occur as "GTAA" in a few cells.

One issue that arises with correlation is that we usually don't know if a less than perfect match is a transcription error by biology or machine, or simply coincidental.

Generalized 1D correlation extends the process above to patterns whose constituents have been separated by interstitial values. For example, this kernel (called a primer in genomics)

GGTGTAGCCCAAGCCCTTAT

in the long target sequence with the sequence "ACACAC" inserted

GGACACACTGTAGCCCAAGCCCTTAT

Insertions like this occur when one strand of DNA "mate" with another". We want to know about these occurrences -- and detect them even if they contain transcription errors (a "few" mismatched values). It is very important though that the letters of the original kernel found in the target sequence occur in the same order.

In the implementation below, I have introduced the concept of "width allowance". This parameter is used to search for primer occurrences of increasing widths. In the example code, the primer is width 20 (bp = base pairs, a single letter value). Initially, the width allowance is 0 so only occurrences of width 20 are returned. As the iterations progress, occurrences of length 21, 22, etc. are returned. We don't want to return shorter lengths when a longer length has been specified (e.g. length 20 when we specified 22) because these will be duplicates of what was found in a prior iteration.

(*
 * This is a toy example.
* To add realism, use
* https://frostconcepts.org/data/chr1F.txt for genomeSequence
* with
* maxWidthAllowance = 32
 * and write output to file, not notebook.
  * Only 1 sequence is provided here, there are actually 2 x 23 of \
varying length.
  * Only 1 primer is listed here, there are actually many more of \
varying length.
*)
correlationSup = 6;
maxWidthAllowance = 5;
genomeSequence = 
  "ATCATCGGATTAGACCTCAAGCACTTACGGTTCTCGGTTAGTAATTGCGTCCCCTTTTGGTGTAGCC\
CAAGCCCTTATTTTTGCCCAGTAATGCCCAAGCCCATATATGGATTAGACCTCAAGCACTTCTCTGCCCA\
GTAATGCCCAAGCCCATGTGAAAGAAAGAAATTTCACCAAGTACCCTATTACCACC";
genomeSequenceBp = StringLength[genomeSequence];

primer = "GGTGTAGCCCAAGCCCTTAT";
primerBp = StringLength[primer];
pfirst = StringTake[primer, 1];
plast = StringTake[primer, -1];
primerCharsOrder = {{"G", 1}, {"G", 2}, {"T", 3}, {"G", 4}, {"T", 
    5}, {"A", 6}, {"G", 7}, {"C", 8}, {"C", 9}, {"C", 10}, {"A", 
    11}, {"A", 12}, {"G", 13}, {"C", 14}, {"C", 15}, {"C", 16}, {"T", 
    17}, {"T", 18}, {"A", 19}, {"T", 20}};
primerCharsOrderLength = Length[primerCharsOrder];
primerFormsHeader = {"Width Allowance", "Extracted Sequence", 
   "Sequence Location", "Sequence Length", "Correlation Error", 
   "Ordered Parts Positions"};

primerFormsHashSet = CreateDataStructure["OrderedHashSet"];
primerFormsHashSet["Insert", primerFormsHeader];
Do[
  primerExtentBp = primerBp + widthAllowanceBp;
  intraBpStr = ToString[primerBp - 2 + widthAllowanceBp];
  genomeRegEx = 
   pfirst <> ".{" <> intraBpStr <> "," <> intraBpStr <> "}" <> plast;
  subsequences = (
    {StringTake[genomeSequence, #], #[[1]]} & /@ 
     StringPosition[genomeSequence, RegularExpression[genomeRegEx]]
    );
  subsequencesLength = Length[subsequences];

  Do[
   seq = seqdata[[1]];
   seqLoc = seqdata[[2]];
   partsPosHashSet = CreateDataStructure["OrderedHashSet"];
   Do[
    partsPosHashSet[
       "Insert", #] & /@ ({#[[1]], numberedPart[[1]], 
         numberedPart[[2]]} & /@ 
       StringPosition[seq, numberedPart[[1]]]),
    {numberedPart, primerCharsOrder}
    ];
   partsPos = SortBy[Normal[partsPosHashSet], {First, Last}];
   Clear[partsPosHashSet];
   partsPosLength = Length[partsPos];

   orderedPartsPosHashSet = CreateDataStructure["OrderedHashSet"];
   partnoHashSet = CreateDataStructure["HashSet"];
   pos = 1;
   loc = First[partsPos[[pos]]];
   partno = Last[partsPos[[pos]]];
   orderedPartsPosHashSet["Insert", partsPos[[pos]]];
   partnoHashSet["Insert", partno];
   While[(pos < partsPosLength) \[And] (partno <= 
       primerCharsOrderLength),
    posPrior = pos;
    locPrior = loc;
    partnoPrior = partno;
    pos++;
    loc = First[partsPos[[pos]]];
    partno = Last[partsPos[[pos]]];
    While[((loc == locPrior) \[Or] (partno <= 
          partnoPrior) \[Or] (loc < partno)) \[And] (pos < 
        partsPosLength),
     pos++;
     loc = First[partsPos[[pos]]];
     partno = Last[partsPos[[pos]]];
     ];
    If[! partnoHashSet["MemberQ", partno],
     partnoHashSet["Insert", partno];
     orderedPartsPosHashSet["Insert", partsPos[[pos]]];
     ];
    ];
   Clear[partnoHashSet];
   orderedPartsPos = Normal[orderedPartsPosHashSet];
   Clear[orderedPartsPosHashSet];
   orderedPartsPosLength = Length[orderedPartsPos];

   correlationError = primerCharsOrderLength - orderedPartsPosLength;
   If[(First[Last[orderedPartsPos]] == 
       primerExtentBp) \[And] (correlationError < correlationSup),
    primerFormsHashSet["Insert",
      {
       widthAllowanceBp,
       seq,
       seqLoc,
       Last[orderedPartsPos][[1]] - orderedPartsPos[[1, 1]] + 1,
       correlationError,
       orderedPartsPos
       }];
    ];
   {},
   {seqdata, subsequences}
   ];
  {},
  {widthAllowanceBp, 0, 5}
  ];
primerForms = Normal[primerFormsHashSet];
Clear[primerFormsHashSet];
primerFormsLength = Length[primerForms] - 1; (* sans header *)
Print["primerForms (", primerFormsLength, "):"];
Print[Grid[primerForms, Alignment -> Left, Spacings -> Automatic, 
   Frame -> All, FrameStyle -> Gray]];

POSTED BY: Richard Frost

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback