Message Boards

Process very large FASTQ files?

Posted 1 year ago | 2293 Views | 3 Replies | 7 Total Likes
Hi,

I apologize in advance if this is a very basic question. I have a lot of experience with other languages (mostly a Python user nowadays), but I have decided to give the Wolfram Language a try, as it is, in my view, clearly superior as a programming language.

I am trying to do a very basic operation: read a FASTQ file. A very big one. The problem is that the only operation I have found (Import) tries to read the whole file into memory. This does not work.

The alternative I used was to write a parser myself that returns record by record, but that is not a long-term solution; I would like to use Mathematica's internal FASTQ parser and not my own.

Any suggestions on how to read a very big file using the provided FASTQ parser (say, record by record)?

Again, I suspect the solution is trivial. Apologies for the simple question, I really did try to find answers elsewhere.

Thanks

3 Replies

The only way I can think of to use the built-in "FASTQ" importer chunk by chunk is to read in one entry as a string and then feed it to ImportString. Something like this works:

ClearAll@getFASTQRecord
(* read the next four-line FASTQ record from the stream and hand it to the built-in importer *)
getFASTQRecord[stream_InputStream, elem_: "Sequence"] := Module[
    {record},
    record = ReadList[stream, String, 4];
    If[{} =!= record,
       ImportString[
         StringRiffle[record, "\n"],
         {"FASTQ", elem}
       ],
       EndOfFile
    ]
]

You can try this on a large "FASTQ" file downloaded from this University of Washington site; it's compressed from ~500 MB to 54 MB for download.

In[168]:= file = "~/Downloads/Xpression_example_dataset/example.fastq";
stream = OpenRead[file];

In[170]:= getFASTQRecord[stream, "LabeledData"]

Out[170]= {"HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" ->     
    {"CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG", 
       "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"}}

But this method is fairly slow: at about 2 milliseconds per entry on my machine, it would take roughly 100 minutes to read in all the entries from that example file. It would be faster to skip the importer and partition the data yourself via

ClearAll@getFASTQRecord
(* parse the next four-line FASTQ record directly, skipping the importer *)
getFASTQRecord[stream_InputStream] := Module[
    {record},
    record = ReadList[stream, String, 4];
    If[{} =!= record,
       StringReplace[
         First@record,
         StartOfString ~~ "@" -> ""  (* strip only the leading "@" of the header line *)
       ] -> AssociationThread[
         {"Sequence", "Qualities"},
         record[[{2, 4}]]
       ],
       EndOfFile
    ]
]

which returns a rule

In[183]:= getFASTQRecord[stream]

Out[183]= "HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" -> 
    <|"Sequence" -> "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG", 
      "Qualities" -> "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"|>

and then you read in all the entries with

dataset = <||>;
Do[
   entry = getFASTQRecord[stream];
   If[MatchQ[entry, _Rule],
      AssociateTo[dataset, entry]
   ],
   {n, 3000000}] ~Monitor~ n

which runs in about 50 seconds on my machine. Now you have a nested association with 2,979,809 entries.

In[191]:= dataset["HWUSI-EAS300R_0005_FC62TL2AAXX:8:33:12239:1821#0/1", "Sequence"]

Out[191]= "TTAGCTTTTGTATTATGGGCCAGCGACTTAATTTAACGAG"

Amazing answer, thanks. That brings up the more general question of how to process big files, but I will open a separate discussion topic about it.

There are two very nice features to this solution that I would like to point out.

First, by using an Association (the symbol dataset in the code) and AssociateTo, the result can grow incrementally without the data-copying penalty that a List with AppendTo would incur.
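A quick, illustrative timing comparison (absolute numbers will of course vary by machine) makes the difference visible; the `list` and `assoc` symbols here are just scratch names for the sketch:

    (* AppendTo copies the whole list on each call, so total cost grows quadratically *)
    list = {};
    AbsoluteTiming[Do[AppendTo[list, i -> i], {i, 10^4}];]

    (* AssociateTo grows the association in roughly constant amortized time per insertion *)
    assoc = <||>;
    AbsoluteTiming[Do[AssociateTo[assoc, i -> i], {i, 10^4}];]

The first timing balloons as the iteration count grows, while the second stays close to linear, which is why the 3-million-record loop above remains practical.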

Second, the header for each sequence is retained as the Key for each sequence Association in dataset, a very economical tactic.
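For instance (an illustrative sketch using the `dataset` association built above), the headers-as-keys design makes membership tests and positional lookups direct:

    (* check whether a given read is present *)
    KeyExistsQ[dataset, "HWUSI-EAS300R_0005_FC62TL2AAXX:8:33:12239:1821#0/1"]

    (* pull the quality string of the first record without knowing its header *)
    dataset[[1, "Qualities"]]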
