# [✓] Process very large FASTQ files?

Posted 1 year ago | 2723 Views | 3 Replies | 7 Total Likes
Hi, I apologize in advance if this is a very basic question. I have a lot of experience with other languages (mostly a Python user nowadays), but I have decided to give the Wolfram Language a try, as it is, in my view, clearly superior as a programming language.

I am trying to do a very basic operation: read a FASTQ file. A very big one. The problem is that the only operation I have found (Import) tries to read the whole file into memory, which does not work.

The alternative was to write a parser myself that returns record by record, but that is not a long-term solution; I would like to use Mathematica's internal FASTQ parser rather than my own.

Any suggestion on how to read a very big file using the provided FASTQ parser (say, record by record)?

Again, I suspect the solution is trivial. Apologies for the simple question; I really did try to find answers elsewhere.

Thanks
Posted 1 year ago
The only way I can think of to use the built-in "FASTQ" importer chunk by chunk is to read in one entry as a string and then feed it to ImportString. Something like this works:

```
ClearAll@getFASTQRecord
getFASTQRecord[stream_InputStream, elem_: "Sequence"] :=
 Module[{record},
  record = ReadList[stream, String, 4];
  If[{} =!= record,
   ImportString[StringRiffle[record, "\n"], {"FASTQ", elem}],
   EndOfFile]]
```

You can try this on a large FASTQ file downloaded from this University of Washington site; it's compressed from ~500 MB to 54 MB for downloading.

```
In[168]:= file = "~/Downloads/Xpression_example_dataset/example.fastq";
stream = OpenRead[file];

In[170]:= getFASTQRecord[stream, "LabeledData"]

Out[170]= {"HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" ->
   {"CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
    "acdd^aa_Z^d^ddc^_Q_aaa_ddc\\dfdffff\\fff"}}
```

But this method is fairly slow: at 2 milliseconds per entry on my machine, it would take 100 minutes to read in all the entries from that example file. It would be faster to skip the importer and partition the data yourself via

```
ClearAll@getFASTQRecord
getFASTQRecord[stream_InputStream] :=
 Module[{record},
  record = ReadList[stream, String, 4];
  If[{} =!= record,
   StringReplace[First@record, "@" -> ""] ->
    AssociationThread[{"Sequence", "Qualities"}, record[[{2, 4}]]]]]
```

which returns a rule:

```
In[183]:= getFASTQRecord[stream]

Out[183]= "HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" ->
  <|"Sequence" -> "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
   "Qualities" -> "acdd^aa_Z^d^ddc^_Q_aaa_ddc\\dfdffff\\fff"|>
```

You can then read in all the entries with

```
dataset = <||>;
Do[
   entry = getFASTQRecord[stream];
   If[MatchQ[entry, _Rule],
    AssociateTo[dataset, entry],
    Continue[]],
   {n, 3000000}] ~Monitor~ n
```

which runs in 50 seconds on my machine. Now you have a nested association with 2,979,809 entries:

```
In[191]:= dataset["HWUSI-EAS300R_0005_FC62TL2AAXX:8:33:12239:1821#0/1", "Sequence"]

Out[191]= "TTAGCTTTTGTATTATGGGCCAGCGACTTAATTTAACGAG"
```
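
A possible extension of the approach above, for files too large even for an in-memory association: the same record-by-record reader can drive a streaming pass that keeps only aggregates. This is a sketch, not from the original reply; it assumes the fast `getFASTQRecord` defined above, and the GC-content statistic is just an illustrative choice.

```
(* Streaming pass over a FASTQ file: accumulate statistics without
   holding all records in memory. Assumes getFASTQRecord from above. *)
stream = OpenRead[file];
totalBases = 0; gcBases = 0; nRecords = 0;
While[MatchQ[entry = getFASTQRecord[stream], _Rule],
 With[{seq = entry[[2, "Sequence"]]},
  totalBases += StringLength[seq];
  gcBases += StringCount[seq, "G" | "C"];
  nRecords++]];
Close[stream];
N[gcBases/totalBases]  (* overall GC fraction *)
```

Since `getFASTQRecord` returns something other than a `Rule` at end of file, the `MatchQ` test doubles as the loop's termination condition, and memory use stays constant regardless of file size.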