The only way I can think of to use the built-in "FASTQ" importer chunk by chunk is to read in one entry as a string and then feed it to ImportString. Something like this works:
ClearAll@getFASTQRecord
getFASTQRecord[stream_InputStream, elem_: "Sequence"] := Module[
  {record},
  (* a FASTQ record spans four lines: header, sequence, separator, qualities *)
  record = ReadList[stream, String, 4];
  If[{} =!= record,
    ImportString[StringRiffle[record, "\n"], {"FASTQ", elem}],
    EndOfFile
  ]
]
You can try this on a large "FASTQ" file downloaded from this University of Washington site; it is compressed from ~500MB to 54MB for downloading.
In[168]:= file = "~/Downloads/Xpression_example_dataset/example.fastq";
stream = OpenRead[file];
In[170]:= getFASTQRecord[stream, "LabeledData"]
Out[170]= {"HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" ->
{"CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
"acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"}}
But this method is fairly slow: at 2 milliseconds per entry on my machine, it would take about 100 minutes to read in all of the roughly 3 million entries from that example file. It would be faster to skip the importer and partition the data yourself via
ClearAll@getFASTQRecord
getFASTQRecord[stream_InputStream] := Module[
  {record},
  record = ReadList[stream, String, 4];
  If[{} =!= record,
    (* strip only the leading "@" of the header line, not every "@" in it *)
    StringReplace[First@record, StartOfString ~~ "@" -> ""] ->
      AssociationThread[{"Sequence", "Qualities"}, record[[{2, 4}]]]
  ]
]
which returns a rule
In[183]:= getFASTQRecord[stream]
Out[183]= "HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" ->
<|"Sequence" -> "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
"Qualities" -> "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"|>
and then you read in all the entries with
dataset = <||>;
Do[
  entry = getFASTQRecord[stream];
  If[MatchQ[entry, _Rule],
    AssociateTo[dataset, entry],
    Continue[]
  ],
  {n, 3000000}] ~Monitor~ n;
which runs in 50 seconds on my machine. You now have a nested association with 2,979,809 entries.
In[191]:= dataset["HWUSI-EAS300R_0005_FC62TL2AAXX:8:33:12239:1821#0/1", "Sequence"]
Out[191]= "TTAGCTTTTGTATTATGGGCCAGCGACTTAATTTAACGAG"
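If one call per record is still too much overhead, the same four-lines-per-record idea can be applied to several records at once. The following is only a minimal sketch and has not been timed: getFASTQChunk and the chunk size of 10000 are hypothetical names and values chosen for illustration, and it assumes every record spans exactly four lines (no wrapped sequences).

```wolfram
(* Hypothetical helper: read up to n records per call as a list of rules *)
ClearAll@getFASTQChunk
getFASTQChunk[stream_InputStream, n_Integer?Positive] := Module[
  {lines},
  (* assumption: each record is exactly 4 lines *)
  lines = ReadList[stream, String, 4 n];
  If[lines === {},
    EndOfFile,
    (* Partition drops an incomplete trailing record, if any *)
    (StringReplace[#[[1]], StartOfString ~~ "@" -> ""] ->
        <|"Sequence" -> #[[2]], "Qualities" -> #[[4]]|>) & /@
      Partition[lines, 4]
  ]
]

(* usage: start from a fresh stream and accumulate chunk by chunk *)
stream = OpenRead[file];
dataset = <||>;
While[ListQ[chunk = getFASTQChunk[stream, 10000]],
  AssociateTo[dataset, chunk]
];
Close[stream];
```

The loop stops when the helper returns EndOfFile, so no fixed iteration count is needed, and the resulting association has the same shape as the one built record by record.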