Group Abstract

Message Boards

WOLFRAM COMMUNITY

11.4K Views

3 Replies

7 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Biological Sciences Data Science Wolfram Language

[?] Process very large FASTQ files?

Tiago Antao

Tiago Antao, University of Montana

Posted 8 years ago

Hi, I apologize in advance if this is a very basic question. I have a lot of experience with other languages (mostly a Python user nowadays), but I have decided to give a try to the Wolfram Language as it is, in my view, clearly superior as a programming language. I am trying to do a very basic operation: read a FASTQ file. A very big one. The problem is that the only operation that I have found (Import) tries to read the whole file to memory. This does not work. The alternative that I used was to write a parser myself that returns record by record, but that is not a long term solution, I would like to use Mathematica's internal FASTQ parser and not my own. Any suggestion on how to read a very big file using the provided FASTQ parser (say record by record)? Again, I suspect the solution is trivial. Apologies for the simple question, I really did try to find answers elsewhere. Thanks

POSTED BY: Tiago Antao

3 Replies

Sort By:

Robert Nachbar

Robert Nachbar, Wolfram Research Inc.

Posted 8 years ago

There are two very nice features to this solution that I would like to point out. First, by using an Association (the symbol dataset in the code) and AssociateTo, the size of the result can grow incrementally with no performance penalty for copying data, which would have been the case using List and AppendTo. Second, the header for each sequence is retained as the Key for each sequence Association in dataset, a very economic tactic.

POSTED BY: Robert Nachbar

Jason Biggs

Jason Biggs, Wolfram Alpha

Posted 8 years ago

The only way I can think to use the built-in "FASTQ" importer chunk by chunk is to read in one entry as a string and then feed it to `ImportString`. Something like this works ClearAll@getFASTQRecord getFASTQRecord[stream_InputStream,elem_:"Sequence"]:=Module[ {record}, record = ReadList[stream,String,4]; If[{}=!=record, ImportString[ StringRiffle[ record, "\n" ], {"FASTQ", elem} ], EndOfFile ] ] You can try this on a large "FASTQ" file downloaded from this University of Washington site, it's compressed from ~500MB to 54MB for downloading. In[168]:= file = "~/Downloads/Xpression_example_dataset/example.fastq"; stream = OpenRead[file]; In[170]:= getFASTQRecord[stream, "LabeledData"] Out[170]= {"HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" -> {"CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG", "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"}} But this method is fairly slow - at 2 milliseconds per entry on my machine it would take 100 minutes to read in all the entries from that example file. It would be faster to skip the importer and partition the data yourself via ClearAll@getFASTQRecord getFASTQRecord[ stream_InputStream ]:=Module[ {record}, record = ReadList[stream,String,4]; If[ {} =!= record, StringReplace[ First@record, "@"->"" ] -> AssociationThread[ {"Sequence","Qualities"}, record[[{2,4}]] ] ] ] which returns a rule In[183]:= getFASTQRecord[stream] Out[183]= "HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" -> <\|"Sequence" -> "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG", "Qualities" -> "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"\|> and then you read in all the entries with dataset = <\|\|>; Do[ entry = getFASTQRecord[stream]; If[ MatchQ[entry, _Rule], AssociateTo[dataset, entry], Continue[] ], {n, 3000000}];~Monitor~n which runs in 50 seconds on my machine now you have a nested association with 2,979,809 entries. In[191]:= dataset["HWUSI-EAS300R_0005_FC62TL2AAXX:8:33:12239:1821#0/1", "Sequence"] Out[191]= "TTAGCTTTTGTATTATGGGCCAGCGACTTAATTTAACGAG"

The only way I can think to use the built-in "FASTQ" importer chunk by chunk is to read in one entry as a string and then feed it to ImportString. Something like this works

ClearAll@getFASTQRecord
getFASTQRecord[stream_InputStream,elem_:"Sequence"]:=Module[
    {record},
    record = ReadList[stream,String,4];
    If[{}=!=record,
       ImportString[
         StringRiffle[
          record,
          "\n"
         ],
         {"FASTQ", elem}
       ],
       EndOfFile
    ]
]

You can try this on a large "FASTQ" file downloaded from this University of Washington site, it's compressed from ~500MB to 54MB for downloading.

In[168]:= file = "~/Downloads/Xpression_example_dataset/example.fastq";
stream = OpenRead[file];

In[170]:= getFASTQRecord[stream, "LabeledData"]

Out[170]= {"HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" ->     
    {"CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG", 
       "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"}}

But this method is fairly slow - at 2 milliseconds per entry on my machine it would take 100 minutes to read in all the entries from that example file. It would be faster to skip the importer and partition the data yourself via

ClearAll@getFASTQRecord
getFASTQRecord[ stream_InputStream ]:=Module[
    {record},
    record = ReadList[stream,String,4];
    If[ {} =!= record,
       StringReplace[ 
         First@record,
         "@"->""
       ] -> AssociationThread[
         {"Sequence","Qualities"},
         record[[{2,4}]]
         ]

    ]
]

which returns a rule

In[183]:= getFASTQRecord[stream]

Out[183]= "HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1" -> 
    <|"Sequence" -> "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG", 
      "Qualities" -> "acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff"|>

and then you read in all the entries with

dataset = <||>;
Do[
   entry = getFASTQRecord[stream];
   If[
        MatchQ[entry, _Rule],
        AssociateTo[dataset, entry],
        Continue[]
    ],
   {n, 3000000}];~Monitor~n

which runs in 50 seconds on my machine now you have a nested association with 2,979,809 entries.

In[191]:= dataset["HWUSI-EAS300R_0005_FC62TL2AAXX:8:33:12239:1821#0/1", "Sequence"]

Out[191]= "TTAGCTTTTGTATTATGGGCCAGCGACTTAATTTAACGAG"

POSTED BY: Jason Biggs

Tiago Antao

Tiago Antao, University of Montana

Posted 8 years ago

Amazing answer, thanks. That brings up the more general question on how to process big files, but I will open a separate discussion topic about it.

POSTED BY: Tiago Antao

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback