Line count/patterns

Posted 11 years ago
Given: contents of a text file, which may have Unix or Windows line endings, as a byte list
Task: compute the line count
Example:
bytes={65, 13, 10, 66, 67, 13, 10} 
I thought Count[bytes, PatternSequence[13,10]] might do the trick, but it gives zero. Count[bytes, 10|13]/2 works, but only if I know that I'm getting Windows line endings.

I don't want an iterative approach.
POSTED BY: Joel Klein
4 Replies
I think this works
Length[Split[bytes, #1 != 10 && (#1 != 13 || #2 == 10) &]] 
It's a little counterintuitive in this context that Split splits when the test function returns False; thus the test is
!(#1==10 || (#1 == 13&&#2!=10))
so that a 13 followed by a 10 is not split. A lone 13 still ends a line, because I assume you also want classic Mac line endings to work.
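To see where Split actually breaks the list, here is a small sketch on the example bytes. The name keepTogether is only for illustration and is not part of the original answer.

```mathematica
bytes = {65, 13, 10, 66, 67, 13, 10};

(* True means "keep these two neighbours in the same run";
   Split starts a new run wherever this returns False *)
keepTogether = #1 != 10 && (#1 != 13 || #2 == 10) &;

Split[bytes, keepTogether]
(* {{65, 13, 10}, {66, 67, 13, 10}} *)

(* a bare 13 (classic Mac ending) also ends a line *)
Split[{65, 13, 66, 13}, keepTogether]
(* {{65, 13}, {66, 13}} *)
```

Note that this counts runs rather than terminators, so a final line without a line ending is still counted: {65, 13, 10, 66} splits into two runs.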
POSTED BY: Jeremy Michelson
Posted 11 years ago
If you assume it's consistent within the list (either {10} or {13,10} or {13} for every line ending), then something like
In[103]:= Max[Count[bytes, 10], Count[bytes, 13]]
Out[103]= 2 
is probably pretty efficient. If you want to support a mix and match, something like this should work:
In[104]:= StringCount[FromCharacterCode[bytes], RegularExpression["\r\n?|\n"]]
Out[104]= 2 
Count[bytes, PatternSequence[13,10]] doesn't work because Count doesn't support sequence matching, only element matching. To do that kind of thing for Mathematica expressions, one typical approach is
In[109]:= Count[Partition[bytes, 2, 1, 1], {13, 10}]
Out[109]= 2 
but it's not very efficient (and wouldn't cover the other cases anyway).
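Two side notes on the Partition approach, as a hedged sketch. Newer versions of Mathematica (10.1 and later, as far as I know) have SequenceCount, which counts a sublist directly. Also, the final 1 in Partition[bytes, 2, 1, 1] pads cyclically, so the last pair wraps around to the first element; this is harmless for the example but could add a false {13, 10} for a list that ends in 13 and starts with 10.

```mathematica
bytes = {65, 13, 10, 66, 67, 13, 10};

(* counts occurrences of the sublist {13, 10} *)
SequenceCount[bytes, {13, 10}]
(* 2 *)

(* with the overhang argument, the last pair wraps around to the start *)
Partition[{10, 65, 13}, 2, 1, 1]
(* {{10, 65}, {65, 13}, {13, 10}} *)

(* without it there is no wrap-around pair *)
Partition[{10, 65, 13}, 2, 1]
(* {{10, 65}, {65, 13}} *)
```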
POSTED BY: Oyvind Tafjord
For those who assume the last line is terminated by a newline, here's another approach:
newlines = 1 - Unitize[bytes - 10];
returns = 1 - Unitize[bytes - 13];  (* 13 is a carriage return, not a form feed *)
both = Rest[newlines]*Most[returns];  (* 1 where a 13 is immediately followed by a 10 *)
Total[newlines] + Total[returns] - Total[both]
I haven't checked the relative efficiency.
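To make the arithmetic concrete, here are the intermediate 0/1 vectors for the example bytes. The short names lf, cr and crlf are illustrative only.

```mathematica
bytes = {65, 13, 10, 66, 67, 13, 10};

lf = 1 - Unitize[bytes - 10]    (* {0, 0, 1, 0, 0, 0, 1} : 1 where the byte is 10 *)
cr = 1 - Unitize[bytes - 13]    (* {0, 1, 0, 0, 0, 1, 0} : 1 where the byte is 13 *)

(* shift lf left by one and multiply: 1 where a 13 is immediately followed by a 10 *)
crlf = Rest[lf]*Most[cr]        (* {0, 1, 0, 0, 0, 1} *)

(* each 13-10 pair was counted twice, so subtract it once *)
Total[lf] + Total[cr] - Total[crlf]
(* 2 *)
```

Because this counts terminators, a final line with no line ending is not counted, which is why the assumption about the last line matters.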
POSTED BY: Jeremy Michelson
What would you expect for the following?  Or are you reasonably certain this won't come up?
bytes={65, 13, 10, 66, 67, 10} 
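For reference, a sketch of what the approaches posted in this thread return on that mixed input. The variable name mixed is only for illustration.

```mathematica
mixed = {65, 13, 10, 66, 67, 10};

Length[Split[mixed, #1 != 10 && (#1 != 13 || #2 == 10) &]]
(* 2 *)

Max[Count[mixed, 10], Count[mixed, 13]]
(* 2 *)

StringCount[FromCharacterCode[mixed], RegularExpression["\r\n?|\n"]]
(* 2 *)

Count[mixed, 10 | 13]/2
(* 3/2 *)
```

The Max version agrees here only because every line ending in this input contains a 10. An input such as {65, 13, 66, 10} has two line endings, but Max gives 1.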
POSTED BY: Brett Champion