Stripping Strings to Get Numerics

Posted 10 years ago

I am importing a CSV file of tabulation data generated by a market research package.

Each value field starts with a number, but then has character qualifiers and flags that relate to whether or not it is a percentage, whether the sample size is low, and so on. So example fields might include...


...and so on. The numerical values are ALWAYS the first part of the field. So you never see abe91%, for example. I already know which fields represent percentages (so I don't need to preserve the percent sign, nor do I need even to know it is there).

I can think of several inelegant ways of stripping out the text to get at the raw number (and decimals). But I know there must be one -- or several -- elegant solution(s).

Anyone know an elegant solution?

Thanks in advance


POSTED BY: Brad Varey
8 Replies

I don't know if this is "elegant" but it certainly works because your numbers come first...

In[1]:= GetMyNumber[str_String] := 
 ToExpression[StringJoin@StringCases[str, DigitCharacter | "."]]

In[2]:= someNumbers = {"55", "12.7", "27%", "33.98%", "89", "91%abe"}

Out[2]= {"55", "12.7", "27%", "33.98%", "89", "91%abe"}

In[3]:= GetMyNumber /@ someNumbers

Out[3]= {55, 12.7, 27, 33.98, 89, 91}

Unfortunately, something more elegant, like using Interpreter["Number"] on your strings, will not work, since several of your forms are not recognized number forms. Also, Interpreter[...] can be quite slow, as it often uses the Cloud to do the interpretation.

Here is the more general (semantic interpreter) case, showing where an Interpreter works on your examples and where it fails:

In[4]:= Interpreter["InactiveSemanticExpression"] /@ someNumbers

Out[4]= {55, 12.7, Inactive[Quantity][27, "Percent"], 
 Inactive[Quantity][33.98, "Percent"], 89, Failure[
  "MessageTemplate" :> MessageName[Interpreter, "semantic"], 
   "MessageParameters" -> Association["Input" -> "91%abe"], 
   "Input" -> "91%abe", "Type" -> "Expression"]]}

(This output may not format correctly in this forum.)

POSTED BY: David Reiss

David's solution is much more elegant and can deal with more cases, but I was already trying to get the same, so I thought I might as well post my solution.

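(* LetterCharacter replaces CharacterRange["a", "Z"], which is an empty range because "a" sorts after "Z" *)
extractNumbers[list_List] := ToExpression[StringSplit[#, {"%", LetterCharacter ..}][[1]]] & /@ list

and then

list={"55", "12.7", "27%", "33.98%", "89", "91%abe"};
extractNumbers[list]

which gives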

{55, 12.7, 27, 33.98, 89, 91}

I guess I did not know the function DigitCharacter.



POSTED BY: Marco Thiel

Of course my function gives bizarre results in cases like this! So careful restriction to exactly the described cases is important:

In[5]:= GetMyNumber["5 is a number. But 6 is not"]

Out[5]= 5.6

So the strings cannot contain any further digits or periods after the actual leading number.
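
One way to sidestep that restriction is to match only a number anchored at the start of the string, rather than collecting every digit and period. The following is a minimal sketch, not a tested drop-in: leadingNumber is a hypothetical name introduced here, and it assumes each field begins with a number that the built-in NumberString pattern recognizes.

```mathematica
(* leadingNumber is a hypothetical helper, not part of the original solution. *)
(* It keeps only a number anchored at the very start of the string.          *)
ClearAll[leadingNumber];
leadingNumber[str_String] :=
 First[
  StringCases[str, StartOfString ~~ (n : NumberString) :> ToExpression[n]],
  Missing["NoLeadingNumber"]  (* returned when the string does not start with a number *)
 ]

leadingNumber /@ {"55", "12.7", "27%", "33.98%", "89", "91%abe"}
(* {55, 12.7, 27, 33.98, 89, 91} *)

leadingNumber["5 is a number. But 6 is not"]
(* 5 *)
```

Anything after the leading number is ignored, so later digits or periods no longer get glued onto the result.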

POSTED BY: David Reiss
Posted 10 years ago

Thanks, David and Marco

I've employed David's solution, but it gives rise to another problem.

Import[], which is how I get the CSV file into Mathematica in the first place, returns strings for values such as...

27% 91%abe

...but returns what are effectively integers for values such as...


... and these choke on the str_String parameter specifier in David's solution.

I thought the quick fix would be to supply another function of the same name without the _String type specifier, which should be automatically called if I supplied it with an integer, or any other non-string value. So I coded...

GetMyNumber[val_] := val

This seems to work, but I am never entirely sure about whether relying on such polymorphism is a more efficient route than, say, some method of coding just one function with a non-type-specific parameter, and sorting through the type alternatives at run-time.

Any thoughts?

POSTED BY: Brad Varey

Yes, your solution should work. I should have realized that the import mechanism would already create numbers for those values that are unambiguously interpreted as numbers. As I was reading the first lines of your post, I was already going to write a solution like that for you. But to be more confident that nothing odd slips through the generality of what you wrote (since it uses val_, which is a pattern that matches anything), I'd use

GetMyNumber[val_?NumberQ] := val

I think that this should serve you well. Let us know if there's anything that slips through...
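
To make the comparison concrete, here is a minimal sketch of the pattern-based definitions next to a single function that sorts out the type itself. The names toNumber and toNumberSwitch are hypothetical, and the Missing[...] fallback is an assumption about how unexpected values might be flagged; neither comes from the posts above.

```mathematica
(* Hypothetical names; the string rule mirrors GetMyNumber from earlier in the thread. *)
ClearAll[toNumber, toNumberSwitch];

(* Variant 1: one definition per kind of argument; the pattern matcher picks the rule. *)
toNumber[str_String] := ToExpression[StringJoin[StringCases[str, DigitCharacter | "."]]];
toNumber[val_?NumberQ] := val;
toNumber[other_] := Missing["NotANumber", other];  (* assumed fallback, flags anything unexpected *)

(* Variant 2: a single definition that inspects the argument itself. *)
toNumberSwitch[val_] := Switch[val,
  _String, ToExpression[StringJoin[StringCases[val, DigitCharacter | "."]]],
  _?NumberQ, val,
  _, Missing["NotANumber", val]]

toNumber /@ {"27%", 89, 12.7, "91%abe", Null}
(* {27, 89, 12.7, 91, Missing["NotANumber", Null]} *)
```

Both variants should return the same results on these inputs. Separate definitions are the more common Wolfram Language style, and they make it easy to add a new case later without touching the existing rules.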

POSTED BY: David Reiss

I could do more exclusions such as:

extractNumbers[list_List] := ToExpression[StringSplit[#, {"%", " ", LetterCharacter ..}][[1]]] & /@ list

which would make

extractNumbers[{"5 is a number. But 6 is not"}]



But this would end up patching every case individually.

Best wishes,


POSTED BY: Marco Thiel

How's this?

list = {"55", "12.7", "27%", "33.98%", "89", "91%abe"};
ToExpression/@StringJoin/@(StringCases[#, DigitCharacter | "."] & /@ list)

Result is

{55, 12.7, 27, 33.98, 89, 91}

Or this one: (I find it theoretically more elegant, but it's slightly longer.)

ToExpression@StringJoin@StringSplit[#, Except[DigitCharacter | "."]] & /@ list
POSTED BY: Jesse Friedman
Posted 10 years ago

Another variation:

iNumbers[str_String] := 
 StringCases[str, x : Except[" ", (DigitCharacter | ".") ..] :> ToExpression[x]]

iNumbers[str_List] := Flatten[iNumbers /@ str]

iNumbers[str_] := str


In[74]:= iNumbers[{"5 is a number. But 6 is not", 234, "55", "12.7",   "27%", "33.98%", "91%abe"}]

Out[74]= {5, 6, 234, 55, 12.7, 27, 33.98, 91}
POSTED BY: Jaebum Jung