Stripping Strings to Get Numerics

Posted 10 years ago

I am importing a CSV file of tabulation data generated by a market research package.

Each value field starts with a number, but then has character qualifiers and flags that relate to whether or not it is a percentage, whether the sample size is low, and so on. So example fields might include...

55
12.7
27%
33.98%
89*
91%*abe

...and so on. The numerical values are ALWAYS the first part of the field. So you never see abe91%, for example. I already know which fields represent percentages (so I don't need to preserve the percent sign, nor do I need even to know it is there).

I can think of several inelegant ways of stripping out the text to get at the raw number (and decimals). But I know there must be one -- or several -- elegant solution(s).

Anyone know an elegant solution?

Thanks in advance

Brad

POSTED BY: Brad Varey
8 Replies

I don't know if this is "elegant" but it certainly works because your numbers come first...

In[1]:= GetMyNumber[str_String] := 
 ToExpression[StringJoin@StringCases[str, DigitCharacter | "."]]

In[2]:= someNumbers = {"55", "12.7", "27%", "33.98%", "89", "91%abe"}

Out[2]= {"55", "12.7", "27%", "33.98%", "89", "91%abe"}

In[3]:= GetMyNumber /@ someNumbers

Out[3]= {55, 12.7, 27, 33.98, 89, 91}

Unfortunately, something more elegant, like using Interpreter["Number"] on your strings, will not work, because several of your forms are not recognized number formats. Interpreter[...] can also be quite slow, as it often uses the Cloud to do the interpretation.

Here is an example of the problem in the more general (semantic interpreter) case, showing an Interpreter succeeding on some of your examples and failing on others:

In[4]:= Interpreter["InactiveSemanticExpression"] /@ someNumbers

Out[4]= {55, 12.7, Inactive[Quantity][27, "Percent"], 
 Inactive[Quantity][33.98, "Percent"], 89, Failure[
 "InterpretationFailure", 
Association[
  "MessageTemplate" :> MessageName[Interpreter, "semantic"], 
   "MessageParameters" -> Association["Input" -> "91%abe"], 
   "Input" -> "91%abe", "Type" -> "Expression"]]}

(The output above may not format correctly in this forum.)

POSTED BY: David Reiss

David's solution is much more elegant and can deal with more cases, but I was already trying to get the same, so I thought I might as well post my solution.

extractNumbers[list_List] := ToExpression[StringSplit[#, {"%", LetterCharacter ..}][[1]]] & /@ list

and then

list={"55", "12.7", "27%", "33.98%", "89", "91%abe"};
extractNumbers[list]

which gives

{55, 12.7, 27, 33.98, 89, 91}

I guess I did not know the function DigitCharacter.

Cheers,

Marco

POSTED BY: Marco Thiel

Of course, my function gives bizarre results in cases like the one below, so careful restriction to exactly the described cases is important:

In[5]:= GetMyNumber["5 is a number. But 6 is not"]

Out[5]= 5.6

And those cases cannot have either digits or periods after the actual leading number.
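One way to avoid picking up digits or periods later in the field is to anchor the match at the start of the string. The following is only a minimal sketch, not something posted in the thread: the name leadingNumber and the Missing["NoLeadingNumber"] fallback are hypothetical choices, and it assumes a version with the built-in NumberString pattern and the two-argument form of First.

```wolfram
(* leadingNumber is a hypothetical helper name, not from the thread. *)
ClearAll[leadingNumber];

(* Match a number only at the very start of the string and ignore the rest.
   If the field does not start with a number, return a Missing[...] marker. *)
leadingNumber[str_String] := First[
  StringCases[str, StartOfString ~~ (n : NumberString) :> ToExpression[n], 1],
  Missing["NoLeadingNumber"]]

fields = {"55", "12.7", "27%", "33.98%", "89*", "91%*abe"};
leadingNumber /@ fields
(* {55, 12.7, 27, 33.98, 89, 91} *)

leadingNumber["5 is a number. But 6 is not"]
(* 5 *)
```

Because the pattern is tied to StartOfString, trailing flags, percent signs, and any later digits never reach ToExpression.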

POSTED BY: David Reiss
Posted 10 years ago

Thanks, David and Marco

I've employed David's solution, but it gives rise to another problem.

Import[], which is how I get the CSV file into Mathematica in the first place, returns strings for values such as...

27%
91%abe

...but returns what are effectively integers for values such as...

9356

... and these choke on the str_String parameter specifier in David's solution.

I thought the quick fix would be to supply another function of the same name without the _String type specifier, which should be automatically called if I supplied it with an integer, or any other non-string value. So I coded...

GetMyNumber[val_] := val

This seems to work, but I am never entirely sure about whether relying on such polymorphism is a more efficient route than, say, some method of coding just one function with a non-type-specific parameter, and sorting through the type alternatives at run-time.

Any thoughts?

POSTED BY: Brad Varey

Yes, your solution should work. I should have realized that the import mechanism would already create numbers for those values that are unambiguously interpreted as numbers. As I was reading the first lines of your post, I was already going to write a solution like that for you. But to be more confident that nothing odd slips through the generality of what you wrote (since it uses val_, which is a pattern that matches anything), I'd use

GetMyNumber[val_?NumberQ] := val

I think that this should serve you well. Let us know if there's anything that slips through...
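For reference, here is a minimal sketch of how the separate definitions discussed above might sit together and be applied to an imported table. The name getNumber, the catch-all returning Missing["NotNumeric", ...], and the small table are all assumptions made for illustration. The string rule is the one from earlier in the thread.

```wolfram
(* getNumber is a hypothetical name; the string rule mirrors GetMyNumber above. *)
ClearAll[getNumber];

(* Strings: keep digits and periods, then convert to a number. *)
getNumber[str_String] :=
 ToExpression[StringJoin[StringCases[str, DigitCharacter | "."]]]

(* Values that Import already turned into numbers pass through unchanged. *)
getNumber[val_?NumberQ] := val

(* Anything else is flagged rather than silently returned (an assumption). *)
getNumber[other_] := Missing["NotNumeric", other]

(* A made-up 2x2 table standing in for imported CSV data. *)
table = {{"27%", 9356}, {"91%abe", 12.7}};
Map[getNumber, table, {2}]
(* {{27, 9356}, {91, 12.7}} *)
```

The pattern matcher picks the matching definition for each cell, so no explicit type test is needed inside a single function body.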

POSTED BY: David Reiss

I could do more exclusions such as:

extractNumbers[list_List] := ToExpression[StringSplit[#, {"%", " ", LetterCharacter ..}][[1]]] & /@ list

which would make

extractNumbers[{"5 is a number. But 6 is not"}]

give

{5}.

But this would end up patching every case individually.

Best wishes,

Marco

POSTED BY: Marco Thiel

How's this?

list = {"55", "12.7", "27%", "33.98%", "89", "91%abe"};
ToExpression/@StringJoin/@(StringCases[#, DigitCharacter | "."] & /@ list)

Result is

{55, 12.7, 27, 33.98, 89, 91}

Or this one: (I find it theoretically more elegant, but it's slightly longer.)

ToExpression@StringJoin@StringSplit[#, Except[DigitCharacter | "."]] & /@ list

POSTED BY: Jesse Friedman
Posted 10 years ago

Another variation:

iNumbers[str_String] := 
 StringCases[str, x : Except[" ", (DigitCharacter | ".") ..] :> ToExpression[x]]

iNumbers[str_List] := Flatten[iNumbers /@ str]

iNumbers[str_] := str

result:

In[74]:= iNumbers[{"5 is a number. But 6 is not", 234, "55", "12.7",   "27%", "33.98%", "91%abe"}]

Out[74]= {5, 6, 234, 55, 12.7, 27, 33.98, 91}

POSTED BY: Jaebum Jung