Simple Key Phrase Extraction For URL Classification

POSTED BY: David Johnston

Okay, here is what I've got so far. Not working, of course. Is "markovClassifier.m" just a name you give it so you can retrieve it again later, or is it a specific function? Basically, I am trying to test whether there is a classification already. I should probably name it the same as the urlTarget, so that if I change that URL parameter it does a fresh classification, but if I use the same URL again it uses the saved version.

CloudDeploy[
 APIFunction[{"textSample" -> "String", "urlTarget" -> "String"},
  Module[{url = #urlTarget, sample = #textSample,
     excludedDomains, linkList, texts, c},
    Which[
     url == "", "Empty URL Parameter",
     sample == "", "Empty Text Sample Parameter",
     True,
     If[CloudEvaluate[FileExistsQ["markovClassifier.m"]],
      (* a classifier was stored earlier, so reuse it *)
      c = CloudEvaluate[Get["markovClassifier.m"]],
      (* otherwise train a new one and store it *)
      excludedDomains = {"cart", "my-account", "pricing"};
      linkList = Commonest[
        Select[Import[url, {"HTML", "Hyperlinks"}],
         StringFreeQ[#, excludedDomains] && ! StringFreeQ[#, url] &],
        20];
      texts = Import /@ linkList;
      c = Classify[texts -> linkList];
      With[{cc = Compress[c]},
       CloudEvaluate[Put[Uncompress[cc], "markovClassifier.m"]]];
      Quiet[CloudDeploy[
        FormFunction[{"text" -> "String"},
         Get["markovClassifier.m"][#text] &], "myform"]]];
     c[sample, "Probabilities"]]] &,
  "JSON"]]
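
For the caching-by-URL idea, maybe something like this would work, deriving the stored file name from the urlTarget so that a new URL triggers a fresh classification (untested; cacheFile is just a helper name I made up):

cacheFile[url_String] := 
 "classifier-" <> IntegerString[Hash[url], 16] <> ".m"

With[{file = cacheFile[urlTarget]},
 If[CloudEvaluate[FileExistsQ[file]],
  c = CloudEvaluate[Get[file]],     (* reuse the stored classifier *)
  c = Classify[texts -> linkList];  (* otherwise train a new one... *)
  With[{cc = Compress[c]},
   CloudEvaluate[Put[Uncompress[cc], file]]]]]  (* ...and store it under the URL-specific name *)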
POSTED BY: David Johnston

Is "markovClassifier.m" just a name you give it so you can retrieve it again later, or is it a specific function?

It is a file name that is used to retrieve the data later.

POSTED BY: Hernan Moraldo

For filtering the links, you can use something like this:

urls = Import[
  "http://en.wikipedia.org/wiki/Albert_Einstein", {"HTML", 
   "Hyperlinks"}]

excludedDomains = {"http://commons.wikimedia.org/", "http://wiki.ubuntu.com"};

Select[urls, StringFreeQ[#, excludedDomains] &]

To keep only links that contain certain domains instead, you'd do:

includedDomains = {"http://en.wikipedia.org"};

Select[urls, ! StringFreeQ[#, includedDomains] &]
POSTED BY: Hernan Moraldo

That was an awesome tip. Thanks! Here is what I built from your snippet.

urlTarget = "http://www.businesstexter.com"
excludedDomains = {"cart", "my-account", "pricing"};
includedDomains = {urlTarget};

Commonest[
 Select[Select[Import[urlTarget, {"HTML", "Hyperlinks"}], 
   StringFreeQ[#, excludedDomains] &], ! 
    StringFreeQ[#, includedDomains] &], 20]
POSTED BY: David Johnston

You can use more complex patterns to control where the strings you want to exclude have to appear. For example, if I just say "i", this code will exclude all elements that have an "i" anywhere within them:

In[7]:= bigList = {"a", "item1", "item2", "item3", "item4", "item6", 
   "item7", "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[9]= {"a"}

I can instead specify that elements need to be excluded only if the "i" is surrounded by word boundaries (like symbols, spaces, etc., but not including digits):

In[22]:= bigList = {"a", "item1", "item2", "item3", "item4", "item6", 
   "item7", "item8", "item9", "i.5"};
excludedElements = {"item1", "item2", "item5", "item6", 
   WordBoundary ~~ "i" ~~ WordBoundary};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[24]= {"a", "item3", "item4", "item7", "item8", "item9"}

(notice the new pattern helps get rid of "i.5" but not of "item3").

Such string patterns are very useful for this kind of thing; see http://reference.wolfram.com/language/tutorial/WorkingWithStringPatterns.html for some examples, and http://reference.wolfram.com/language/tutorial/RegularExpressions.html in case you want to use regular expressions as well (although string patterns are usually enough).
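
For instance, the word-boundary pattern above could equivalently be written with a regular expression; PCRE's \b, like WordBoundary here, treats letters and digits as word characters, so this should give the same result on this list:

excludedElements = {"item1", "item2", "item5", "item6", 
   RegularExpression["\\bi\\b"]};  (* \b is a regex word boundary *)
Select[bigList, StringFreeQ[#, excludedElements] &]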

POSTED BY: Hernan Moraldo

OMG, I feel like such an idiot. The answer is StringMatchQ instead of StringFreeQ.

bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7", 
   "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};
Select[bigList, ! StringMatchQ[#, excludedElements] &]
POSTED BY: David Johnston

Ah, yes, you can also use StringMatchQ. I didn't mention it because I thought you'd want to filter URLs by a substring (e.g., their domains).
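
To make the difference concrete (example.com is just a made-up URL): StringMatchQ tests the whole string, while StringFreeQ looks for substrings:

urls = {"http://example.com/cart", "http://example.com/about", "cart"};

Select[urls, ! StringMatchQ[#, "cart"] &]  (* whole-string test: drops only the bare "cart" *)

Select[urls, StringFreeQ[#, "cart"] &]     (* substring test: also drops the /cart URL *)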

Thanks

POSTED BY: Hernan Moraldo

You are awesome! Playing with it now. :)

I didn't see the Put in this code, though. Is there a particular point in the sequence where it's best used, or should it be run separately?

POSTED BY: David Johnston

Thanks!

In the code above, it is much faster if you limit the size of the n-grams:

getNGrams[text_] := ToLowerCase@With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 3], 1]
  ]
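
For example, this version should give the 2-grams and 3-grams of a text (the last n-gram wraps around cyclically because of the 1 in Partition):

getNGrams["The quick brown fox"]

(* {"the quick", "brown fox", "the quick brown", "fox the quick"} *)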

If you use Put inside CloudEvaluate, it will store the result of an expression in a file, for example:

CloudEvaluate[Put[N[1/50], "data.m"]]

CloudDeploy[FormFunction[{"text" -> "String"}, Get["data.m"] &], "myform"]

I am checking with the people who built CloudDeploy to make sure this would actually be the best approach for storing the classifier, though.

POSTED BY: Hernan Moraldo

Regarding how to use it with CloudDeploy, you can do something like:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

texts = Import /@ urls;

c = Classify[texts -> urls]

With[{c = Compress[c]},
 CloudEvaluate[Put[Uncompress[c], "markovClassifier.m"]]
 ]

Quiet[CloudDeploy[
  FormFunction[{"text" -> "String"}, 
   Get["markovClassifier.m"][#text] &], "myform"]]

However, I am looking into why the last line throws a message if you don't use Quiet on it.

POSTED BY: Hernan Moraldo

I am still lost. Here is my code:

CloudDeploy[
 APIFunction[{"textSample" -> "String", "urlTarget" -> "String"},
  Module[{url = #urlTarget, sample = #textSample,
     excludeList, urlList, texts, c},
    Which[
     url == "", "Empty URL Parameter",
     sample == "", "Empty Text Sample Parameter",
     True,
     excludeList = {"cart", "checkout", "thank-you", "login"};
     (* drop links that don't contain the target URL, or that contain
        any excluded word; keep the 20 most common of the rest *)
     urlList = Commonest[
       DeleteCases[Import[url, {"HTML", "Hyperlinks"}],
        link_ /; StringFreeQ[link, url] || ! StringFreeQ[link, excludeList]],
       20];
     texts = Import /@ urlList;
     c = Classify[texts -> urlList];
     c[sample, "Probabilities"]]] &,
  "JSON"]]

I don't know if this is the right way to accomplish my goal.

I want this to be an API where I can pass info to it via parameter and receive a "JSON" or "TEXT" response that includes probability scores.

I want to exclude links that don't contain the urlTarget; this would exclude outbound links, etc. I also want to exclude URLs that contain any word from a list of excluded words. And I just want to classify the most commonly appearing links.

I am a little concerned that it will run a full classification every time we hit the API. Is there a way to save or cache the trained model so that repeated calls with the same URL parameter reuse it?

POSTED BY: David Johnston

You can store data (for example the classifier), for further use on the cloud. For example:

In[1]:= CloudEvaluate[Put[1356, "bla.m"]]

In[2]:= CloudEvaluate[Get["bla.m"]]

Out[2]= 1356

then I can do:

In[3]:= CloudDeploy[APIFunction[{}, Get["bla.m"] &]]

Out[3]= CloudObject["https://www.wolframcloud.com/objects/ba83f78f-2f3f-4ea7-8987-c3c296ae4236"]

and now https://www.wolframcloud.com/objects/ba83f78f-2f3f-4ea7-8987-c3c296ae4236 shows the stored value.

Regarding the earlier n-grams example, a quick (completely unoptimized) way of implementing it is:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

getNGrams[text_] := ToLowerCase@With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 5], 1]
  ]

texts = <|# -> Import[#] & /@ urls|>;
ngrams = getNGrams[#] & /@ texts;
allNGrams = Union @@ Values[ngrams];

getTextFeatures[text_] := With[{counts = Counts[getNGrams[text]]},
  Lookup[counts, #, 0] & /@ allNGrams
  ]

textFeatures = getTextFeatures /@ texts;

(* we already have 37687 n-grams, which is possibly more than we need and takes a while to classify *)
c = Classify[textFeatures, Method -> "NearestNeighbors"]

In[65]:= c[
 textFeatures@
  "Heisenberg recollected a conversation with another person"]

Out[65]= "http://en.wikipedia.org/wiki/Paul_Dirac"

In[66]:= c[
 textFeatures@
  "Heisenberg recollected a conversation among young participants"]

Out[66]= "http://en.wikipedia.org/wiki/Paul_Dirac"
POSTED BY: Hernan Moraldo

I was unable to get the first example to work. I love the simplicity of the second example, but I need more fine-grained control. I got it working just fine, but the accuracy of the results was lacking because the pages don't have enough text. I actually want to filter the phrase lists and also expand them with dictionary and synonym functions.

For some reason, I also can't figure out how to filter the list of links before importing them. I just want links that contain the original target domain. Actually, I would like to follow the links three levels deep, build one large list of links, filter it, order it by how often each link appears, and then import the top X of them and run the n-gram/Classify step on those.
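
Roughly, this is the kind of thing I have in mind (an untested sketch; getLinks is just a helper name I made up):

(* collect links up to 3 levels deep, keep only on-domain links that
   avoid the excluded words, then classify the most common pages *)
urlTarget = "http://www.businesstexter.com";
excludedWords = {"cart", "my-account", "pricing"};

getLinks[url_] := Quiet[Select[
    Replace[Import[url, {"HTML", "Hyperlinks"}], Except[_List] -> {}],
    StringFreeQ[#, excludedWords] && ! StringFreeQ[#, urlTarget] &]]

allLinks = getLinks[urlTarget];
Do[allLinks = 
  Join[allLinks, Flatten[getLinks /@ DeleteDuplicates[allLinks]]], {2}]  (* levels 2 and 3 *)

topLinks = Commonest[allLinks, 10];  (* the 10 most frequently appearing links *)
texts = Import /@ topLinks;
c = Classify[texts -> topLinks]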

POSTED BY: David Johnston

This was amazingly helpful. Thank you very very much!

POSTED BY: David Johnston

You could use something like this to find the n-grams:

getNGrams[text_] := With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 5], 1]
  ]

And then you could use Classify using those n-grams as features.

However, this is really not necessary. Classify will classify texts using Markov models, which is pretty much the same as using the n-grams the way you wanted to:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

texts = Import /@ urls;

c = Classify[texts -> urls]

Now I test it using a short text taken from the Dirac page:

In[21]:= c["Heisenberg recollected a conversation among young participants"]

Out[21]= "http://en.wikipedia.org/wiki/Paul_Dirac"
POSTED BY: Hernan Moraldo