Simple Key Phrase Extraction For URL Classification

POSTED BY: David Johnston

Okay, here is what I've got so far. Not working, of course. Is "markovClassifier.m" just a name you give it so you can retrieve it again later, or is it a specific function? Basically, I am trying to test whether there is a classification already. I should probably name it the same as the urlTarget, so that if I change that URL parameter it does a fresh classification, but if I use the same URL again it uses the saved version.

CloudDeploy[
 APIFunction[{"textSample" -> "String", "urlTarget" -> "String"},
  Module[{url = #urlTarget, sample = #textSample,
     excludedDomains, linkList, texts, c},
    Which[
     url == "", "Empty URL Parameter",
     sample == "", "Empty Text Sample Parameter",
     True,
     If[CloudEvaluate[FileExistsQ["markovClassifier.m"]],
      (* a classifier was stored earlier, so reuse it *)
      c = CloudEvaluate[Get["markovClassifier.m"]],
      (* otherwise train a new one and store it *)
      excludedDomains = {"cart", "my-account", "pricing"};
      linkList = Commonest[
        Select[Import[url, {"HTML", "Hyperlinks"}],
         StringFreeQ[#, excludedDomains] && ! StringFreeQ[#, url] &],
        20];
      texts = Import /@ linkList;
      c = Classify[texts -> linkList];
      With[{cc = Compress[c]},
       CloudEvaluate[Put[Uncompress[cc], "markovClassifier.m"]]];
      Quiet[CloudDeploy[
        FormFunction[{"text" -> "String"},
         Get["markovClassifier.m"][#text] &], "myform"]]];
     c[sample, "Probabilities"]]] &,
  "JSON"]]
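
For the caching-by-URL idea, maybe something like this would work, deriving the stored file name from the urlTarget so that a new URL triggers a fresh classification (untested; cacheFile is just a helper name I made up):

cacheFile[url_String] := 
 "classifier-" <> IntegerString[Hash[url], 16] <> ".m"

With[{file = cacheFile[urlTarget]},
 If[CloudEvaluate[FileExistsQ[file]],
  c = CloudEvaluate[Get[file]],     (* reuse the stored classifier *)
  c = Classify[texts -> linkList];  (* otherwise train a new one... *)
  With[{cc = Compress[c]},
   CloudEvaluate[Put[Uncompress[cc], file]]]]]  (* ...and store it under the URL-specific name *)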
POSTED BY: David Johnston

Is "markovClassifier.m" just a name you give it so you can retrieve it again later, or is it a specific function?

It is a file name that is used to retrieve the data later.

POSTED BY: Hernan Moraldo

For filtering the links, you can use something like this:

urls = Import[
  "http://en.wikipedia.org/wiki/Albert_Einstein", {"HTML", 
   "Hyperlinks"}]

excludedDomains = {"http://commons.wikimedia.org/", "http://wiki.ubuntu.com"};

Select[urls, StringFreeQ[#, excludedDomains] &]

To keep only links that contain certain domains instead, you'd do:

includedDomains = {"http://en.wikipedia.org"};

Select[urls, ! StringFreeQ[#, includedDomains] &]
POSTED BY: Hernan Moraldo

That was an awesome tip. Thanks! Here is what I built from your snippet.

urlTarget = "http://www.businesstexter.com"
excludedDomains = {"cart", "my-account", "pricing"};
includedDomains = {urlTarget};

Commonest[
 Select[Select[Import[urlTarget, {"HTML", "Hyperlinks"}], 
   StringFreeQ[#, excludedDomains] &], ! 
    StringFreeQ[#, includedDomains] &], 20]
POSTED BY: David Johnston

You can use more complex patterns to control where the strings you want to exclude have to appear. For example, if I just say "i", this code will exclude all elements that have an "i" anywhere within them:

In[7]:= bigList = {"a", "item1", "item2", "item3", "item4", "item6", 
   "item7", "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[9]= {"a"}

I can instead specify that elements need to be excluded only if the "i" is surrounded by word boundaries (like symbols, spaces, etc., but not including digits):

In[22]:= bigList = {"a", "item1", "item2", "item3", "item4", "item6", 
   "item7", "item8", "item9", "i.5"};
excludedElements = {"item1", "item2", "item5", "item6", 
   WordBoundary ~~ "i" ~~ WordBoundary};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[24]= {"a", "item3", "item4", "item7", "item8", "item9"}

(notice the new pattern helps get rid of "i.5" but not of "item3").

Such string patterns are very useful for this kind of thing; see http://reference.wolfram.com/language/tutorial/WorkingWithStringPatterns.html for some examples, and http://reference.wolfram.com/language/tutorial/RegularExpressions.html in case you want to use regular expressions as well (although string patterns are usually enough).
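
For instance, the word-boundary pattern above could equivalently be written with a regular expression; PCRE's \b, like WordBoundary here, treats letters and digits as word characters, so this should give the same result on this list:

excludedElements = {"item1", "item2", "item5", "item6", 
   RegularExpression["\\bi\\b"]};  (* \b is a regex word boundary *)
Select[bigList, StringFreeQ[#, excludedElements] &]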

POSTED BY: Hernan Moraldo

OMG, I feel like such an idiot. The answer is StringMatchQ instead of StringFreeQ.

bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7", 
   "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};
Select[bigList, ! StringMatchQ[#, excludedElements] &]
POSTED BY: David Johnston

Ah, yes, you can also use StringMatchQ. I didn't mention it because I thought you'd want to filter URLs by a substring (e.g., their domains).
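
To make the difference concrete (example.com is just a made-up URL): StringMatchQ tests the whole string, while StringFreeQ looks for substrings:

urls = {"http://example.com/cart", "http://example.com/about", "cart"};

Select[urls, ! StringMatchQ[#, "cart"] &]  (* whole-string test: drops only the bare "cart" *)

Select[urls, StringFreeQ[#, "cart"] &]     (* substring test: also drops the /cart URL *)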

Thanks

POSTED BY: Hernan Moraldo

You are awesome! Playing with it now. :)

I didn't see the Put in this code, though. Is there a particular point in the sequence where it's best used, or should it be run separately?

POSTED BY: David Johnston

Thanks!

In the code above, it is much faster if you limit the size of the n-grams:

getNGrams[text_] := ToLowerCase@With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 3], 1]
  ]
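
For example, this version should give the 2-grams and 3-grams of a text (the last n-gram wraps around cyclically because of the 1 in Partition):

getNGrams["The quick brown fox"]

(* {"the quick", "brown fox", "the quick brown", "fox the quick"} *)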

If you use Put inside CloudEvaluate, it will store the result of an expression in a file, for example:

CloudEvaluate[Put[N[1/50], "data.m"]]

CloudDeploy[FormFunction[{"text" -> "String"}, Get["data.m"] &], "myform"]

I am checking with the people who built CloudDeploy to make sure this would actually be the best approach for storing the classifier, though.

POSTED BY: Hernan Moraldo

Regarding how to use it with CloudDeploy, you can do something like:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

texts = Import /@ urls;

c = Classify[texts -> urls]

With[{c = Compress[c]},
 CloudEvaluate[Put[Uncompress[c], "markovClassifier.m"]]
 ]

Quiet[CloudDeploy[
  FormFunction[{"text" -> "String"}, 
   Get["markovClassifier.m"][#text] &], "myform"]]

However, I am looking into why the last line throws a message if you don't use Quiet on it.

POSTED BY: Hernan Moraldo

I am still lost. Here is my code:

CloudDeploy[
 APIFunction[{"textSample" -> "String", "urlTarget" -> "String"},
  Module[{url = #urlTarget, sample = #textSample,
     excludeList, urlList, texts, c},
    Which[
     url == "", "Empty URL Parameter",
     sample == "", "Empty Text Sample Parameter",
     True,
     excludeList = {"cart", "checkout", "thank-you", "login"};
     (* drop links that don't contain the target URL, or that contain
        any excluded word; keep the 20 most common of the rest *)
     urlList = Commonest[
       DeleteCases[Import[url, {"HTML", "Hyperlinks"}],
        link_ /; StringFreeQ[link, url] || ! StringFreeQ[link, excludeList]],
       20];
     texts = Import /@ urlList;
     c = Classify[texts -> urlList];
     c[sample, "Probabilities"]]] &,
  "JSON"]]

I don't know if this is the right way to accomplish my goal.

I want this to be an API where I can pass info to it via parameter and receive a "JSON" or "TEXT" response that includes probability scores.

I want to exclude links that don't contain the urlTarget; this would exclude outbound links, etc. I also want to exclude URLs that contain any word from a list of excluded words. And I just want to classify the most commonly appearing links.

I am a little concerned that it will run a full classification every time we hit the API. Is there a way to save or cache the trained model so that repeated calls with the same URL parameter reuse it?

POSTED BY: David Johnston

You can store data (for example the classifier), for further use on the cloud. For example:

In[1]:= CloudEvaluate[Put[1356, "bla.m"]]

In[2]:= CloudEvaluate[Get["bla.m"]]

Out[2]= 1356

then I can do:

In[3]:= CloudDeploy[APIFunction[{}, Get["bla.m"] &]]

Out[3]= CloudObject["https://www.wolframcloud.com/objects/ba83f78f-2f3f-4ea7-8987-c3c296ae4236"]

and now https://www.wolframcloud.com/objects/ba83f78f-2f3f-4ea7-8987-c3c296ae4236 shows the stored value.

Regarding the earlier n-grams example, a quick (completely unoptimized) way of implementing it is:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

getNGrams[text_] := ToLowerCase@With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 5], 1]
  ]

texts = <|# -> Import[#] & /@ urls|>;
ngrams = getNGrams[#] & /@ texts;
allNGrams = Union @@ Values[ngrams];

getTextFeatures[text_] := With[{counts = Counts[getNGrams[text]]},
  Lookup[counts, #, 0] & /@ allNGrams
  ]

textFeatures = getTextFeatures /@ texts;

(* we already have 37687 n-grams, which is possibly more than we need and takes a while to classify *)
c = Classify[textFeatures, Method -> "NearestNeighbors"]

In[65]:= c[
 textFeatures@
  "Heisenberg recollected a conversation with another person"]

Out[65]= "http://en.wikipedia.org/wiki/Paul_Dirac"

In[66]:= c[
 textFeatures@
  "Heisenberg recollected a conversation among young participants"]

Out[66]= "http://en.wikipedia.org/wiki/Paul_Dirac"
POSTED BY: Hernan Moraldo

I was unable to get the first example to work. I love the simplicity of the second example, but I need more fine-grained control. I got it working just fine, but the accuracy of the results was lacking because the pages don't have enough text. I actually want to filter the phrase lists and also expand them with dictionary and synonym functions.

For some reason, I also can't figure out how to filter the list of links before importing them. I just want links that contain the original target domain. Actually, I would like to follow the links three levels deep, build one large list of links, filter it, order it by how often each link appears, and then import the top X of them and run the n-gram/Classify step on those.
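
Roughly, this is the kind of thing I have in mind (an untested sketch; getLinks is just a helper name I made up):

(* collect links up to 3 levels deep, keep only on-domain links that
   avoid the excluded words, then classify the most common pages *)
urlTarget = "http://www.businesstexter.com";
excludedWords = {"cart", "my-account", "pricing"};

getLinks[url_] := Quiet[Select[
    Replace[Import[url, {"HTML", "Hyperlinks"}], Except[_List] -> {}],
    StringFreeQ[#, excludedWords] && ! StringFreeQ[#, urlTarget] &]]

allLinks = getLinks[urlTarget];
Do[allLinks = 
  Join[allLinks, Flatten[getLinks /@ DeleteDuplicates[allLinks]]], {2}]  (* levels 2 and 3 *)

topLinks = Commonest[allLinks, 10];  (* the 10 most frequently appearing links *)
texts = Import /@ topLinks;
c = Classify[texts -> topLinks]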

POSTED BY: David Johnston

This was amazingly helpful. Thank you very very much!

POSTED BY: David Johnston

You could use something like this to find the n-grams:

getNGrams[text_] := With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 5], 1]
  ]

And then you could use Classify using those n-grams as features.

However, this is really not necessary. Classify will classify texts using Markov models, which is pretty much the same as using the n-grams the way you wanted to:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

texts = Import /@ urls;

c = Classify[texts -> urls]

Now I test it using a short text taken from the Dirac page:

In[21]:= c["Heisenberg recollected a conversation among young participants"]

Out[21]= "http://en.wikipedia.org/wiki/Paul_Dirac"
POSTED BY: Hernan Moraldo