Simple Key Phrase Extraction For URL Classification

POSTED BY: David Johnston
16 Replies
POSTED BY: David Johnston

lol Is "markovClassifier.m" just a name you give to be able to retrieve it again later or is it a specific function?

It is a file name that is used to retrieve the data later.

POSTED BY: Hernan Moraldo

For filtering the links, you can use something like this:

urls = Import[
  "http://en.wikipedia.org/wiki/Albert_Einstein", {"HTML", 
   "Hyperlinks"}]

excludedDomains = {"http://commons.wikimedia.org/", "http://wiki.ubuntu.com"};

Select[urls, StringFreeQ[#, excludedDomains] &]

To keep only links from a list of included domains instead, you'd do:

Select[urls, !StringFreeQ[#, includedDomains] &]
POSTED BY: Hernan Moraldo

That was an awesome tip. Thanks! Here is what I built from your snippet.

urlTarget = "http://www.businesstexter.com"
excludedDomains = {"cart", "my-account", "pricing"};
includedDomains = {urlTarget};

Commonest[
 Select[Select[Import[urlTarget, {"HTML", "Hyperlinks"}], 
   StringFreeQ[#, excludedDomains] &], ! 
    StringFreeQ[#, includedDomains] &], 20]
POSTED BY: David Johnston

I ran into a problem trying to use this for another purpose. StringFreeQ excludes a list item if any excluded string occurs anywhere inside it; it does not match whole list items. For partial domain matching this works great, but for excluding a list of stopwords it does not work the way I need.

Example:

Excluding the letter "a" would exclude every word and phrase that contains an "a".

How would you rewrite this so that it only excludes exact matches?

In[345]:= 
bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7", 
   "item8", "item9"};
excludedElements = {"item1", "item2", "item5", "item6"};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[347]= {"a", "item3", "item4", "item7", "item8", "item9"}
POSTED BY: David Johnston

You can use more complex patterns to specify where the string you want to exclude has to appear. For example, if I just say "i", this code will exclude all elements that have an "i" within them:

In[7]:= bigList = {"a", "item1", "item2", "item3", "item4", "item6", 
   "item7", "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[9]= {"a"}

I can instead specify that elements need to be excluded only if the "i" is surrounded by word boundaries (like symbols, spaces, etc., but not including digits):

In[22]:= bigList = {"a", "item1", "item2", "item3", "item4", "item6", 
   "item7", "item8", "item9", "i.5"};
excludedElements = {"item1", "item2", "item5", "item6", 
   WordBoundary ~~ "i" ~~ WordBoundary};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[24]= {"a", "item3", "item4", "item7", "item8", "item9"}

(Notice that the new pattern gets rid of "i.5" but not of "item3".)

String patterns like these are very useful for this kind of thing; see http://reference.wolfram.com/language/tutorial/WorkingWithStringPatterns.html for some examples and http://reference.wolfram.com/language/tutorial/RegularExpressions.html in case you want to use regular expressions as well (although string patterns are usually enough).
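
For the stopword case, where a whole list item has to equal an excluded string, one option is to skip string patterns altogether and test list membership. This is only a sketch: it reuses bigList and excludedElements from the question, with "a" added to the exclusions as an assumed stopword.

```wolfram
bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7",
   "item8", "item9"};
(* "a" is added here as an assumed stopword, to show it no longer
   knocks out every element containing the letter a *)
excludedElements = {"item1", "item2", "item5", "item6", "a"};

(* keep only the elements that are not identical to an excluded string *)
exactSelect = Select[bigList, ! MemberQ[excludedElements, #] &]
(* {"item3", "item4", "item7", "item8", "item9"} *)

(* the same filter written with pattern alternatives *)
exactDelete = DeleteCases[bigList, Alternatives @@ excludedElements]
(* {"item3", "item4", "item7", "item8", "item9"} *)
```

Both forms compare whole list elements, so "a" only removes the element "a" itself.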

POSTED BY: Hernan Moraldo
POSTED BY: David Johnston
POSTED BY: Hernan Moraldo

You are awesome! Playing with it now. :)

I didn't see the Put in this code though. Is there a point in the sequence where it's best used, or should the Put be done separately?

POSTED BY: David Johnston
POSTED BY: Hernan Moraldo
POSTED BY: Hernan Moraldo
POSTED BY: David Johnston

You can store data (for example, the classifier) in the cloud for later use. For example:

In[1]:= CloudEvaluate[Put[1356, "bla.m"]]

In[2]:= CloudEvaluate[Get["bla.m"]]

Out[2]= 1356

then I can do:

In[3]:= CloudDeploy[APIFunction[{}, Get["bla.m"] &]]

Out[3]= CloudObject["https://www.wolframcloud.com/objects/ba83f78f-2f3f-4ea7-8987-c3c296ae4236"]

and now https://www.wolframcloud.com/objects/ba83f78f-2f3f-4ea7-8987-c3c296ae4236 shows the stored value.
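
Put and Get work the same way on a local file, which can be handy for checking what gets stored before moving it to the cloud. A minimal sketch, assuming a made-up file name (storedValue.m) in the temporary directory and a toy association as the data:

```wolfram
(* hypothetical file location, chosen only for this example *)
file = FileNameJoin[{$TemporaryDirectory, "storedValue.m"}];

(* toy data standing in for whatever you want to keep *)
data = <|"label" -> "example", "values" -> {1, 2, 3}|>;

Put[data, file];       (* write the expression to the file *)
restored = Get[file];  (* read it back and evaluate it *)

restored === data
(* True *)
```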

Regarding the first n-gram example, a quick way of implementing it (not optimized at all) is:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

getNGrams[text_] := ToLowerCase@With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 5], 1]
  ]

texts = <|# -> Import[#] & /@ urls|>;
ngrams = getNGrams[#] & /@ texts;
allNGrams = Union @@ Values[ngrams];

getTextFeatures[text_] := With[{counts = Counts[getNGrams[text]]},
  Lookup[counts, #, 0] & /@ allNGrams
  ]

textFeatures = getTextFeatures /@ texts;

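(*we already have 37687 ngrams, which is possibly more than we need and takes a while to classify*)
(*pass the feature vectors as examples and the URLs as their classes*)
c = Classify[Values[textFeatures] -> Keys[textFeatures], Method -> "NearestNeighbors"]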

In[65]:= c[
 getTextFeatures@
  "Heisenberg recollected a conversation with another person"]

Out[65]= "http://en.wikipedia.org/wiki/Paul_Dirac"

In[66]:= c[
 getTextFeatures@
  "Heisenberg recollected a conversation among young participants"]

Out[66]= "http://en.wikipedia.org/wiki/Paul_Dirac"
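
To see what getTextFeatures produces, here is the same Counts and Lookup step on a toy vocabulary. The three-entry vocabulary and the sample n-grams are made up for illustration; Lookup is given the whole key list at once, which should be equivalent to mapping over allNGrams as above.

```wolfram
(* made-up vocabulary playing the role of allNGrams *)
vocabulary = {"a b", "b c", "c d"};

(* made-up n-grams, as getNGrams would return them for one text *)
sampleNGrams = {"a b", "b c", "a b"};

(* count each n-gram, then read the counts off in vocabulary order,
   using 0 for n-grams that do not occur in this text *)
counts = Counts[sampleNGrams];
features = Lookup[counts, vocabulary, 0]
(* {2, 1, 0} *)
```

Every text gets a vector of the same length, one entry per n-gram in the vocabulary, which is why the vectors grow so large with real pages.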
POSTED BY: Hernan Moraldo
POSTED BY: David Johnston

This was amazingly helpful. Thank you very very much!

POSTED BY: David Johnston

You could use something like this to find the n-grams:

getNGrams[text_] := With[
  {words = StringSplit[text]},
  StringJoin[Riffle[#, " "]] & /@ 
   Flatten[Partition[words, #, #, 1] & /@ Range[2, 5], 1]
  ]
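
Note that Partition[words, n, n, 1] uses an offset equal to the block length, so consecutive blocks do not overlap. If overlapping (sliding-window) n-grams are wanted instead, a sketch could look like the following. The name slidingNGrams and the default sizes {2, 3} are assumptions made for illustration, and StringRiffle needs a reasonably recent version.

```wolfram
(* hypothetical helper: overlapping word n-grams for each size in sizes *)
slidingNGrams[text_String, sizes_List : {2, 3}] :=
 With[{words = StringSplit[ToLowerCase[text]]},
  (* Partition[words, n, 1] slides a window of n words one word at a time *)
  StringRiffle /@ Flatten[Partition[words, #, 1] & /@ sizes, 1]
  ]

grams = slidingNGrams["The quick brown fox"]
(* {"the quick", "quick brown", "brown fox",
    "the quick brown", "quick brown fox"} *)
```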

And then you could use Classify using those n-grams as features.

However, this is not really necessary. Classify will classify texts using Markov models, which is pretty much the same as using the n-grams the way you wanted to:

urls = {"http://en.wikipedia.org/wiki/Albert_Einstein", 
   "http://en.wikipedia.org/wiki/Niels_Bohr", 
   "http://en.wikipedia.org/wiki/Michael_Faraday", 
   "http://en.wikipedia.org/wiki/Stephen_Hawking", 
   "http://en.wikipedia.org/wiki/Paul_Dirac"};

texts = Import /@ urls;

c = Classify[texts -> urls]

Now I test it using a short text taken from the Dirac page:

In[21]:= c["Heisenberg recollected a conversation among young participants"]

Out[21]= "http://en.wikipedia.org/wiki/Paul_Dirac"
POSTED BY: Hernan Moraldo