Simple Key Phrase Extraction For URL Classification

POSTED BY: David Johnston
16 Replies
POSTED BY: Hernan Moraldo

That was an awesome tip. Thanks! Here is what I built from your snippet.

urlTarget = "http://www.businesstexter.com"
excludedDomains = {"cart", "my-account", "pricing"};
includedDomains = {urlTarget};

Commonest[
 Select[Select[Import[urlTarget, {"HTML", "Hyperlinks"}], 
   StringFreeQ[#, excludedDomains] &], ! 
    StringFreeQ[#, includedDomains] &], 20]
POSTED BY: David Johnston

Ah, yes, you can also use StringMatchQ. I didn't mention it because I thought you'd want to filter URLs by a substring (e.g. their domains).

Thanks

POSTED BY: Hernan Moraldo

OMG, I feel like such an idiot. The answer is StringMatchQ instead of StringFreeQ. lol

bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7", 
   "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};
Select[bigList, ! StringMatchQ[#, excludedElements] &]
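
For plain literal strings, the same whole-item exclusion can also be written without string patterns. This is a minimal alternative sketch, not the approach used in the thread; it reuses the `bigList` and `excludedElements` values from the snippet above.

```wolfram
bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7",
   "item8", "item9", "i5"};
excludedElements = {"item1", "item2", "item5", "item6", "i"};

(* Keep an item only if it is not literally one of the excluded elements *)
kept = Select[bigList, ! MemberQ[excludedElements, #] &]
(* {"a", "item3", "item4", "item7", "item8", "item9", "i5"} *)

(* The same filter, expressed as a pattern of alternatives *)
keptAlt = DeleteCases[bigList, Alternatives @@ excludedElements]
```

Note that "i5" survives even though "i" is excluded, because only whole items are compared.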
POSTED BY: David Johnston
POSTED BY: Hernan Moraldo

lol Is "markovClassifier.m" just a name you give to be able to retrieve it again later or is it a specific function?

It is a file name that is used to retrieve the data later.
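
As a rough illustration of "a file name that is used to retrieve the data later", here is a small round trip with `Put` and `Get`. The `model` association is a hypothetical stand-in for whatever was actually saved, and the temporary-directory path is an assumption made so the sketch does not write into the working directory.

```wolfram
(* The file name is only a handle; any name works as long as Put and Get agree *)
file = FileNameJoin[{$TemporaryDirectory, "markovClassifier.m"}];

(* Hypothetical stand-in for the expression being saved *)
model = <|"classes" -> {"pricing", "blog"}, "version" -> 1|>;

Put[model, file];      (* write the expression to disk *)
restored = Get[file];  (* read it back, possibly in a later session *)

restored === model
```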

POSTED BY: Hernan Moraldo

I ran into a problem trying to use this for another purpose. It excludes a list item if any of the exclude strings appears anywhere inside it. It does not match whole list items. For partial domain matching this works great. However, for just excluding a list of stopwords, it does not work the way I need it to.

Example:

Excluding the letter "a" would exclude every word and phrase that contains the letter "a".

How would you rewrite this to require an exact match?

In[345]:= 
bigList = {"a", "item1", "item2", "item3", "item4", "item6", "item7", 
   "item8", "item9"};
excludedElements = {"item1", "item2", "item5", "item6"};
Select[bigList, StringFreeQ[#, excludedElements] &]

Out[347]= {"a", "item3", "item4", "item7", "item8", "item9"}
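
The stopword problem described above can be reproduced in a few lines. The `words` and `stopwords` lists are made up for illustration only.

```wolfram
words = {"a", "cat", "dog", "banana", "kiwi"};
stopwords = {"a"};

(* StringFreeQ tests substrings, so every word containing "a" is dropped *)
filtered = Select[words, StringFreeQ[#, stopwords] &]
(* {"dog", "kiwi"} *)
```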
POSTED BY: David Johnston
POSTED BY: David Johnston
POSTED BY: Hernan Moraldo
POSTED BY: Hernan Moraldo

For filtering the links, you can use something like this:

urls = Import[
  "http://en.wikipedia.org/wiki/Albert_Einstein", {"HTML", 
   "Hyperlinks"}]

excludedDomains = {"http://commons.wikimedia.org/", "http://wiki.ubuntu.com"};

Select[urls, StringFreeQ[#, excludedDomains] &]

To keep only included domains instead (with includedDomains defined as a list, like excludedDomains), you'd do:

Select[urls, !StringFreeQ[#, includedDomains] &]
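
Because `StringFreeQ` looks at the whole URL string, an included domain can also match inside a path or query string. The sketch below contrasts that with comparing only the host part via `URLParse`. The `urls` list and the `includedDomains` value are invented for the example, so it runs without a network import.

```wolfram
urls = {
   "http://en.wikipedia.org/wiki/Physics",
   "http://commons.wikimedia.org/wiki/Category:Albert_Einstein",
   "http://example.com/?ref=en.wikipedia.org",
   "http://en.wikipedia.org/wiki/Nobel_Prize"};
includedDomains = {"en.wikipedia.org"};

(* Substring test: also keeps the example.com link, whose query mentions the domain *)
bySubstring = Select[urls, ! StringFreeQ[#, includedDomains] &]

(* Host test: keeps a link only if its domain is exactly one of the included ones *)
byHost = Select[urls, MemberQ[includedDomains, URLParse[#, "Domain"]] &]
(* {"http://en.wikipedia.org/wiki/Physics", "http://en.wikipedia.org/wiki/Nobel_Prize"} *)
```

Which of the two is preferable depends on whether partial matches such as "wikipedia" are wanted.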
POSTED BY: Hernan Moraldo

You are awesome! Playing with it now. :)

I didn't see the Put in this code, though. Is there a point in the sequence where it's best used, or should Put be called separately?

POSTED BY: David Johnston
POSTED BY: Hernan Moraldo

I am still lost. Here is my code:

CloudDeploy[
 APIFunction[{"textSample" -> "String", "urlTarget" -> "URL"}, checkURL, "JSON" &];

 checkURL = {If[urlTarget != "",
    checkSample,
    "Empty URL Parameter"]};

 checkSample = {If[textSample != "",
    func,
    "Empty Text Sample Parameter"]};

 func = {
   excludeList = {"cart", "checkout", "thank-you", "login"} ,
   UrlList = Commonest[
     DeleteCases[
      Import[urlTarget, "HyperLinks"],
      {!= urlTarget, excludeList}], 20],
   texts = Import /@ urlList,
   c = Classify[texts -> urlList],
   c[textSample, "Probabilities"]
   };
 ]

I don't know if this is the right way to accomplish my goal.

I want this to be an API where I can pass info to it via parameter and receive a "JSON" or "TEXT" response that includes probability scores.
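
One possible shape for such an API is sketched below, under stated assumptions. `scoreText` and `respond` are hypothetical names, and `scoreText` returns made-up numbers in place of the real classification step. Both parameters are declared as "String" here so that the empty-value checks are handled in ordinary code rather than by the parameter interpreter.

```wolfram
(* Hypothetical stand-in for the real scoring step (train or load a classifier, then query it) *)
scoreText[textSample_String, urlTarget_String] :=
  <|urlTarget <> "/pricing" -> 0.7, urlTarget <> "/blog" -> 0.3|>;

(* Check both parameters, then return an association keyed by strings *)
respond[textSample_String, urlTarget_String] := Which[
   urlTarget === "", <|"error" -> "Empty URL Parameter"|>,
   textSample === "", <|"error" -> "Empty Text Sample Parameter"|>,
   True, <|"probabilities" -> scoreText[textSample, urlTarget]|>];

(* The parameter names become named slots inside the function body *)
api = APIFunction[
   {"textSample" -> "String", "urlTarget" -> "String"},
   respond[#textSample, #urlTarget] &,
   "JSON"];

(* CloudDeploy[api] would publish it; left commented out so the sketch runs offline *)
```

Keeping the logic in named functions, with `CloudDeploy` wrapping only the finished `APIFunction`, avoids putting assignments inside the `CloudDeploy` call.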

I want to exclude links that don't contain the urlTarget. This would exclude outbound links, etc. I also want to exclude URLs that contain any word from a list of exclude words. And I just want to classify the most commonly appearing links.

I am a little concerned it will try to run a full classification on it every time we hit the API. Is there a way to save a trained model or cache it or something if the URL parameter is the same each time?
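
One way to avoid retraining on every call is to store the trained result in a file keyed by the URL and reload it when the same URL comes in again. The sketch below is an assumption-laden illustration: `cacheFile`, `cachedClassifier` and `fakeTrain` are hypothetical names, the cache lives in the local temporary directory, and `fakeTrain` stands in for the real `Classify` step. In a cloud deployment the storage would presumably need to be a cloud object instead (`CloudPut` and `CloudGet` are the cloud counterparts of `Put` and `Get`); that variant is not shown.

```wolfram
(* One cache file per target URL, named after a hash of the URL *)
cacheFile[urlTarget_String] := FileNameJoin[{$TemporaryDirectory,
    "classifier-" <> IntegerString[Hash[urlTarget], 16] <> ".m"}];

(* Load the saved model if it exists; otherwise train once and save it *)
cachedClassifier[urlTarget_String, train_] :=
  With[{file = cacheFile[urlTarget]},
   If[FileExistsQ[file],
    Get[file],
    With[{model = train[urlTarget]}, Put[model, file]; model]]];

(* Hypothetical stand-in trainer that counts how often it is called *)
calls = 0;
fakeTrain[url_String] := (calls++; <|"trainedOn" -> url|>);

(* Start from a clean cache, then ask for the same URL twice *)
If[FileExistsQ[cacheFile["http://example.com"]],
  DeleteFile[cacheFile["http://example.com"]]];
first = cachedClassifier["http://example.com", fakeTrain];
second = cachedClassifier["http://example.com", fakeTrain];

{calls, first === second}
```

The second request reads the file instead of calling the trainer, so `calls` stays at 1.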

POSTED BY: David Johnston
POSTED BY: David Johnston

This was amazingly helpful. Thank you very very much!

POSTED BY: David Johnston