Need help with Market Sentiment Analysis

Posted 2 years ago

Dear Wolfram Users, I am new to Mathematica and I wanted to replicate this study in order to learn how to do sentiment analysis on website data.

https://community.wolfram.com/groups/-/m/t/970999

However, the code does not seem to work. Maybe it was written for an older version of Mathematica?

For example:

archive = 
  Import[StringJoin["http://www.wsj.com/public/page/archive-", 
    DateString[
     datelist[[1]], {"Year", "-", "MonthShort", "-", "DayShort"}], 
    ".html"]];
archive = 
  StringDrop[archive, 
   StringPosition[archive, 
     DateString[
      datelist[[1]], {"MonthName", " ", "DayShort", ", ", 
       "Year"}]][[1, 2]]];
archive = 
  StringTake[
   archive, -1 + StringPosition[archive, "ARCHIVE FILTER"][[1, 1]]];
archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
TakeLargest[Counts[archivewords], 20]
WordCloud[archivewords]

This code should download news from the WSJ archive and format it for machine learning purposes, but somehow it is not working. Can somebody help me replicate it in a way that works today? I also think one of the problems is that this code can no longer switch between archive dates. Or maybe there is a newer version of the article?

Thank you very much.

26 Replies
Posted 2 years ago

Roman,

It looks like the URL for the archive, as well as the page format, has changed. Try this:

datelist = {DateObject[{2012, 1, 3}]};

archive = 
 Import[StringJoin["https://www.wsj.com/news/archive/", 
   DateString[datelist[[1]], {"Year", "Month", "Day"}]]];

archive = 
 StringDrop[archive, 
  StringPosition[archive, 
    DateString[
     datelist[[1]], {"MonthNameShort", " ", "DayShort", " ", 
      "Year"}]][[1, 2]]];

archive = 
 StringTake[
  archive, -1 + 
   StringPosition[archive, "Most Popular Articles"][[1, 1]]];

archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];

TakeLargest[Counts[archivewords], 20]

<|"new" -> 34, "2012" -> 33, "markets" -> 27, "business" -> 24,
"year" -> 22, "u.s." -> 21, "asia" -> 18, "opinion" -> 16, "2011" -> 15, "said" -> 12, "$" -> 11, "york" -> 10, "economic" -> 10, "iowa" -> 10, "news" -> 10, "economy" -> 10, "biggest" -> 10, "day" -> 9, "data" -> 9, "december" -> 9|>

WordCloud[archivewords]

[Image: word cloud of archivewords]

Thank you very much for your help, your code works perfectly!

If you have time, could you please explain how you figured out which format the date should be written in?

Also, how exactly does this line work?

archive = 
 StringTake[
  archive, -1 + 
   StringPosition[archive, "Most Popular Articles"][[1, 1]]]

Does it somehow take us to the "Most Popular Articles" section and get the titles from there?

Posted 2 years ago

"Could you please explain how you figured out which format the date should be written in?"

Nothing fancy. Did a Google search for "wall street journal archives" and this showed up in the results. So the date format in the URL is YYYYMMDD which is {"Year", "Month", "Day"} in WL.

The page body has some text unrelated to the archive followed by "News Archive for Jun 25 2019". So the actual archive data follows a date matching {"MonthNameShort", " ", "DayShort", " ", "Year"}.

Archive data is followed by "Most Popular Articles" and other text unrelated to the archive. So the actual archive data is between those two strings. The code extracts that portion from the page body.
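
The extraction described above can be tried offline on a toy string. This is only a sketch: the page text below is made up, and it uses the heading format quoted in this reply. The separators in the format list must match the page text character for character, otherwise StringPosition returns an empty list and the [[1, 2]] part extraction fails.

```mathematica
(* Hypothetical stand-in for the text returned by Import *)
page = "Site navigation text. News Archive for Jan 3 2012 Headline one. Headline two. Most Popular Articles Footer text.";
date = DateObject[{2012, 1, 3}];

(* URL suffix in YYYYMMDD form *)
urlDate = DateString[date, {"Year", "Month", "Day"}]   (* "20120103" *)

(* Heading text as it appears in the (made-up) page body *)
heading = DateString[date, {"MonthNameShort", " ", "DayShort", " ", "Year"}]   (* "Jan 3 2012" *)

(* StringPosition returns {{start, end}, ...}; [[1, 2]] is the end of the first match, *)
(* so StringDrop removes everything up to and including the heading *)
body = StringDrop[page, StringPosition[page, heading][[1, 2]]];

(* [[1, 1]] is the start of the end marker; keep everything before it *)
body = StringTake[body, StringPosition[body, "Most Popular Articles"][[1, 1]] - 1];

StringTrim[body]   (* "Headline one. Headline two." *)
```

So the code does not go into the "Most Popular Articles" section; it uses that phrase only as the point at which to cut.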

Hi Roman,

Glad you got it working. Rohit is right, the format changed (which unfortunately happens quite frequently with news sites).

Let me know if you run into other issues.

Jonathan

Thank you very much for your reply Jonathan,

I find your article very interesting and inspiring. Unfortunately, I am not a proficient Wolfram user. However, I am studying finance now, and I would really like to learn how you did this sentiment analysis.

With Rohit's help, I went through the first part of the code, and now I am having some trouble with WSJSentimentIndicator:

WSJSentimentIndicator[date_] := 
 Module[{d = date, archive, archivewords, WSJSI}, 
  archive = 
   Import[StringJoin["http://www.wsj.com/public/page/archive-", 
     DateString[d, {"Year", "-", "MonthShort", "-", "DayShort"}], 
     ".html"]];
  archive = 
   StringDrop[archive, 
    StringPosition[archive, 
      DateString[d, {"MonthName", " ", "DayShort", ", ", "Year"}]][[1,
      2]]];
  archive = 
   StringTake[
    archive, -1 + StringPosition[archive, "ARCHIVE FILTER"][[1, 1]]];
  archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
  WSJSI = #Positive/(#Negative + #Positive) &@
     Counts[Classify["Sentiment", archivewords]] // N;
  {WSJSI, archivewords, archive}]

So, if we update the code, it should look like this:

WSJSentimentIndicator[date_ ] := 
 Module[{d = date, archive, archivewords, WSJSI}, 
archive = 
 Import[StringJoin["https://www.wsj.com/news/archive/", 
   DateString[d, {"Year", "Month", "Day"}]]];
archive = 
 StringDrop[archive, 
  StringPosition[archive, 
    DateString[
     d, {"MonthNameShort", " ", "DayShort", " ", 
      "Year"}]][[1, 2]]];
archive = 
 StringTake[
  archive, -1 + 
   StringPosition[archive, "Most Popular Articles"][[1, 1]]];
archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
 WSJSI = #Positive /(#Negative + #Positive) &@
     Counts[Classify["Sentiment", archivewords]] // N;
  {WSJSI, archivewords, archive}]

However, the code which returns us the histogram does not work for me:

WSJSI = Flatten[First@WSJSentimentIndicator[#]&/@datelist]
Histogram[tsWSJSI, PlotLabel -> Style["Histogram of WSJ Sentiment indicator",Bold]]

If you have time, could you please help me solve this one? I have a feeling that WSJSI does not handle 'datelist' correctly.

Posted 2 years ago

Hi Roman,

My bad. The second DateString format should be {"MonthNameShort", " ", "DayShort", ", ", "Year"}.

For eight days:

datelist = Table[DateObject[{2012, 1, n}], {n, 3, 10}];
WSJSI = Flatten[First@WSJSentimentIndicator[#] & /@ datelist]
tsWSJSI = TimeSeries[Transpose[{datelist, WSJSI}]]
Histogram[tsWSJSI, PlotLabel -> Style["Histogram of WSJ Sentiment indicator", Bold]]

[Image: histogram of the WSJ sentiment indicator]
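
A format mismatch like the one corrected above shows up as a Part error, because StringPosition returns {} when the marker is not found. One possible guard is sketched below; textBetween is a hypothetical helper name, not part of the original code, and the sample text is made up.

```mathematica
ClearAll[textBetween];

(* Return the text between the first occurrence of start and the next occurrence of end, *)
(* or a Missing[...] value when either marker is absent *)
textBetween[text_String, start_String, end_String] :=
 Module[{s = StringPosition[text, start, 1], rest, e},
  If[s === {},
   Missing["StartMarkerNotFound"],
   rest = StringDrop[text, s[[1, 2]]];
   e = StringPosition[rest, end, 1];
   If[e === {}, Missing["EndMarkerNotFound"], StringTake[rest, e[[1, 1]] - 1]]]]

(* Hypothetical page text with a comma in the heading *)
sample = "Intro News Archive for Jan 3, 2012 A. B. Most Popular Articles end";

good = textBetween[sample, "Jan 3, 2012", "Most Popular Articles"]   (* " A. B. " *)
bad = textBetween[sample, "Jan 3 2012", "Most Popular Articles"]     (* Missing["StartMarkerNotFound"] *)
```

Checking for a Missing result makes it easier to tell a wrong date format apart from a page that simply came back incomplete.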

Thank you very much Rohit!

And you didn't change the module code?

Does it look like this for you?

WSJSentimentIndicator[date_ ] := 
 Module[{d = date, archive, archivewords, WSJSI}, 
archive = 
 Import[StringJoin["https://www.wsj.com/news/archive/", 
   DateString[d, {"Year", "Month", "Day"}]]];
archive = 
 StringDrop[archive, 
  StringPosition[archive, 
    DateString[
     d, {"MonthNameShort", " ", "DayShort", ", ", 
      "Year"}]][[1, 2]]];
archive = 
 StringTake[
  archive, -1 + 
   StringPosition[archive, "Most Popular Articles"][[1, 1]]];
archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
 WSJSI = #Positive /(#Negative + #Positive) &@
     Counts[Classify["Sentiment", archivewords]] // N;
  {WSJSI, archivewords, archive}]

Posted 2 years ago

Hi Roman,

Sorry for the late response.

Yes, that is the code I used. I copied it from your post into a notebook, evaluated and tested with

WSJSentimentIndicator[DateObject[{2012, 1, 3}]]

The first time I evaluated, it failed because Classify returned an empty association. Subsequent evaluations worked fine. Not sure why - perhaps it has to retrieve some data from Wolfram servers and that timed out? If it happens frequently, the code can easily be modified to retry the Classify.
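
To see how the ratio behaves when the counts are empty or incomplete, here is a small offline sketch. The labels list is made-up data standing in for the output of Classify, and sentimentRatio is a hypothetical helper name; it looks up the counts with a default of zero instead of relying on the keys being present.

```mathematica
(* Made-up labels standing in for Classify["Sentiment", archivewords] *)
labels = {"Positive", "Neutral", "Negative", "Positive", "Positive"};
counts = Counts[labels];

(* The expression used in the thread: works when both keys are present *)
ratio = #Positive/(#Negative + #Positive) &@counts   (* 3/4 *)

(* Hypothetical guarded version: missing keys count as zero *)
ClearAll[sentimentRatio];
sentimentRatio[c_Association] :=
 With[{p = Lookup[c, "Positive", 0], n = Lookup[c, "Negative", 0]},
  If[p + n == 0, Missing["NoSentimentWords"], N[p/(p + n)]]]

sentimentRatio[counts]   (* 0.75 *)
sentimentRatio[<||>]     (* Missing["NoSentimentWords"] *)
```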

Thank you very much Rohit! I now understand why I have problems with the computation: Mathematica gives me incomplete output every time. For example:

{0.6781, #Positive/(#Negative + #Positive), #Positive/(#Negative + #Positive), 0.7890, 0.6905, #Positive/(#Negative + #Positive), #Positive/(#Negative + #Positive)}

I have to repeat the calculation 5-6 times to get the full output. I wanted to run the analysis on a couple of years of data, but I can barely run it for 8 days (the calculation takes around 5 minutes and does not always give full output). I think the problem is either my PC, even though it has an i7 and 16 GB of RAM, or the code is too heavy. Do you know if it is possible to add a line of code to

WSJSI = Flatten[First@WSJSentimentIndicator[#] & /@ datelist]
tsWSJSI = TimeSeries[Transpose[{datelist, WSJSI}]]
Histogram[tsWSJSI, PlotLabel -> Style["Histogram of WSJ Sentiment indicator", Bold]]

to make it repeat the calculation over and over until it succeeds?

Thank you very much for your time.

Posted 2 years ago

Hi Roman,

I took a closer look at why it fails. It is the Import that occasionally returns partial results. The return value has the preamble text and the header text for the archive, e.g. "News Archive for Jun 25, 2019", followed by blank lines, followed by "Most Popular Articles". The entire archive section is missing. It is not a bug in Import; I can reproduce it with a simple Ruby program.

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("https://www.wsj.com/news/archive/20120105"))
txt = doc.xpath("//*[@id='root']/div/div/div/div[2]/div/div/div[2]/div[1]/div/div").text

puts txt

My guess is that the ads and other asynchronous JavaScript have to run before the archive section is rendered.

To work around this, use a helper function and retry the Import operation until it succeeds. A better way would be to use WebExecute if you are on version 12. You could experiment with that.

As for performance, the Import is quite slow, taking ~15s to complete. With the workaround you can just increase maxRetries and leave it running overnight.

Workaround:

ClearAll[downloadArchiveWords, WSJSentimentIndicator]

downloadArchiveWords[date_] := 
  Module[{end = "Most Popular Articles", url, archive},
   url = StringJoin["https://www.wsj.com/news/archive/", DateString[date, {"Year", "Month", "Day"}]];
   archive = Import[url];
   archive = 
    StringDrop[archive, 
     StringPosition[archive, 
       DateString[date, {"MonthNameShort", " ", "DayShort", ", ", "Year"}]][[1, 2]]];
   archive = StringTake[archive, -1 + StringPosition[archive, end][[1, 1]]];
   {ToLowerCase[DeleteStopwords[TextWords[archive]]], archive}];

WSJSentimentIndicator[date_, maxRetries:_?Positive:5] := 
 Module[{retries, archivewords, archive, WSJSI},
  retries = 0;
  archivewords = "";
  While[Length@archivewords == 0 && retries < maxRetries,
   {archivewords, archive} =  downloadArchiveWords[date]; retries++; Pause[1]];
  WSJSI = #Positive/(#Negative + #Positive) &@Counts[Classify["Sentiment", archivewords]] // N;
  {WSJSI, archivewords, archive}];
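
Even with retries, some dates may still fail on a long run. The following is a small sketch, using made-up values, of separating numeric results from failures before building the time series. The Missing["Failed"] entries simply stand in for whatever non-numeric result a failed date produces, and the variable names are hypothetical.

```mathematica
(* Hypothetical dates and results; non-numeric entries mark failed downloads *)
dates = Table[DateObject[{2012, 1, n}], {n, 3, 7}];
raw = {0.6781, Missing["Failed"], 0.7890, 0.6905, Missing["Failed"]};

(* Keep only the {date, value} pairs whose value is numeric *)
ok = Select[Transpose[{dates, raw}], NumericQ[Last[#]] &];
ts = TimeSeries[ok];

(* Dates that still need to be rerun *)
failedDates = Pick[dates, NumericQ /@ raw, False];

Length /@ {ok, failedDates}   (* {3, 2} *)
```

The failed dates can then be mapped over WSJSentimentIndicator again instead of rerunning the whole list.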

Thank you very much Rohit! I will try running the workaround tonight, and I will check out WebExecute!

Hello Rohit! The code you proposed worked fine! I tried it on the WSJ articles for the last 3 years, and out of 779 values I got output for 710, with 69 Null values, which is fine. The rest I calculated manually afterwards, because the full computation took too much time. I am now going through the last part of the paper, the construction of the trading algorithm. Do you have any idea what tsVTDSPX and tsVTDstrategy stand for?

period = QuantityMagnitude@
   DateDifference[First@datelist, Last@datelist, "Year"];
AnnStd = Sqrt[252]*
   StandardDeviation[
    Transpose@{tsSPXreturns["Values"], strategyreturns}];
cf = {tsVTDSPX[Last@datelist]/1000 - 1, 
   tsVTDstrategy[Last@datelist]/1000 - 1};
CAGR = -1 + (1 + cf)^(1/period);
IR = CAGR/AnnStd;

Print[Style["News Sentiment Strategy", "Subsection"]];
P1 = Style[
   NumberForm[
    TableForm[{CAGR, AnnStd, IR}, 
     TableHeadings -> {{"CAGR", "Ann. StDev.", 
        "IR"}, {Style["SP500 Index", Bold], 
        Style["Strategy", Bold]}}], {6, 2}], FontSize -> 14];
P2 = DateListPlot[{tsVTDSPX, tsVTDstrategy}, Filling -> Axis, 
   PlotLegends -> {"S&P500 Index", "Strategy"}, 
   PlotLabel -> Style["Value of $1,000", Bold], ImageSize -> Medium];
Print[P1];
Print[P2];

In the next computation, Dr. Kinlay defines tsStrategy as tsSPXReturns; as for tsVTDSPX, however, I have no idea.

Posted 2 years ago

Hi Roman,

I am sorry, I have no idea what tsVTDSPX and tsVTDstrategy are. I could not find their definitions in Jonathan's original post. You will have to wait for a response from him.

Rohit

Dear Dr. Kinlay, I will be writing a course paper in international finance soon, and I would like to replicate your market sentiment analysis over a different timeframe and using Twitter/Reddit data as well. May I ask whether I may use your code and reference your article in my paper? Thank you very much for your time. Roman

Roman, Yes sure, no problem.

Dear Dr. Kinlay, I am now replicating the last part of your study. If you have time, could you please explain what the tsVTDSPX and tsVTDstrategy functions are?

period = QuantityMagnitude@
   DateDifference[First@datelist, Last@datelist, "Year"];
AnnStd = Sqrt[252]*
   StandardDeviation[
    Transpose@{tsSPXreturns["Values"], strategyreturns}];
cf = {tsVTDSPX[Last@datelist]/1000 - 1, 
   tsVTDstrategy[Last@datelist]/1000 - 1};
CAGR = -1 + (1 + cf)^(1/period);
IR = CAGR/AnnStd;

Print[Style["News Sentiment Strategy", "Subsection"]];
P1 = Style[
   NumberForm[
    TableForm[{CAGR, AnnStd, IR}, 
     TableHeadings -> {{"CAGR", "Ann. StDev.", 
        "IR"}, {Style["SP500 Index", Bold], 
        Style["Strategy", Bold]}}], {6, 2}], FontSize -> 14];
P2 = DateListPlot[{tsVTDSPX, tsVTDstrategy}, Filling -> Axis, 
   PlotLegends -> {"S&P500 Index", "Strategy"}, 
   PlotLabel -> Style["Value of $1,000", Bold], ImageSize -> Medium];
Print[P1];
Print[P2];

Hi Rohit,

VTD stands for "Value of $1,000"

So VTDSPX is the compounded value of $1,000 if "invested" in the S&P 500 index. VTDStrategy is the compounded value of $1,000 if invested in the strategy.

ts(anything) stands for "time series". So tsVTDStrategy is VTDStrategy formulated as a time series.

Hope this helps.

So VTDStrategy and VTDSPX are not functions - they're variables whose value is set using FoldList (as I recall).
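
As an illustration of the FoldList idea, here is a minimal sketch with made-up returns (exact fractions are used so the arithmetic is easy to check). Note that FoldList with a starting value returns one more element than the list of returns, which matters when the result is paired with a date list.

```mathematica
(* Hypothetical daily returns: +1%, -2%, +3% *)
returns = {1/100, -1/50, 3/100};

(* Running product of (1 + return), starting from 1, scaled to $1,000 *)
vtd = 1000*FoldList[Times, 1, 1 + returns]   (* {1000, 1010, 4949/5, 509747/500} *)

N[vtd]

(* One more value than there are returns *)
Length /@ {returns, vtd}   (* {3, 4} *)
```

If the date list and the compounded series differ in length, Transpose[{datelist, ...}] will fail, so the lengths are worth checking first.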

Hi Rohit, so, according to Dr. Kinlay's answer, I assume that the code for tsVTDSPX will be the same as for tsVTDStrategy, just without the quantiles:

strategyreturns = tsSPXReturns["Values"];
strategyreturns[[bottompercentile]] = (1/2)*
   strategyreturns[[bottompercentile]];
strategyreturns[[toppercentile]] = 2*strategyreturns[[toppercentile]];
tsVTDstrategy = 
 TimeSeries[
  Transpose[{datelist, 1000*FoldList[Times, 1, 1 + strategyreturns]}]]
marketreturns = tsSPXReturns["Values"];
tsVTDSPX = 
 TimeSeries[
  Transpose[{datelist, 1000*FoldList[Times, 1, 1 + marketreturns]}]]

This is with a leverage factor of 2. However, it is not working for me: I get a Transpose error.

More generally:

leveragefactor = 2.0;
strategyreturns = tsSPXreturns["Values"];
strategyreturns[[bottompercentile]] = (1/leveragefactor)*
   strategyreturns[[bottompercentile]];
strategyreturns[[toppercentile]] = 
  leveragefactor*strategyreturns[[toppercentile]];
tsVTDSPX = 
  TimeSeries[
   Transpose[{datelist, 
     1000*FoldList[Times, 1, 1 + tsSPXreturns["Values"]]}]];
tsVTDstrategy = 
  TimeSeries[
   Transpose[{datelist, 
     1000*FoldList[Times, 1, 1 + strategyreturns]}]];

The quantiles are found as follows:

percentiles = Quantile[tsSIchange, {1/3, 2/3}];
bottompercentile = 
  Flatten[Position[tsSIchange["Values"], 
    x_ /; x < percentiles[[1]]]];
toppercentile = 
  Flatten[Position[tsSIchange["Values"], x_ /; x > percentiles[[2]]]]

So we set the strategy returns equal to the S&P 500 Index returns, except in the bottom 1/3 quantile of the sentiment indicator, where we reduce leverage by 1/leveragefactor, and in the top 1/3 quantile of the sentiment indicator, where we increase leverage and returns by multiplying by the leverage factor.

What we are saying is: we will only adjust our strategy away from the market portfolio when the sentiment indicator is in the bottom 1/3 or top 1/3 quantile. In those periods we halve or double our exposure, respectively.
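
The quantile rule can be checked by hand on a toy example. This is only a sketch: the sentiment changes and market returns below are made up, plain lists are used in place of the time series, and the variable names are hypothetical.

```mathematica
leveragefactor = 2;

(* Made-up sentiment changes and market returns for six days *)
sentimentChange = {-3, 5, 0, 2, -1, 4};
marketReturns = {1/100, 1/50, -1/100, 1/100, -1/50, 3/100};

(* Thresholds for the bottom and top thirds *)
{qLow, qHigh} = Quantile[sentimentChange, {1/3, 2/3}]   (* {-1, 2} *)

(* Days strictly below the lower threshold or strictly above the upper one *)
bottom = Flatten[Position[sentimentChange, x_ /; x < qLow]]   (* {1} *)
top = Flatten[Position[sentimentChange, x_ /; x > qHigh]]     (* {2, 6} *)

(* Start from the market returns and rescale only the selected days *)
stratReturns = marketReturns;
stratReturns[[bottom]] = stratReturns[[bottom]]/leveragefactor;
stratReturns[[top]] = leveragefactor*stratReturns[[top]];

stratReturns   (* {1/200, 1/25, -1/100, 1/100, -1/50, 3/50} *)
```

On all other days the strategy return equals the market return, which is the "stay with the market portfolio" case described above.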

Thank you very much for the thorough explanation, Dr. Kinlay! I have finally built a model based on the WSJ articles from the last three years. My CAGR for the strategy is just 0.4 in comparison to 0.3 for the index, but I assume that is due to the smaller dataset I used. I will now try to do the same with Reddit posts.

I have one last question: while building the table of conditional S&P 500 returns, you apply Rest to the date list (Rest@datelist):

Print[Style["Conditional S&P500 Index Returns Distribution", 
  "Subsection"]]; Print[
 Style[NumberForm[
   TableForm[
    Transpose@{Through[{Mean, Median, StandardDeviation, Skewness, 
         Kurtosis}[tsSPXReturns]], 
      Through[{Mean, Median, StandardDeviation, Skewness, Kurtosis}[
        tsSPXReturns[(Rest@datelist)[[bottompercentile]]]]],
      Through[{Mean, Median, StandardDeviation, Skewness, Kurtosis}[
        tsSPXReturns[(Rest@datelist)[[toppercentile]]]]]},
    TableHeadings -> {{"Mean", "Median", "St.Dev.", "Skewness", 
       "Kurtosis"}, {"All", "Lower Quantile", "Upper Quantile"}}, 
    TableAlignments -> Right], {6, 4}],
  FontSize -> 14]]

May I ask why?

Also, regarding investing USD 1,000 in the index: you invest daily, buying when the market opens and selling when it closes, right? When you say that you increase the investment by a leverage factor of 2 on optimistic days, does that mean you increase the investment to USD 1,002, and in the reverse situation decrease it to USD 998?

Thank you very much for your time.

Roman

Roman, without going through all the code I don't recall specifically the reason for using Rest@datelist. But I would imagine it's because the length of the returns vector is one less than the number of dates (because you are differencing the log price series).
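
A quick way to see the length difference, using made-up prices (a sketch only; the variable names are hypothetical):

```mathematica
(* Four hypothetical dates and closing prices *)
dates = Table[DateObject[{2012, 1, n}], {n, 3, 6}];
prices = {100., 101., 99., 102.};

(* Differencing the log prices gives one return fewer than there are prices *)
logReturns = Differences[Log[prices]];
Length /@ {dates, logReturns}   (* {4, 3} *)

(* Dropping the first date aligns each return with the day on which it was realized *)
tsReturns = TimeSeries[Transpose[{Rest[dates], logReturns}]];
```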

Increasing leverage by 2x means that you borrow 100% of your capital and invest 2x the amount of capital, as in Regulation T (so $1,000 becomes $2,000). So you earn twice the return.

Decreasing leverage by 1/2 means that you deploy only 50% of your capital in that period ($1,000 becomes $500), so you earn one half of the return in that period.

Put simply, the idea is to push more chips onto the table when market sentiment is very positive and reduce your exposure when it's very negative.

More generally, during periods of positive market sentiment you increase capital deployed and returns by a factor = leveragefactor.

During periods of negative market sentiment you reduce allocated capital by a factor = 1/leveragefactor.

leveragefactor need not equal 2, and indeed in the dynamic plot I show the effect of using alternate leverage factors.
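
As a worked example of the arithmetic, assuming a hypothetical day on which the index returns 1% and a starting capital of $1,000:

```mathematica
capital = 1000;
dailyReturn = 1/100;   (* hypothetical 1% index return *)

(* Profit for the day at a given leverage: capital deployed times the index return *)
pnl[leverage_] := capital*leverage*dailyReturn

pnl /@ {1/2, 1, 2}   (* {5, 10, 20} *)
```

So with a leverage factor of 2 the position is $2,000 rather than $1,002, and the day's gain is $20 instead of $10; at 1/2 leverage the position is $500 and the gain is $5.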

Posted 2 years ago

Hi Roman,

I came across this while searching for something else. Too late to enter the competition, but I thought you might be interested in using it as another source of data for your analysis. It includes sentiment classification of news stories from Thomson Reuters.

Rohit
