# Need help with Market Sentiment Analysis

Posted 2 years ago
4770 Views | 26 Replies | 8 Total Likes
Dear Wolfram Users,

I am new to Mathematica and I want to replicate this study in order to learn how to do sentiment analysis on website data: https://community.wolfram.com/groups/-/m/t/970999

However, the code does not seem to work for me. Maybe it was written for an older version of Mathematica? For example:

    archive = Import[StringJoin["http://www.wsj.com/public/page/archive-",
        DateString[datelist[[1]], {"Year", "-", "MonthShort", "-", "DayShort"}], ".html"]];
    archive = StringDrop[archive,
       StringPosition[archive,
         DateString[datelist[[1]], {"MonthName", " ", "DayShort", ", ", "Year"}]][[1, 2]]];
    archive = StringTake[archive, -1 + StringPosition[archive, "ARCHIVE FILTER"][[1, 1]]];
    archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
    TakeLargest[Counts[archivewords], 20]
    WordCloud[archivewords]

This code should download news from the WSJ archive and format it for machine learning purposes, but somehow it is not working. Can somebody help me replicate it in a way that works today? I also think one of the problems is that the code can no longer switch between archive dates. Or maybe there is a newer version of the article?

Thank you very much.
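The slicing step in the code above can be tried offline on a made-up string. This is only a minimal sketch, not the article's code: the `page` text and both marker strings are invented for illustration, and the real WSJ page may be laid out differently.

```wolfram
(* Toy stand-in for the imported page text; the content is invented for illustration *)
page = "Site header. News Archive for January 3, 2012 Stocks rally. Oil slips. ARCHIVE FILTER Site footer.";

startMarker = "January 3, 2012";  (* text that precedes the headlines *)
endMarker = "ARCHIVE FILTER";     (* text that follows the headlines *)

(* StringPosition returns {{start, end}, ...}; [[1, 2]] is the end of the first match,
   so StringDrop removes everything up to and including the start marker *)
body = StringDrop[page, StringPosition[page, startMarker][[1, 2]]];

(* [[1, 1]] is the start of the first match; keep only the characters before it *)
body = StringTake[body, StringPosition[body, endMarker][[1, 1]] - 1];

StringTrim[body]   (* "Stocks rally. Oil slips." *)
```

If either marker is absent from the page, `StringPosition` returns an empty list and the `[[1, 2]]` or `[[1, 1]]` extraction fails, which is one way a changed page layout can break this kind of code.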
Posted 2 years ago
Roman,

Looks like the URL for the archive as well as the format has changed. Try this:

    datelist = {DateObject[{2012, 1, 3}]};
    archive = Import[StringJoin["https://www.wsj.com/news/archive/",
        DateString[datelist[[1]], {"Year", "Month", "Day"}]]];
    archive = StringDrop[archive,
       StringPosition[archive,
         DateString[datelist[[1]], {"MonthNameShort", " ", "DayShort", " ", "Year"}]][[1, 2]]];
    archive = StringTake[archive, -1 + StringPosition[archive, "Most Popular Articles"][[1, 1]]];
    archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
    TakeLargest[Counts[archivewords], 20]

    <|"new" -> 34, "2012" -> 33, "markets" -> 27, "business" -> 24, "year" -> 22,
      "u.s." -> 21, "asia" -> 18, "opinion" -> 16, "2011" -> 15, "said" -> 12,
      "$" -> 11, "york" -> 10, "economic" -> 10, "iowa" -> 10, "news" -> 10,
      "economy" -> 10, "biggest" -> 10, "day" -> 9, "data" -> 9, "december" -> 9|>

    WordCloud[archivewords]

Posted 2 years ago

Thank you very much for your help, your code works perfectly!

If you have time, could you please explain how you worked out which format the date should be written in? And also, how exactly does this line work?

    archive = StringTake[archive, -1 + StringPosition[archive, "Most Popular Articles"][[1, 1]]]

Does it somehow take us into the "Most Popular Articles" section and get the titles from there?

Posted 2 years ago

"could you please explain, how did you understood in which format the date should be written."

Nothing fancy. I did a Google search for "wall street journal archives" and this showed up in the results. So the date format in the URL is YYYYMMDD, which is {"Year", "Month", "Day"} in WL.

The page body has some text unrelated to the archive, followed by "News Archive for Jun 25 2019". So the actual archive data follows a date matching {"MonthNameShort", " ", "DayShort", " ", "Year"}. The archive data is followed by "Most Popular Articles" and other text unrelated to the archive. So the actual archive data is between those two strings. The code extracts that portion from the page body.

Posted 2 years ago

Hi Roman,

Glad you got it working. Rohit is right, the format changed (which happens quite frequently with news sites, unfortunately).

Let me know if you run into other issues.

Jonathan

Posted 2 years ago

Thank you very much for your reply, Jonathan.

I find your article very interesting and inspiring. Unfortunately, I am not a proficient Wolfram user. However, I am studying finance now, and I would really like to learn how you did this sentiment analysis.

With Rohit's help I went through the first part of the code, and now I am having some trouble with WSJSentimentIndicator:

    WSJSentimentIndicator[date_] := Module[{d = date, archive, archivewords, WSJSI},
      archive = Import[StringJoin["http://www.wsj.com/public/page/archive-",
         DateString[d, {"Year", "-", "MonthShort", "-", "DayShort"}], ".html"]];
      archive = StringDrop[archive,
        StringPosition[archive,
          DateString[d, {"MonthName", " ", "DayShort", ", ", "Year"}]][[1, 2]]];
      archive = StringTake[archive, -1 + StringPosition[archive, "ARCHIVE FILTER"][[1, 1]]];
      archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
      WSJSI = #Positive/(#Negative + #Positive) &@ Counts[Classify["Sentiment", archivewords]] // N;
      {WSJSI, archivewords, archive}]

So, if we update the code, it should look like this:

    WSJSentimentIndicator[date_] := Module[{d = date, archive, archivewords, WSJSI},
      archive = Import[StringJoin["https://www.wsj.com/news/archive/",
         DateString[d, {"Year", "Month", "Day"}]]];
      archive = StringDrop[archive,
        StringPosition[archive,
          DateString[d, {"MonthNameShort", " ", "DayShort", " ", "Year"}]][[1, 2]]];
      archive = StringTake[archive, -1 + StringPosition[archive, "Most Popular Articles"][[1, 1]]];
      archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
      WSJSI = #Positive/(#Negative + #Positive) &@ Counts[Classify["Sentiment", archivewords]] // N;
      {WSJSI, archivewords, archive}]

However, the code that should return the histogram does not work for me:

    WSJSI = Flatten[First@WSJSentimentIndicator[#] & /@ datelist]
    Histogram[tsWSJSI, PlotLabel -> Style["Histogram of WSJ Sentiment indicator", Bold]]

If you have time, could you please help me solve this one? I have a feeling that WSJSI does not account for 'datelist' correctly.

Posted 2 years ago

Hi Roman,

My bad. The second DateString format should be {"MonthNameShort", " ", "DayShort", ", ", "Year"}. For eight days:

    datelist = Table[DateObject[{2012, 1, n}], {n, 3, 10}];
    WSJSI = Flatten[First@WSJSentimentIndicator[#] & /@ datelist]
    tsWSJSI = TimeSeries[Transpose[{datelist, WSJSI}]]
    Histogram[tsWSJSI, PlotLabel -> Style["Histogram of WSJ Sentiment indicator", Bold]]

Posted 2 years ago

Thank you very much, Rohit!

And the module code, you didn't change it? Does it look like this for you?

    WSJSentimentIndicator[date_] := Module[{d = date, archive, archivewords, WSJSI},
      archive = Import[StringJoin["https://www.wsj.com/news/archive/",
         DateString[d, {"Year", "Month", "Day"}]]];
      archive = StringDrop[archive,
        StringPosition[archive,
          DateString[d, {"MonthNameShort", " ", "DayShort", ", ", "Year"}]][[1, 2]]];
      archive = StringTake[archive, -1 + StringPosition[archive, "Most Popular Articles"][[1, 1]]];
      archivewords = ToLowerCase[DeleteStopwords[TextWords[archive]]];
      WSJSI = #Positive/(#Negative + #Positive) &@ Counts[Classify["Sentiment", archivewords]] // N;
      {WSJSI, archivewords, archive}]

Posted 2 years ago

Hi Roman,

Sorry for the late response.

Yes, that is the code I used. I copied it from your post into a notebook, evaluated it, and tested it with

    WSJSentimentIndicator[DateObject[{2012, 1, 3}]]

The first time I evaluated it, it failed because Classify returned an empty association. Subsequent evaluations worked fine. I am not sure why; perhaps it has to retrieve some data from Wolfram servers and that timed out? If it happens frequently, the code can easily be modified to retry the Classify.
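One way to guard against an empty or incomplete classification result is to look up the class counts with a default of zero instead of using `#Positive` and `#Negative` directly. The following is a sketch under assumptions: `sentimentRatio` is a hypothetical helper name, not part of the original article, and it takes the list of labels that `Classify["Sentiment", archivewords]` is assumed to return.

```wolfram
(* Hypothetical helper: share of "Positive" among the Positive and Negative labels *)
sentimentRatio[labels_List] := Module[{c = Counts[labels], pos, neg},
  pos = Lookup[c, "Positive", 0];  (* 0 when the key is absent *)
  neg = Lookup[c, "Negative", 0];
  If[pos + neg == 0,
    Missing["NoSentiment"],        (* nothing to score: signal it explicitly *)
    N[pos/(pos + neg)]]]

sentimentRatio[{"Positive", "Positive", "Negative", "Neutral"}]  (* 2/3 as a machine number *)
sentimentRatio[{}]                                               (* Missing["NoSentiment"] *)
```

A `Missing[...]` value is easy to detect afterwards, so the affected dates can be retried or dropped before building the time series.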
Posted 2 years ago

Thank you very much, Rohit! I now understand why I have problems with the computation: Mathematica gives me fractured output every time. For example:

    {0.6781, #Positive/(#Negative + #Positive), #Positive/(#Negative + #Positive), 0.7890, 0.6905,
     #Positive/(#Negative + #Positive), #Positive/(#Negative + #Positive)}

I have to repeat the calculation 5-6 times to get the full output. I wanted to run the analysis on a couple of years of data, but I can barely run it for 8 days (the calculation takes around 5 minutes and does not always give full output). I think the problem is either my PC, even though I have an i7 processor and 16 GB of RAM, or maybe the code is too heavy. Do you know if it is possible to add a line of code to

    WSJSI = Flatten[First@WSJSentimentIndicator[#] & /@ datelist]
    tsWSJSI = TimeSeries[Transpose[{datelist, WSJSI}]]
    Histogram[tsWSJSI, PlotLabel -> Style["Histogram of WSJ Sentiment indicator", Bold]]

to make it repeat the calculation over and over until it succeeds? Thank you very much for your time.

Posted 2 years ago

Hi Roman,

I took a closer look at why it fails. It is the Import that occasionally returns partial results. The return value has the preamble text and the header text for the archive, e.g. "News Archive for Jun 25, 2019", followed by blank lines, followed by "Most Popular Articles". The entire archive section is missing. It is not a bug in Import; I can reproduce it with a simple Ruby program.

    require 'nokogiri'
    require 'open-uri'

    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open("https://www.wsj.com/news/archive/20120105"))
    txt = doc.xpath("//*[@id='root']/div/div/div/div[2]/div/div/div[2]/div[1]/div/div").text
    puts txt

My guess is that the ads and other asynchronous JavaScript have to run before the archive section is rendered.

To work around this, use a helper function and retry the Import operation until it succeeds. A better way would be to use WebExecute if you are on version 12. You could experiment with that.

As for performance, the Import is quite slow, taking ~15 s to complete. With the workaround you can just increase maxRetries and leave it running overnight.

Workaround:

    ClearAll[downloadArchiveWords, WSJSentimentIndicator]

    downloadArchiveWords[date_] := Module[{end = "Most Popular Articles", url, archive},
      url = StringJoin["https://www.wsj.com/news/archive/", DateString[date, {"Year", "Month", "Day"}]];
      archive = Import[url];
      archive = StringDrop[archive,
        StringPosition[archive,
          DateString[date, {"MonthNameShort", " ", "DayShort", ", ", "Year"}]][[1, 2]]];
      archive = StringTake[archive, -1 + StringPosition[archive, end][[1, 1]]];
      {ToLowerCase[DeleteStopwords[TextWords[archive]]], archive}];

    WSJSentimentIndicator[date_, maxRetries : _?Positive : 5] :=
     Module[{retries, archivewords, archive, WSJSI},
      retries = 0;
      archivewords = "";
      (* retry the download until some words come back or maxRetries is reached *)
      While[Length@archivewords == 0 && retries < maxRetries,
       {archivewords, archive} = downloadArchiveWords[date];
       retries++;
       Pause[1]];
      WSJSI = #Positive/(#Negative + #Positive) &@Counts[Classify["Sentiment", archivewords]] // N;
      {WSJSI, archivewords, archive}];

Posted 2 years ago

Thank you very much, Rohit! I will try running the workaround tonight, and I will check out WebExecute!

Posted 2 years ago

Hello Rohit! The code you proposed worked fine! I tried it on the WSJ articles for the last 3 years, and out of 779 values I had output for 710, with 69 Null values, which is fine. The rest I calculated manually afterwards, because the complex computations take too much time.

I am now going through the last part of the paper, the construction of the trading algorithm. Do you have any idea what tsVTDSPX and tsStrategy stand for?

    period = QuantityMagnitude@DateDifference[First@datelist, Last@datelist, "Year"];
    AnnStd = Sqrt[252]*StandardDeviation[Transpose@{tsSPXreturns["Values"], strategyreturns}];
    cf = {tsVTDSPX[Last@datelist]/1000 - 1, tsVTDstrategy[Last@datelist]/1000 - 1};
    CAGR = -1 + (1 + cf)^(1/period);
    IR = CAGR/AnnStd;
    Print[Style["News Sentiment Strategy", "Subsection"]];
    P1 = Style[
       NumberForm[
        TableForm[{CAGR, AnnStd, IR},
         TableHeadings -> {{"CAGR", "Ann. StDev.", "IR"},
           {Style["SP500 Index", Bold], Style["Strategy", Bold]}}], {6, 2}],
       FontSize -> 14];
    P2 = DateListPlot[{tsVTDSPX, tsVTDstrategy}, Filling -> Axis,
       PlotLegends -> {"S&P500 Index", "Strategy"},
       PlotLabel -> Style["Value of $1,000", Bold], ImageSize -> Medium];
    Print[P1];
    Print[P2];

In the next computation, Dr. Kinlay defines tsStrategy as tsSPXReturns; however, as for tsVTDSPX, I have no idea.
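The CAGR and IR lines in the code above can be checked by hand on made-up numbers. This is a small sketch, assuming a hypothetical final value of $1,210 after exactly 2 years and an invented annualized standard deviation of 20%; none of these figures come from the article.

```wolfram
(* All inputs are invented for illustration *)
finalValue = 1210;   (* value of $1,000 at the end of the period *)
period = 2;          (* length of the period in years *)
annStd = 1/5;        (* annualized standard deviation, 20% *)

cf = finalValue/1000 - 1;          (* cumulative return: 21/100 *)
cagr = -1 + (1 + cf)^(1/period);   (* Sqrt[121/100] - 1 = 1/10 *)
ir = cagr/annStd;                  (* return per unit of risk: 1/2 *)

{cf, cagr, ir}   (* {21/100, 1/10, 1/2} *)
```

Exact rationals are used here so the arithmetic is easy to follow; with real data the inputs would be machine numbers.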
Posted 2 years ago
Hi Roman,

I am sorry, I have no idea what tsVTDSPX and tsVTDstrategy are. I could not find their definitions in Jonathan's original post. You will have to wait for a response from him.

Rohit
Posted 2 years ago
Dear Dr. Kinlay,

I will be writing a course paper in international finance soon, and I would like to replicate your market sentiment analysis, but over a different timeframe and also using data from Twitter and Reddit. May I ask whether I may use your code and reference your article in my paper?

Thank you very much for your time.

Roman
Posted 2 years ago
 Roman, Yes sure, no problem.
Posted 2 years ago
Dear Dr. Kinlay,

I am now replicating the last part of your study. If you have time, could you please explain what the tsVTDSPX and tsVTDStrategy functions are?

    period = QuantityMagnitude@DateDifference[First@datelist, Last@datelist, "Year"];
    AnnStd = Sqrt[252]*StandardDeviation[Transpose@{tsSPXreturns["Values"], strategyreturns}];
    cf = {tsVTDSPX[Last@datelist]/1000 - 1, tsVTDstrategy[Last@datelist]/1000 - 1};
    CAGR = -1 + (1 + cf)^(1/period);
    IR = CAGR/AnnStd;
    Print[Style["News Sentiment Strategy", "Subsection"]];
    P1 = Style[
       NumberForm[
        TableForm[{CAGR, AnnStd, IR},
         TableHeadings -> {{"CAGR", "Ann. StDev.", "IR"},
           {Style["SP500 Index", Bold], Style["Strategy", Bold]}}], {6, 2}],
       FontSize -> 14];
    P2 = DateListPlot[{tsVTDSPX, tsVTDstrategy}, Filling -> Axis,
       PlotLegends -> {"S&P500 Index", "Strategy"},
       PlotLabel -> Style["Value of $1,000", Bold], ImageSize -> Medium];
    Print[P1];
    Print[P2];

Posted 2 years ago

Hi Rohit,

VTD stands for "Value of $1,000".

So VTDSPX is the compounded value of $1,000 if "invested" in the S&P 500 index. VTDStrategy is the compounded value of $1,000 if invested in the strategy.

ts(anything) stands for "time series". So tsVTDStrategy is VTDStrategy formulated as a time series.

Hope this helps.
Posted 2 years ago
 So VTDStrategy and VTDSPX are not functions - they're variables whose value is set using FoldList (as I recall).
Posted 2 years ago
Hi Rohit,

So, according to Dr. Kinlay's answer, I assume that the code for tsVTDSPX will be the same as for tsVTDStrategy, just without the quantiles:

    strategyreturns = tsSPXReturns["Values"];
    strategyreturns[[bottompercentile]] = (1/2)*strategyreturns[[bottompercentile]];
    strategyreturns[[toppercentile]] = 2*strategyreturns[[toppercentile]];
    tsVTDstrategy = TimeSeries[Transpose[{datelist, 1000*FoldList[Times, 1, 1 + strategyreturns]}]]

    marketreturns = tsSPXReturns["Values"];
    tsVTDSPX = TimeSeries[Transpose[{datelist, 1000*FoldList[Times, 1, 1 + marketreturns]}]]

This is with a leverage factor of 2. However, it is not working for me: I get a Transpose error.
Posted 2 years ago
More generally:

    leveragefactor = 2.0;
    strategyreturns = tsSPXreturns["Values"];
    strategyreturns[[bottompercentile]] = (1/leveragefactor)*strategyreturns[[bottompercentile]];
    strategyreturns[[toppercentile]] = leveragefactor*strategyreturns[[toppercentile]];
    tsVTDSPX = TimeSeries[Transpose[{datelist, 1000*FoldList[Times, 1, 1 + tsSPXreturns["Values"]]}]];
    tsVTDstrategy = TimeSeries[Transpose[{datelist, 1000*FoldList[Times, 1, 1 + strategyreturns]}]];
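One possible source of a Transpose error with this kind of code is a length mismatch: `FoldList` returns one more element than the list it folds over, so the list of dates has to be exactly one longer than the list of returns. Here is a toy sketch with invented dates and returns; the names `dates`, `returns`, `values`, and `tsValues` are illustrative and not taken from the article.

```wolfram
(* Invented data: 4 dates, 3 period returns *)
dates = Table[DateObject[{2012, 1, n}], {n, 3, 6}];
returns = {1/10, -1/10, 1/5};

(* Compound $1,000: FoldList keeps the starting value,
   so the result has Length[returns] + 1 elements *)
values = 1000*FoldList[Times, 1, 1 + returns]   (* {1000, 1100, 990, 1188} *)

(* Transpose needs both lists to have the same length *)
Length[dates] == Length[values]                 (* True *)

tsValues = TimeSeries[Transpose[{dates, values}]];
```

Comparing `Length[datelist]` with the length of the `FoldList` result is a quick way to check whether this is what goes wrong in a given notebook.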
Posted 2 years ago
The quantiles are found as follows:

    percentiles = Quantile[tsSIchange, {1/3, 2/3}];
    bottompercentile = Flatten[Position[tsSIchange["Values"], x_ /; x < percentiles[[1]]]];
    toppercentile = Flatten[Position[tsSIchange["Values"], x_ /; x > percentiles[[2]]]]

So we set the strategy returns equal to the S&P 500 Index returns, except in the bottom 1/3 quantile of the sentiment indicator, where we reduce leverage by 1/leveragefactor, and in the top 1/3 quantile of the sentiment indicator, where we increase leverage and returns by multiplying by the leverage factor.

What we are saying is: we will only adjust our strategy away from the market portfolio when the sentiment indicator is in the bottom 1/3 or top 1/3 quantile. In those periods we halve or double our exposure, respectively.
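The selection and scaling steps can be followed on a handful of made-up numbers. This is a minimal sketch with invented sentiment changes and returns for six periods; `siChange`, `returns`, `bottom`, `top`, and `strategy` are illustrative names standing in for the article's variables.

```wolfram
(* Invented sentiment changes and returns for six periods *)
siChange = {-3, -1, 0, 2, 5, 7};
returns = {1/100, 1/50, -1/100, 0, 3/100, -1/50};
leveragefactor = 2;

percentiles = Quantile[siChange, {1/3, 2/3}];   (* {-1, 2} *)

(* Positions strictly below the lower cut and strictly above the upper cut *)
bottom = Flatten[Position[siChange, x_ /; x < percentiles[[1]]]];   (* {1} *)
top = Flatten[Position[siChange, x_ /; x > percentiles[[2]]]];      (* {5, 6} *)

strategy = returns;
strategy[[bottom]] = strategy[[bottom]]/leveragefactor;   (* halve exposure *)
strategy[[top]] = leveragefactor*strategy[[top]];         (* double exposure *)

strategy   (* {1/200, 1/50, -1/100, 0, 3/50, -1/25} *)
```

Because the comparisons are strict, periods whose sentiment change equals a cut point keep the market return, so in a small sample the two groups need not each contain exactly one third of the periods.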
Posted 2 years ago
Thank you very much for the thorough explanation, Dr. Kinlay! I have finally built a model based on WSJ articles from the last three years. My CAGR for the strategy is just 0.4 compared with 0.3 for the index, but I assume that is due to the smaller dataset I used.

I will now try to do the same with Reddit posts.

I have one last question. While building the table of S&P500 conditional returns, you use the command "Rest", as in Rest@datelist:

    Print[Style["Conditional S&P500 Index Returns Distribution", "Subsection"]];
    Print[
     Style[NumberForm[
       TableForm[
        Transpose@{
          Through[{Mean, Median, StandardDeviation, Skewness, Kurtosis}[tsSPXReturns]],
          Through[{Mean, Median, StandardDeviation, Skewness, Kurtosis}[
            tsSPXReturns[(Rest@datelist)[[bottompercentile]]]]],
          Through[{Mean, Median, StandardDeviation, Skewness, Kurtosis}[
            tsSPXReturns[(Rest@datelist)[[toppercentile]]]]]},
        TableHeadings -> {{"Mean", "Median", "St.Dev.", "Skewness", "Kurtosis"},
          {"All", "Lower Quantile", "Upper Quantile"}},
        TableAlignments -> Right], {6, 4}], FontSize -> 14]]

May I ask why?

I also have a question about where you invest 1000 USD into the index. You invest the money daily, putting it in when the market opens and taking it out when the market closes, right? When you say that you increase the investment by a leverage factor of 2 on the optimistic days, does that mean that you increase the investment to USD 1002? And in the reverse situation, you decrease it to USD 998?

Thank you very much for your time.

Roman
Posted 2 years ago
Roman, without going through all the code I don't recall specifically the reason for using Rest@datelist. But I would imagine it's because the length of the returns vector is one less than the number of dates (because you are differencing the log price series).

Increasing leverage by 2x means that you borrow 100% of your capital and invest 2x the amount of capital, as under Regulation T (so $1,000 becomes $2,000). So you earn twice the return.

Decreasing leverage by 1/2 means that you deploy only 50% of your capital in that period ($1,000 becomes $500), so you earn one half of the return in that period.

Put simply, the idea is to push more chips onto the table when market sentiment is very positive and reduce your exposure when it's very negative.
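Both points can be illustrated with a few invented numbers. This sketch assumes that returns are computed as differences of log prices, as suggested above; the prices, dates, and the 1% period return are made up for the example.

```wolfram
(* Invented prices for three dates *)
dates = Table[DateObject[{2012, 1, n}], {n, 3, 5}];
prices = {100, 110, 121};

(* Differencing the log prices loses one observation,
   so the returns line up with Rest[dates], not with dates *)
logReturns = Differences[Log[prices]];
{Length[dates], Length[logReturns], Length[Rest[dates]]}   (* {3, 2, 2} *)

(* Dollar gain on $1,000 of capital for a 1% period return
   at 1x, 2x and 1/2x exposure *)
capital = 1000;
periodReturn = 1/100;
gains = capital*{1, 2, 1/2}*periodReturn   (* {10, 20, 5} *)
```

So at 2x the position is $2,000 rather than $1,002, and a 1% move earns $20 on the original $1,000 of capital; at 1/2x the same move earns $5.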