
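A New Tool for Forecasting Tech Stocks: Combining NLP Sentiment Analysis with Machine Learning (Part 1)

Author: 老余捞鱼

This is original work. If you repost it, please credit the source and the original author.

A note up front: for work reasons, I have recently been studying in depth how to combine NLP sentiment analysis with machine learning for tech-stock investing. This article shares the approach I use: analyze social media with a specific algorithm, set the results against price action, and then use machine learning models to judge market sentiment, with the aim of predicting risk trends in tech stocks. I hope that by the end you will understand the strategy well enough to make better-informed decisions about tech-stock investments.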

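1. Structure of this article

Several studies have shown that financial news headlines, especially those about company earnings and new product launches, directly affect stock prices. Sentiment analysis of financial markets, for example, lets analysts spot potential risks early by monitoring rolling TV (and online) news for unexpected events or negative sentiment. This proactive approach lets them adjust their strategy before an adverse market reaction sets in.

Over the past two years, ML algorithms have been widely applied to analyzing financial news and predicting stock values. It appears that NLP tools can be combined with supervised ML and deep learning (DL) techniques to build a sentiment analysis model that extracts and quantifies subjective information from text data.

Here is a real-world example. Moody's, the well-known investment rating firm, launched a new product in November of this year on using GenAI and real-time news sentiment to support companies, and it does not come cheap. Its introduction page is here: https://www.moodys.com/web/en/us/insights/digital-transformation/the-power-of-news-sentiment-in-modern-financial-analysis.html

The goal of my research is to describe and compare various ML methods for stock-market prediction, together with NLP sentiment analysis of scraped (online) news headlines. I then work out which integrated NLP-ML methods can serve as leading indicators of stock performance, so they can be used to improve trading strategies and strengthen portfolio risk management.

This series offers, free of charge, a more complete code-level solution than that product; all it asks is the patience to read the whole series. There is a lot of material, and I want to go into enough detail for readers to digest it step by step, so the article will probably run to three parts. The main contents are: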

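In this part:

  • Introductory notes
  • Weekly scalar sentiment score (use case: NVDA stock)
  • Weekly vector sentiment score (use case: AMZN stock)

In part two:

  • Vector sentiment scores from scraped recent news (use case: AMZN stock)
  • Daily mean vector sentiment scores for 11 tech stocks
  • How to use sentiment analysis to make informed investment decisions

In the final part:

  • Stock price prediction (use case: GRU stock)
  • Market sentiment analysis and prediction with MLBC (use case: the Dow Jones index)
  • Final review of the whole series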

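2. Libraries and tools used in this project

  • beautifulsoup4: a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic idioms for iterating over, searching, and modifying the parse tree.
  • nltk: a Python package for NLP.
  • matplotlib: the basic plotting library for Python.
  • Plotly: interactive visualization for Python.
  • seaborn: statistical data visualization.
  • newspaper3k: article scraping and curation.
  • GoogleNews: collects Google News text.
  • wordcloud: a small word-cloud generator.
  • yfinance: an open-source tool that uses Yahoo's public APIs.
  • Keras: prototyping, research, and deployment of deep learning models in an intuitive, streamlined way.
  • sklearn: a Python module for ML built on top of SciPy.
  • SciKit-Plot: an intuitive library that adds plotting functionality to scikit-learn objects.
  • Yellowbrick: a suite of visual analysis and diagnostic tools for ML.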

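3. Weekly scalar sentiment score

I use NVDA stock as the example and pull the company's financial news headlines for the week from https://finviz.com.

Import the necessary Python libraries and define tickers_list

# pip takes space-separated package names; lxml is the default parser behind pd.read_html
!pip install beautifulsoup4 nltk lxml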

import pandas as pd
from datetime import datetime

import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import requests


tickers_list = ['NVDA']

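Fetch this week's news headlines with BeautifulSoup

from io import StringIO  # recent pandas versions deprecate passing a literal HTML string to read_html

news = pd.DataFrame()

for ticker in tickers_list:
    url = f'https://finviz.com/quote.ashx?t={ticker}&p=d'
    ret = requests.get(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'},
    )

    html = BeautifulSoup(ret.content, "html.parser")

    try:
        # Parse the table whose class is 'fullview-news-outer' into a DataFrame
        df = pd.read_html(
            StringIO(str(html)),
            attrs={'class': 'fullview-news-outer'}
        )[0]
    except ValueError:
        # read_html raises ValueError when no matching table is found
        print(f"{ticker} No news found")
        continue

    df.columns = ['Date', 'Headline']

df.tail()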

   Date                  Headline
95 07:59AM 3 Semiconductor Stocks (and 1 ETF) That Could ...
96 07:12AM NVIDIA Corporation (NVDA) Launches AI Blueprin...
97 07:00AM Meet the Supercharged Growth Stock That Could ...
98 06:30AM 3 Artificial Intelligence ETFs to Buy for Long...
99 05:59AM Nvidia Stock Falls. Why Its Still a Top Pick i...

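Preprocess the text data as follows

# Split the scraped stamp into a date and a time. Rows that carry only a time
# (fewer than 8 characters) get a leading comma so the split yields an empty
# date, which ffill() then fills from the last row that had one.
dateNTime = df.Date.apply(lambda x: ','+x if len(x)<8 else x).str.split(r' |,', expand = True).replace("", None).ffill()

# Swap the raw 'Date' column for the separate 'Date' and 'Time' columns
df = pd.merge(df, dateNTime, right_index=True, left_index=True).drop('Date', axis=1).rename(columns={0:'Date', 1:'Time'})

# Drop placeholder "Loading" rows (and rows with no headline), then fix the column order
df = df[~df["Headline"].str.contains("Loading.", na=True)].loc[:, ['Date', 'Time', 'Headline']]

df["Ticker"] = ticker
news = pd.concat([news, df], ignore_index = True)

news.head()

The figure below shows the resulting NVDA headlines for the week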
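The one-line date handling above is compact but hard to read. The same idea can be sketched in plain Python, assuming the stamps look like the ones in this article (either a date plus a time, or a time on its own). The function name and the sample stamps are hypothetical, for illustration only.

```python
def split_finviz_stamp(stamps):
    """Split each stamp into (date, time), carrying the last seen date forward."""
    rows = []
    last_date = None
    for stamp in stamps:
        parts = stamp.split()
        if len(parts) == 2:
            # e.g. "Dec-17-24 04:39PM": a new date starts on this row
            last_date, time_part = parts
        else:
            # e.g. "07:59AM": time only, so reuse the previous date
            time_part = parts[0]
        rows.append((last_date, time_part))
    return rows


# Hypothetical stamps in the same shape as the scraped column
sample = ["Today 10:38AM", "10:26AM", "Dec-17-24 04:39PM", "03:41PM"]
for date_part, time_part in split_finviz_stamp(sample):
    print(date_part, time_part)
```

This is the forward-fill step written out by hand: a row without a date inherits the date of the nearest row above it.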

Apply SentimentIntensityAnalyzer to ['Headline'] using the vader_lexicon

nltk.download('vader_lexicon')
vader = SentimentIntensityAnalyzer()

scored_news = news.join(
  pd.DataFrame(news['Headline'].apply(vader.polarity_scores).tolist())
)
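For intuition about the compound column: as far as I can tell from VADER's reference implementation, the summed word valences of a headline are squashed into the range -1 to 1 with x / sqrt(x² + alpha), where alpha defaults to 15. The sketch below reproduces only that normalization step, not the lexicon lookup or VADER's rules for negation and emphasis; the raw sums are hypothetical.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into [-1, 1], VADER-style."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clip defensively, as the reference implementation does
    return max(-1.0, min(1.0, norm))


# Hypothetical raw valence sums for three headlines
for raw in (0.0, 1.9, -3.4):
    print(raw, round(normalize(raw), 4))
```

Because the curve flattens for large sums, piling on more strongly worded terms moves the compound score less and less, which keeps a single sensational headline from dominating an average.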

Plot the weekly sentiment score for NVDA

news_score = scored_news.loc[:, ['Ticker', 'Date', 'compound']].pivot_table(values='compound', index='Date', columns='Ticker', aggfunc='mean').ewm(15).mean()
news_score.dropna().plot(figsize=(10, 6),linewidth=4,kind='line',legend=True, fontsize=14)
plt.title("Weekly Sentiment Score for NVDA",fontsize=14)
plt.grid()

The weekly sentiment score for NVDA is shown in the figure below:
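The curve is smoothed by .ewm(15).mean(). The first positional argument of ewm is com (center of mass), so the smoothing factor is alpha = 1 / (1 + 15). A minimal pure-Python sketch of the same weighted mean follows, assuming pandas' default adjust=True and no missing values; the function name and the sample scores are hypothetical.

```python
def ewm_mean(values, com=15):
    """Exponentially weighted mean with pandas-style adjusted weights."""
    alpha = 1.0 / (1.0 + com)
    decay = 1.0 - alpha
    smoothed = []
    weighted_sum = 0.0   # sum of decay**i * x[t - i]
    weight_total = 0.0   # sum of decay**i
    for x in values:
        weighted_sum = x + decay * weighted_sum
        weight_total = 1.0 + decay * weight_total
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# Hypothetical daily mean compound scores
daily_scores = [0.10, 0.30, -0.20, 0.25]
print([round(v, 4) for v in ewm_mean(daily_scores)])
```

With com=15 each new observation gets a weight of only 1/16 in the long run, so the plotted line reacts slowly. A smaller com makes it follow the raw scores more closely.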

Next, we plot the percentage change of NVDA's weekly sentiment score

news_score.pct_change().dropna().plot(figsize=(10, 6),linewidth=4,kind='line',legend=True, fontsize=14)
plt.title("Percentage Change of Weekly Sentiment Score for NVDA",fontsize=14)
plt.grid()

The percentage change of NVDA's weekly sentiment score is shown in the figure below:
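One caveat when reading this chart: percentage change is (current - previous) / previous, so it becomes very large whenever the previous score sits close to zero. The sketch below illustrates that with hypothetical numbers. Unlike pandas, which returns NaN for the first element and inf on division by zero, this hand-written version returns None in both cases.

```python
def pct_change(values):
    """Fractional change from each element to the next; None where undefined."""
    changes = [None]  # the first element has no predecessor
    for prev, cur in zip(values, values[1:]):
        if prev == 0:
            changes.append(None)  # division by zero is undefined here
        else:
            changes.append((cur - prev) / prev)
    return changes


# The same absolute move of 0.02 from two different baselines (hypothetical)
print(pct_change([0.20, 0.22]))
print(pct_change([0.01, 0.03]))
```

A move of 0.02 is a modest change from a baseline of 0.20 but a very large one from 0.01, so spikes in the percentage-change plot often reflect a small denominator rather than a dramatic shift in sentiment.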

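With the code above as it stands, you can rerun this NLP sequence for NVDA at any time.

4. Weekly vector sentiment score

Following the same approach, we now extend the NLP analysis: we collect and analyze news for AMZN stock and compute the weekly vector sentiment score {i.e. negative, neutral, positive, compound}.

Import the Python libraries and read the news headlines for several stocks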

import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import pandas as pd
from urllib.request import urlopen, Request
from nltk.sentiment.vader import SentimentIntensityAnalyzer

web_url = 'https://finviz.com/quote.ashx?t='

news_tables = {}
tickers = ['AMZN', 'GOOG', 'TSLA']

for tick in tickers:
    url = web_url + tick
    req = Request(url=url,headers={"User-Agent": "Chrome"}) 
    response = urlopen(req)    
    html = BeautifulSoup(response,"html.parser")
    news_table = html.find(id='news-table')
    news_tables[tick] = news_table

amazon = news_tables['AMZN']
amazon_tr = amazon.findAll('tr')

for x, table_row in enumerate(amazon_tr):
    a_text = table_row.a.text
    td_text = table_row.td.text
    print(a_text)
    print(td_text)
    if x == 3:
        break

_________________________________________________________________

Donald Trump Jr.'s Net Worth Is 8 Figures See His Businesses and Investments

            Today 10:38AM
        
Belkin integrates Amazon Buy with Prime into its e-commerce site

            10:26AM
        
Why JPMorgan predicts an AI trade shift in 2025

            10:23AM
        
Amazon.com (AMZN) Set for Strong 2025, Says TD Cowen Analyst

            10:19AM

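Build the news list

news_list = []

for file_name, news_table in news_tables.items():
    date = None  # last date seen; rows that carry only a time reuse it
    for row in news_table.find_all('tr'):
        if row.a is None:  # skip rows that carry no headline link
            continue

        text = row.a.get_text()

        date_scrape = row.td.text.split()

        if len(date_scrape) == 1:
            # Time only: keep the date from the previous row
            time = date_scrape[0]

        else:
            date = date_scrape[0]
            time = date_scrape[1]

        tick = file_name.split('_')[0]

        news_list.append([tick, date, time, text])

print(news_list)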

[['AMZN', 'Today', '10:38AM', "Donald Trump Jr.'s Net Worth Is 8 Figures See His Businesses and Investments"], ['AMZN', 'Today', '10:26AM', 'Belkin integrates Amazon Buy with Prime into its e-commerce site'], ['AMZN', 'Today', '10:23AM', 'Why JPMorgan predicts an AI trade shift in 2025'], ['AMZN', 'Today', '10:19AM', 'Amazon.com (AMZN) Set for Strong 2025, Says TD Cowen Analyst'], ['AMZN', 'Today', '10:18AM', 'Nvidia stock jumps as Wall Street analysts maintain bullish outlooks'], ['AMZN', 'Today', '10:06AM', 'Why the Mag 7 trade is far from over: Chart of the Day'], ['AMZN', 'Today', '09:45AM', 'UBS Bullish on Amazon: Raises Target Price to $264'], ['AMZN', 'Today', '09:01AM', 'Buy with Prime Expands with Launch of New Brand Belkin'], ['AMZN', 'Today', '08:30AM', 'Is Amazon Stock A Buy? Tech Giant Named A Top 2025 Pick'], ['AMZN', 'Today', '08:28AM', 'AT&T is joining Amazon in ordering employees back to the office 5 days a week'], ['AMZN', 'Today', '08:15AM', '2 Stocks Set to Dominate in 2025'], ['AMZN', 'Today', '07:52AM', 'Is Amazon.com (AMZN) Among Billionaire Daniel Sundheims Stock Picks Heading Into 2025?'], ['AMZN', 'Today', '07:21AM', 'Honest Company downgraded to Hold from Buy at Loop Capital'], ['AMZN', 'Today', '07:00AM', 'Teamsters have some legal issues to overcome after wins at Amazon and with DSPs'], ['AMZN', 'Today', '07:00AM', 'Cloud AI Startup Vultr Raises $333 Million at $3.5 Billion Valuation'], ['AMZN', 'Today', '06:00AM', "Big Tech is dominating the market once again  and that's probably just fine: Morning Brief"], ['AMZN', 'Today', '06:00AM', "Nvidia's Blackwell chip could push the company into a new stratosphere as the AI revolution continues"], ['AMZN', 'Today', '05:56AM', 'AT&T is dumping hybrid work as it follows Amazon in demanding employees spend 5 days a week in office'], ['AMZN', 'Today', '05:50AM', 'Meet the 3 Artificial Intelligence (AI) Stocks Dan Ives Says Will Become The First Members of the $4 Trillion Club in 2025'], ['AMZN', 
'Today', '05:30AM', '2024 Has Been a Year of Extremes in the Stock Market. Can It Last?'], ['AMZN', 'Today', '05:00AM', 'U.S. Weighs Ban on Chinese-Made Router in Millions of American Homes'], ['AMZN', 'Today', '05:00AM', "Amazon's 10 most viral items of 2024"], ['AMZN', 'Today', '04:57AM', 'Prediction: These 2 Magnificent S&P 500 Growth Stocks Will Crush the Market Over the Next 5 Years'], ['AMZN', 'Today', '04:25AM', 'Could Shopify Be Your Ticket to Becoming a Millionaire by 2035?'], ['AMZN', 'Today', '01:51AM', 'Is Adobe (ADBE) a Safe Bet? Roth MKMs Caution Amid AI Buzz and Market Pressure'], ['AMZN', 'Today', '01:45AM', 'Applied Materials (AMAT): Assessing Roth MKMs Caution on This Semiconductor Giant Amid Industry Tensions'], ['AMZN', 'Today', '01:38AM', 'Cadence Design Systems (CDNS): Why Its Among Roth MKMs Cautious Stock Picks'], ['AMZN', 'Dec-17-24', '04:39PM', 'Nvidia stock slides amid AI spending slowdown fears, increased competition'], ['AMZN', 'Dec-17-24', '03:41PM', "Why the 'Magnificent 7' rally could be a defensive move"], ['AMZN', 'Dec-17-24', '03:30PM', "Move Over 'Rage Applying' And 'Quiet Quitting,' 2025 Will Be The Year Of 'Revenge Quitting'"],

.......................

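Convert the list above into a DataFrame

vader = SentimentIntensityAnalyzer()

columns = ['ticker', 'date', 'time', 'headline']

news_df = pd.DataFrame(news_list, columns=columns)

# Score every headline; each result is a dict with neg, neu, pos and compound
scores = news_df['headline'].apply(vader.polarity_scores).tolist()

scores_df = pd.DataFrame(scores)

news_df = news_df.join(scores_df, rsuffix='_right')

# finviz labels the current day as 'Today'; substitute the date of the scrape
# (Dec-18-24 for the run shown here) so every row carries a real date
news_df=news_df.replace('Today', 'Dec-18-24')

news_df.head()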

  ticker date    time    headline                                          neg neu    pos compound
0 AMZN Dec-18-24 10:38AM Donald Trump Jr.'s Net Worth Is 8 Figures See ... 0.0 0.853 0.147 0.2263
1 AMZN Dec-18-24 10:26AM Belkin integrates Amazon Buy with Prime into i... 0.0 0.841 0.159 0.1779
2 AMZN Dec-18-24 10:23AM Why JPMorgan predicts an AI trade shift in 2025 0.0 1.000 0.000 0.0000
3 AMZN Dec-18-24 10:19AM Amazon.com (AMZN) Set for Strong 2025, Says TD... 0.0 0.732 0.268 0.5106
4 AMZN Dec-18-24 10:18AM Nvidia stock jumps as Wall Street analysts mai... 0.0 1.000 0.000 0.0000
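The replacement date above is hardcoded for the day of this run. One way to generate the label instead is sketched below with the standard library, assuming the script runs on the same calendar day that the site calls "Today" and under the default English locale (the %b month abbreviation is locale dependent). The helper name is hypothetical.

```python
from datetime import date


def finviz_date_label(day):
    """Format a date the way the scraped table writes it, e.g. Dec-18-24."""
    return day.strftime('%b-%d-%y')


# The run shown in this article
print(finviz_date_label(date(2024, 12, 18)))

# For a fresh run, build the label from the current date
today_label = finviz_date_label(date.today())
print(today_label)
```

The resulting today_label could then be passed as the second argument of the replace call in place of the fixed string.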

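Compute and plot the mean vector sentiment scores

# Note: news_df holds headlines for every ticker in `tickers`, so this mean
# pools all of them; filter on news_df['ticker'] == 'AMZN' first for AMZN only.
mydf=pd.DataFrame(news_df, columns=['neg', 'neu','pos','compound','date'])
# Average each score column per calendar date
mydfneg=mydf.groupby(mydf['date']).mean()

mydfneg.head()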

          neg       neu      pos     compound
date    
Dec-13-24 0.061286 0.761429 0.177286 0.145314
Dec-14-24 0.026905 0.823095 0.150000 0.206933
Dec-15-24 0.031679 0.829571 0.138750 0.205850
Dec-16-24 0.021915 0.842064 0.136032 0.193571
Dec-17-24 0.055634 0.840305 0.104061 0.078788
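To make the aggregation concrete, here is a minimal pure-Python sketch of a per-day mean restricted to one ticker. The rows, the field names, and the daily_mean helper are hypothetical stand-ins for the DataFrame above, not the author's code.

```python
from collections import defaultdict


def daily_mean(rows, ticker, field):
    """Average one score field per date, using only rows for the given ticker."""
    buckets = defaultdict(list)
    for row in rows:
        if row['ticker'] == ticker:
            buckets[row['date']].append(row[field])
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}


# Hypothetical scored headlines
sample_rows = [
    {'ticker': 'AMZN', 'date': 'Dec-17-24', 'compound': 0.0},
    {'ticker': 'AMZN', 'date': 'Dec-17-24', 'compound': 0.5},
    {'ticker': 'TSLA', 'date': 'Dec-17-24', 'compound': -0.5},
    {'ticker': 'AMZN', 'date': 'Dec-18-24', 'compound': 0.5},
]
print(daily_mean(sample_rows, 'AMZN', 'compound'))
```

Filtering before averaging matters here: without the ticker check, the TSLA row would pull the Dec-17-24 mean down even though the chart is labelled AMZN.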

mydfneg['neg'].plot(label='neg')
mydfneg['pos'].plot(label='pos')
mydfneg['compound'].plot(label='compound')
plt.legend()
plt.title('AMZN Sentiment Score')
plt.grid()

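This gives the negative, positive, and compound vector sentiment scores for AMZN, shown in the figure below:

For comparison, we next plot the mean neutral sentiment score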

mydfneg['neu'].plot(label='neu')
plt.legend()
plt.title('AMZN Sentiment Score')
plt.grid()

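The neutral sentiment score for AMZN is shown in the figure below:

Now set this against what the AMZN weekly chart on TradingView shows. How do the two curves compare?

Next, let's compare the negative and positive scores for AMZN

# Combine date and time into one timestamp; the explicit format avoids
# ambiguous parsing of strings such as 'Dec-18-24 10:38AM'
news_df['date'] = pd.to_datetime(news_df.date.astype(str)+' '+news_df.time.astype(str), format='%b-%d-%y %I:%M%p')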

news_df.head()


  ticker date      time            headline                                           neg neu   pos compound
0 AMZN 2024-12-18 10:38:00 10:38AM Donald Trump Jr.'s Net Worth Is 8 Figures See ... 0.0 0.853 0.147 0.2263
1 AMZN 2024-12-18 10:26:00 10:26AM Belkin integrates Amazon Buy with Prime into i... 0.0 0.841 0.159 0.1779
2 AMZN 2024-12-18 10:23:00 10:23AM Why JPMorgan predicts an AI trade shift in 2025 0.0 1.000 0.000 0.0000
3 AMZN 2024-12-18 10:19:00 10:19AM Amazon.com (AMZN) Set for Strong 2025, Says TD... 0.0 0.732 0.268 0.5106
4 AMZN 2024-12-18 10:18:00 10:18AM Nvidia stock jumps as Wall Street analysts mai... 0.0 1.000 0.000 0.0000
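The date and time strings in this table follow a fixed pattern, so the conversion can be checked on a single row with the standard library. This is a sketch under the assumption that every stamp looks like "Dec-18-24" plus "10:38AM"; parse_stamp and STAMP_FORMAT are hypothetical names.

```python
from datetime import datetime

# %b = month abbreviation, %d = day, %y = two-digit year,
# %I:%M = 12-hour clock, %p = AM/PM marker
STAMP_FORMAT = '%b-%d-%y %I:%M%p'


def parse_stamp(date_text, time_text):
    """Combine the scraped date and time strings into one datetime."""
    return datetime.strptime(f'{date_text} {time_text}', STAMP_FORMAT)


print(parse_stamp('Dec-18-24', '10:38AM'))
print(parse_stamp('Dec-17-24', '04:39PM'))
```

An explicit format fails loudly on a stamp that does not match, such as a leftover 'Today', instead of silently guessing a date.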

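Create a bar chart of the negative scores

plt.bar(news_df['date'],news_df['neg'])

AMZN negative sentiment chart:

plt.bar(news_df['date'],news_df['neg'])
plt.bar(news_df['date'],news_df['pos'],alpha=0.3)

AMZN negative (blue) versus positive (orange) sentiment

We can combine the three scores into a single scatter plot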

plt.figure(figsize=(12,6))
plt.scatter(news_df['date'],news_df['compound'],s=news_df['pos']*200,c=news_df['neg'])
plt.grid()

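The figure below compares AMZN's weekly compound score with the positive score (marker size) and the negative score (marker color). Does this week's market sentiment for the stock start to look clear at a glance?

In the next part we will scrape the latest Google News and pull financial data through the Yahoo Finance API, giving investors timely market insight and financial information for a more accurate assessment of a stock's value. Stay tuned.

5. Summary of key points

  • Core libraries and tools: the article lists the core Python libraries and tools the analysis needs, including BeautifulSoup, nltk, and matplotlib.
  • NLP sentiment analysis: it shows in detail how to apply the VADER sentiment analysis tool to scraped news headlines and compute a weekly scalar sentiment score.
  • Weekly vector sentiment score: it extends the NLP analysis to compute the weekly vector sentiment score for AMZN stock and charts how the scores change.
  • Sentiment analysis is essential for understanding market sentiment and predicting stock price movements. Analyzing the sentiment of news headlines can capture key information that may affect the stock market.
  • Methods that integrate NLP and ML can serve as leading indicators of stock performance, helping to improve trading strategies and strengthen portfolio risk management.

Thank you for reading to the end; I hope this article has given you something new. Writing takes effort, so please like and share. I wish you success with your investments! If you have any questions about the content, leave me a comment and I will reply.


This article is for technical discussion and learning only and does not constitute investment advice.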

Published in the AI&Invest column
