
Python Web Scraping Study Notes (Scraping Epidemic Data + Visualization)

花匠小妙招
2025-12-11 15:35

Python Web Scraping Study Notes

“The Website is the API.”
“All information in the future will be delivered through websites.”

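During this major epidemic the figures can change at any moment. This post explains how to scrape live epidemic data, analyze it, and turn it into a visualization.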

Scraping Website Data

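Epidemic sites come in two kinds. The first, like Dingxiangyuan (丁香园, https://ncov.dxy.cn/ncovh5/view/pneumonia), embeds the data in the page itself, so you can find it by right-clicking and viewing the page source.
The second, like Tencent (https://news.qq.com/zt2020/page/feiyan.htm#/?nojump=1), has the page request another URL that returns the raw data as JSON, which is then rendered into the page. On such sites you will not find the JSON in the page source; you have to look for the request in the Network tab of the browser's developer tools.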
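For the second kind of site, once you have found the JSON request in the Network tab, parsing the response is mostly a matter of walking a dictionary. Here is a minimal sketch of that step. The payload and its field names (`data`, `provinces`, `confirmed`, `dead`) are invented for illustration and do not describe any real endpoint.

```python
import json

# Hypothetical response body; the structure and field names are assumptions.
SAMPLE_RESPONSE = """
{"ret": 0,
 "data": {"provinces": [
     {"name": "Hubei", "confirmed": 120, "dead": 3},
     {"name": "Hunan", "confirmed": 45, "dead": 0}
 ]}}
"""


def extract_provinces(response_text):
    """Return (name, confirmed, dead) tuples from the JSON text."""
    payload = json.loads(response_text)
    return [
        (p["name"], p["confirmed"], p["dead"])
        for p in payload["data"]["provinces"]
    ]


rows = extract_provinces(SAMPLE_RESPONSE)
for name, confirmed, dead in rows:
    print(name, confirmed, dead)
```

With a real site you would get the text from something like `requests.get(api_url, timeout=30).text`, where `api_url` is whatever request you found in the Network tab.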

Let's start with the first kind, using Dingxiang Doctor (丁香医生) as the example (https://ncov.dxy.cn/ncovh5/view/pneumonia).

# Epidemic figures for each province in China
import re

import requests


def parse_url(page_url, f):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
    }
    try:
        r = requests.get(page_url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        html = r.text
    except requests.RequestException:
        print('Request failed')
        return  # html was never assigned, so stop here

    # Provinces use "provinceShortName" and cities use "cityName". Rename the
    # former so that one pattern matches both and no province is missed.
    html = re.sub(r'provinceShortName', 'cityName', html)
    # Keep only the part of the page that holds the data for China.
    match = re.search(r'{ window.getAreaStat =.+?window.getIndexRecommendList2', html)
    if match is None:
        print('Data block not found in the page source')
        return
    html = match.group()

    cities = re.findall(r"""
        {.*?"cityName":"(.+?)",            # city (or province) name
        "currentConfirmedCount":(-?\d+),   # currently confirmed
        "confirmedCount":(\d+),            # cumulative confirmed
        .+?"curedCount":(\d+),             # cured
        "deadCount":(\d+)                  # deaths
        """, html, re.VERBOSE | re.DOTALL)
    for city in cities:
        # Column order: name, currently confirmed, cumulative confirmed, deaths, cured
        f.write('{},{},{},{},{}\n'.format(city[0], city[1], city[2], city[4], city[3]))


def main():
    page_url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'
    with open('epidemic situation.csv', 'a', encoding='utf-8') as f:
        parse_url(page_url, f)


main()


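This code uses regular expressions. Why? Part of this site's epidemic data is present in the page source. The Inspect panel of the developer tools shows the rendered page, but requests only ever receives the raw source, so right-click and choose View Page Source to see exactly what the scraper can get. In that source the data sits as a JSON-style string rather than as HTML elements, so XPath and BeautifulSoup are a poor fit, and we use regular expressions instead.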
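To make that idea concrete, here is a minimal sketch of pulling a JSON array out of a script tag with a regular expression and then handing it to `json.loads`. The page source, the variable name `demoStat`, and the helper `extract_embedded_json` are all made up for this example; a real page will need its own pattern.

```python
import json
import re

# Invented page source: the data lives inside a <script>, not in HTML elements.
SAMPLE_SOURCE = (
    '<html><body><div id="root"></div>'
    '<script>try { window.demoStat = '
    '[{"name":"A","count":3},{"name":"B","count":-1}]}'
    'catch(e){}</script></body></html>'
)


def extract_embedded_json(source, var_name):
    """Return the parsed array assigned to window.<var_name>, or None."""
    pattern = (r'window\.' + re.escape(var_name)
               + r'\s*=\s*(\[.*?\])\s*\}\s*catch')
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


records = extract_embedded_json(SAMPLE_SOURCE, "demoStat")
print(records)
```

Once the text is valid JSON, parsing it this way is usually less fragile than capturing every field with its own regex group, though either approach depends on the page keeping its current layout.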

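So what does scraping with regular expressions look like? No rush: study the code above first, and the knowledge will sink in bit by bit through practice.
The code imports the re module that ships with Python. There are plenty of introductions to re syntax online, so I won't cover the matching rules here; I may write a separate post about them later. Here are a few of its functions: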

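re.search('pattern', 'string to search')  # scans through the string and returns the first match
re.match('pattern', 'string to search')   # unlike search, the pattern must match at the very start of the string; otherwise it returns None
re.findall(r"""
    {.*?"cityName":"(.+?)",            # city name
    "currentConfirmedCount":(-?\d+),   # currently confirmed
    "confirmedCount":(\d+),            # cumulative confirmed
    .+?"curedCount":(\d+),             # cured
    "deadCount":(\d+)                  # deaths
    """, html, re.VERBOSE | re.DOTALL)
# re.VERBOSE allows whitespace and comments inside the pattern.
# re.DOTALL makes . match every character, including newlines (excluded by default).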
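The difference between these functions is easiest to see on a small string. A minimal sketch follows; the sample text and its field names are made up for the demonstration.

```python
import re

text = 'id=7 {"cityName":"A","count":12} {"cityName":"B","count":-3}'

# search scans the whole string and stops at the first match.
first = re.search(r'"count":(-?\d+)', text)

# match only succeeds if the pattern matches at position 0.
at_start = re.match(r'"count":(-?\d+)', text)   # no match: text starts with "id="
start_ok = re.match(r'id=(\d+)', text)

# findall returns every match; with several groups it returns tuples.
# re.VERBOSE lets the pattern span lines and carry comments.
pairs = re.findall(r"""
    "cityName":"(.+?)",   # name
    "count":(-?\d+)       # count, possibly negative
    """, text, re.VERBOSE)

print(first.group(1), at_start, start_ok.group(1), pairs)
```

Note that the captured numbers come back as strings, so convert them with `int()` before doing arithmetic.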


The second kind: Tencent (https://news.qq.com/zt2020/page/feiyan.htm#/?nojump=1)
Code first:

from lxml import etree
from selenium import webdriver


def get_cities(url):
    cities = []
    try:
        # Adjust the path to your own chromedriver. Selenium 4 passes it
        # through a Service object instead of executable_path.
        driver = webdriver.Chrome(executable_path=r"D:\chromedriver.exe")
        driver.get(url)
        text = driver.page_source
        driver.quit()
    except Exception:
        print('Request failed')
        return cities  # nothing to parse
    html = etree.HTML(text)
    tbodys = html.xpath('//*[@id="listWraper"]/table[2]/tbody')
    for tbody in tbodys:
        trs = tbody.xpath('./tr')
        cities.extend(trs)
    return cities


def parse_city(city):
    area = city.xpath("string(./th)")
    Today_confirmed = city.xpath("string(./td[2]/p[2])").strip()
    Existing_confirmed = city.xpath('string(./td[1])').strip()
    try:
        cumulative_confirmed = city.xpath('./td[2]//text()')[0].strip()
    except IndexError:
        cumulative_confirmed = ''
    cure = city.xpath('string(./td[3])').strip()
    dead = city.xpath('string(./td[4])').strip()
    return {
        'area': area,
        'Today_confirmed': Today_confirmed,
        'Existing_confirmed': Existing_confirmed,
        'cumulative_confirmed': cumulative_confirmed,
        'cure': cure,
        'dead': dead,
    }


def save_data(data, f):
    # One CSV row per area
    f.write('{},{},{},{},{},{}\n'.format(
        data['area'], data['Today_confirmed'], data['Existing_confirmed'],
        data['cumulative_confirmed'], data['cure'], data['dead']))
    print('ok')


def main():
    with open('Domestic outbreak cities in China.csv', 'a', encoding='utf-8') as f:
        url = "https://news.qq.com/zt2020/page/feiyan.htm#/?nojump=1"
        cities = get_cities(url)
        for city in cities:
            data = parse_city(city)
            save_data(data, f)


main()


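This code uses Selenium, a browser automation tool. It is a more advanced scraping technique that drives a real browser much as a person would by hand. Because the browser has already rendered the page, the page source Selenium returns contains the epidemic data, so you can safely parse the document with XPath or BeautifulSoup.
I use chromedriver.exe for Google Chrome; if you need it, you can find it on my blog.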
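If you want to practice the parsing step without launching a browser, you can run the same kind of path queries against a saved snippet of rendered markup. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, which only works here because the sample is well-formed; the table layout is invented and is not the real page's structure.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample of "rendered" markup.
RENDERED = """
<div id="listWraper">
  <table>
    <tbody>
      <tr><th>Hubei</th><td>10</td><td><p>120</p><p>+2</p></td><td>100</td><td>3</td></tr>
      <tr><th>Hunan</th><td>0</td><td><p>45</p><p>+0</p></td><td>45</td><td>0</td></tr>
    </tbody>
  </table>
</div>
"""


def cell_text(element):
    """Join all text inside an element, similar to XPath's string()."""
    return "".join(element.itertext()).strip()


def parse_rows(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.findall(".//tbody/tr"):
        cells = tr.findall("td")
        rows.append({
            "area": cell_text(tr.find("th")),
            "existing": cell_text(cells[0]),
            "cumulative": cell_text(cells[1].find("p")),  # first <p> only
            "cured": cell_text(cells[2]),
            "dead": cell_text(cells[3]),
        })
    return rows


rows = parse_rows(RENDERED)
for row in rows:
    print(row)
```

Real pages are rarely well-formed XML, which is why the scripts in this post go through `lxml.etree.HTML` instead.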

If you have other questions or anything is unclear, feel free to ask. I will do my best to answer, and we can learn together!

Next up is the data visualization stage!

import time

from lxml import etree
from pyecharts import options as opts
from pyecharts.charts import Map
from selenium import webdriver


def map_visualmap(data_pair, time_format) -> Map:
    c = (
        Map(init_opts=opts.InitOpts(page_title="Epidemic Map of China", bg_color="#FDF5E6"))
        .add("Currently confirmed", data_pair=data_pair, maptype="china")
        .set_series_opts(label_opts=opts.LabelOpts(color="#8B4C39", font_size=10))
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Epidemic Map of China",
                                      subtitle="Data updated at " + time_format),
            visualmap_opts=opts.VisualMapOpts(pieces=[
                {"value": 0, "label": "None", "color": "#00ccFF"},
                {"min": 1, "max": 9, "color": "#FFCCCC"},
                {"min": 10, "max": 99, "color": "#DB5A6B"},
                {"min": 100, "max": 499, "color": "#FF6666"},
                {"min": 500, "max": 999, "color": "#CC2929"},
                {"min": 1000, "max": 9999, "color": "#8C0D0D"},
                {"min": 10000, "color": "#9d2933"},
            ], is_piecewise=True),
        )
    )
    return c


if __name__ == '__main__':
    url = "https://news.qq.com/zt2020/page/feiyan.htm#/?nojump=1"
    try:
        # Adjust the path to your own chromedriver. Selenium 4 passes it
        # through a Service object instead of executable_path.
        driver = webdriver.Chrome(executable_path=r"D:\chromedriver.exe")
        driver.get(url)
        text = driver.page_source
        driver.quit()
    except Exception:
        print('Request failed')
        raise SystemExit(1)  # text was never assigned, so stop here
    html = etree.HTML(text)
    current_data = []  # list of [area, currently confirmed] pairs
    tbodys = html.xpath('//*[@id="listWraper"]/table[2]/tbody')
    for tbody in tbodys:
        area = tbody.xpath('./tr/th/p/span/text()')[0]
        Existing_confirmed = tbody.xpath('./tr[1]/td[1]/p[1]/text()')[0]
        current_data.append([area, Existing_confirmed])
    time_format = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    map_visualmap(current_data, time_format).render("国内疫情地图.html")
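The map takes its data as a list of [name, value] pairs. If the figures were already saved to a CSV file, you could build those pairs from the file instead of scraping again. This is a minimal sketch that assumes each row starts with the area name followed by the currently confirmed count, as in the rows the first script writes; the helper `load_pairs` and the sample rows are made up.

```python
import csv
import io

# Invented sample rows: name, currently confirmed, cumulative, deaths, cured.
SAMPLE_CSV = "湖北,10,120,3,100\n湖南,0,45,0,45\n"


def load_pairs(csv_file):
    """Return [name, currently_confirmed] pairs, skipping malformed rows."""
    pairs = []
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue
        try:
            pairs.append([row[0], int(row[1])])
        except ValueError:
            continue  # count was not a number
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
print(pairs)
```

With a real file you would pass `open('epidemic situation.csv', encoding='utf-8')` in place of the `io.StringIO` object. The area names need to match the province names the map expects, so city-level rows would have to be filtered out first.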


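I am still learning and exploring this part myself. pyecharts is not complicated; it just involves a lot of parameters. Using it comes down to calling functions and passing arguments. Once you understand the difference between global options and series options and get to know a few chart types, you are set. All of this can be learned from the official documentation. It took me a long time, and a lot of videos and reading, to figure some of it out, so feel free to discuss it with me.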
Here are the key resources I found: