
XPath

XPath: Scraping Bilibili Ranking Data

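I used to parse scraped data with my clumsy regular expressions, and everyone who saw them just shook their head.
So I spent some time learning XPath. It turned out to be quick to pick up and not nearly as complicated as I had expected.

This time I chose the Bilibili ranking page.

Target URL: the Bilibili ranking page

First press F12 to inspect the structure of the data we want to scrape.

The structure is clean and easy to read at a glance.

So let's start writing code.

First, import the packages we will use.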

# lxml provides the XPath support
from lxml import etree
import requests

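To be safe, disguise the request with browser-like headers first.

In the developer tools, find the User-Agent, referer and cookie values and fill them in below.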

headers = {
    'User-Agent': '',
    'referer': '',
    'cookie': ''
}

Add the target link.

url = ''

Now fetch the page and parse the data.

response = requests.get(url, headers=headers, timeout=5)
# Check whether the request succeeded
if response.status_code == 200:
    # Build an XPath-capable tree; etree.HTML also repairs malformed HTML automatically
    html = etree.HTML(response.text)
    # Title
    title_list = html.xpath('//div[@class="info"]/a/text()')
    # Link
    link_list = html.xpath('//div[@class="info"]/a/@href')
    # Overall score
    hot_list = html.xpath('//div[@class="pts"]/div/text()')
    # View count
    watch_list = html.xpath('//div[@class="detail"]/span[1]/text()')
    # Danmaku (bullet comment) count
    barrage_list = html.xpath('//div[@class="detail"]/span[2]/text()')
    # Author
    author_list = html.xpath('//div[@class="detail"]/a/span/text()')
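The page-wide queries above return one flat list per field, so if a single entry lacks a field, every later value in that list shifts up by one position. A minimal sketch of an alternative is to select each entry first and run XPath relative to it. Only the class names "info", "detail" and "pts" come from this article; the <li class="rank-item"> wrapper, the example.com links and the helper names first_text and parse_items are assumptions made for illustration, not Bilibili's real markup.

```python
from lxml import etree

# Hypothetical markup: the <li class="rank-item"> wrapper is an assumption.
SAMPLE_HTML = """
<ul>
  <li class="rank-item">
    <div class="info"><a href="//example.com/video/a">First video</a></div>
    <div class="detail"><span> 100 </span><span> 20 </span><a><span> Alice </span></a></div>
    <div class="pts"><div>999</div></div>
  </li>
  <li class="rank-item">
    <div class="info"><a href="//example.com/video/b">Second video</a></div>
    <div class="detail"><span> 50 </span><a><span> Bob </span></a></div>
    <div class="pts"><div>500</div></div>
  </li>
</ul>
"""


def first_text(node, expr, default=""):
    """Return the first stripped string result of a relative XPath, or a default."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


def parse_items(html_text):
    """Extract one dict per entry, querying relative to each entry node."""
    tree = etree.HTML(html_text)
    items = []
    for node in tree.xpath('//li[@class="rank-item"]'):
        items.append({
            "title": first_text(node, './/div[@class="info"]/a/text()'),
            "link": first_text(node, './/div[@class="info"]/a/@href'),
            "watch": first_text(node, './/div[@class="detail"]/span[1]/text()'),
            "barrage": first_text(node, './/div[@class="detail"]/span[2]/text()'),
            "author": first_text(node, './/div[@class="detail"]/a/span/text()'),
            "hot": first_text(node, './/div[@class="pts"]/div/text()'),
        })
    return items


items = parse_items(SAMPLE_HTML)
```

In this sample the second entry has no second span, so its "barrage" value is an empty string, while its author and score still belong to the right entry.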

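When working with XPath, you can debug expressions directly in the browser: open the developer tools (F12), press Ctrl+F in the Elements panel, and type the expression into the search box.
Matching nodes are highlighted as you type, and the expressions do not need to be long or cryptic.

Next, combine the scraped data, format it, and save it to a file. I saved mine to the D: drive.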

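zip() takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns an iterator over those tuples (not a list) and stops at the shortest input.
.strip() removes leading and trailing whitespace.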
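A small illustration of both helpers, using made-up values rather than scraped data. The names titles, views, rows and leftover are chosen here for the example only.

```python
titles = ["A", "B", "C"]
views = [" 10 ", " 20 "]  # one value is missing on purpose

pairs = zip(titles, views)
is_list = isinstance(pairs, list)  # zip() returns an iterator in Python 3

# Pairing stops at the shortest input, so "C" is silently dropped.
rows = [(title, view.strip()) for title, view in pairs]

# The iterator is now exhausted; iterating it again yields nothing.
leftover = list(pairs)
```

Because the shorter list wins without any warning, it is worth checking that all the scraped lists have the same length before zipping them. From Python 3.10 onward, zip(..., strict=True) raises an error on a length mismatch instead.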

    # enumerate() supplies the rank; list.index() would repeat a rank when two titles are identical
    for rank, (title, watch, barrage, author, hot, link) in enumerate(
            zip(title_list, watch_list, barrage_list, author_list, hot_list, link_list), start=1):
        line = "{0}、{1}\n\t[👀{2}][📺{3}][🧑{4}][🔥{5}]\n\t👉http:{6}".format(
            rank, title, watch.strip(), barrage.strip(), author.strip(), hot, link)
        # Append the formatted entry to the output file
        with open('D:/bilibili排行榜.txt', 'a+', encoding='utf-8') as file:
            file.write('\n' + line + '\n')
    print('Collection complete')
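Two details of the formatting step can be sketched in isolation. Numbering with list.index() returns the position of the first match, so duplicate titles get the same rank, whereas enumerate() counts every entry. Prefixing "http:" only works while every href starts with "//", and urljoin from the standard library resolves such links against a base URL instead. The base URL, sample data and function name format_entries below are hypothetical placeholders, not values from the real page.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com/"  # placeholder base; an assumption for this sketch


def format_entries(titles, links):
    """Number entries with enumerate() and resolve links against BASE_URL."""
    lines = []
    for rank, (title, link) in enumerate(zip(titles, links), start=1):
        lines.append("{0}. {1} -> {2}".format(rank, title, urljoin(BASE_URL, link)))
    return lines


titles = ["Same title", "Other", "Same title"]
links = ["//example.com/video/a", "//example.com/video/b", "//example.com/video/c"]

# list.index() always finds the first occurrence, so the third rank is wrong.
index_ranks = [titles.index(title) + 1 for title in titles]

entries = format_entries(titles, links)
```

Here index_ranks contains a repeated rank for the duplicate title, while entries numbers all three items in order and gives each link the scheme of the base URL.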

Here is the complete code.

from lxml import etree
import requests

headers = {
    'User-Agent': '',
    'referer': '',
    'cookie': ''
}
url = ''

try:
    response = requests.get(url, headers=headers, timeout=5)
    if response.status_code == 200:
        # Build an XPath-capable tree; etree.HTML also repairs malformed HTML automatically
        html = etree.HTML(response.text)
        # div[contains(@class,"carousel-item")]
        # Title
        title_list = html.xpath('//div[@class="info"]/a/text()')
        # Link
        link_list = html.xpath('//div[@class="info"]/a/@href')
        # Overall score
        hot_list = html.xpath('//div[@class="pts"]/div/text()')
        # View count
        watch_list = html.xpath('//div[@class="detail"]/span[1]/text()')
        # Danmaku (bullet comment) count
        barrage_list = html.xpath('//div[@class="detail"]/span[2]/text()')
        # Author
        author_list = html.xpath('//div[@class="detail"]/a/span/text()')
        # enumerate() supplies the rank; list.index() would repeat a rank for duplicate titles
        for rank, (title, watch, barrage, author, hot, link) in enumerate(
                zip(title_list, watch_list, barrage_list, author_list, hot_list, link_list), start=1):
            line = "{0}、{1}\n\t[👀{2}][📺{3}][🧑{4}][🔥{5}]\n\t👉http:{6}".format(
                rank, title, watch.strip(), barrage.strip(), author.strip(), hot, link)
            with open('D:/bilibili排行榜.txt', 'a+', encoding='utf-8') as file:
                file.write('\n' + line + '\n')
        print('Collection complete')
except Exception as error:
    print(error)

Let's take a look at the result.

It looks pretty good to me.

That's all.


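Copyright notice: This article, titled "Xpath", was contributed voluntarily by a community member, and the views expressed are the author's own. To reprint it, contact the author and cite the source: http://it.en369.cn/jiaocheng/1706778242a422681.html. This site only provides information storage services, does not own the content, and assumes no related legal liability. Any content found to involve plagiarism, infringement, or a violation of laws or regulations will be removed promptly once verified.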
