Web Scraping 01: Fetching Page Source and Selecting Page Content

2020-02-18


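Summary: Use urlopen to fetch a page's source code, then extract content from it with regular expressions (regex), which is a fairly primitive approach.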

1. Fetch a page's HTML with urlopen:

from urllib.request import urlopen
# read() returns bytes; if the page contains Chinese text, decode() it with the page's encoding
html = urlopen(
    "https://vip.stock.finance.sina.com.cn/corp/view/vCB_AllNewsStock.php?symbol=sh601318&Page=1"
).read().decode('gb2312') 
# print(html)
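The one-liner above can be wrapped in a small helper. The following is only a sketch under a few assumptions: the name `fetch_html`, the 10-second timeout, and the `User-Agent` value are illustrative choices, not part of the original post. `errors="replace"` is added because pages that declare gb2312 sometimes contain characters outside that character set, which would otherwise raise `UnicodeDecodeError`.

```python
from urllib.request import Request, urlopen


def fetch_html(url, encoding="gb2312", timeout=10):
    """Download url and return its body decoded with the given encoding."""
    # Some sites reject urllib's default User-Agent; this value is only an example.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The with-block closes the connection once the body has been read.
    with urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")


# Example call (needs network access):
# html = fetch_html("https://vip.stock.finance.sina.com.cn/corp/view/vCB_AllNewsStock.php?symbol=sh601318&Page=1")
```

If decoding with gb2312 produces replacement characters, trying "gbk" (a superset of gb2312) may be worth a test.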

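The encoding passed to decode() should match the encoding declared in the page's own source, for example the charset given in its <meta> tag.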
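Rather than reading the encoding off the page source by hand, it can be looked up in the downloaded bytes. This is a minimal sketch, assuming the page declares its encoding in a `<meta>` tag near the top; the helper name `detect_charset`, the 2048-byte search window, and the utf-8 fallback are all illustrative assumptions.

```python
import re

# Matches <meta charset="gbk"> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_CHARSET_RE = re.compile(rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w-]+)", re.IGNORECASE)


def detect_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag near the start of raw, else default."""
    # Only the beginning of the document is searched; 2048 bytes is an arbitrary limit.
    match = _CHARSET_RE.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

With the raw bytes from `urlopen(...).read()` in hand, the result can be passed straight to decode(), e.g. `raw.decode(detect_charset(raw))`.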

2. Extract page content with the re module:

import re

# Non-greedy capture of the text between <title> and </title>
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is:", res[0])

The output is:

Page title is: 中国平安(601318)个股资讯_新浪财经_新浪网
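The pattern above works for this page, but `res[0]` raises `IndexError` when no title is found, and `.` does not match line breaks by default. A slightly more defensive variant might look like the sketch below; the function name `extract_title` is hypothetical and not from the original post.

```python
import re

# DOTALL lets "." span line breaks, IGNORECASE accepts <TITLE>,
# and [^>]* tolerates attributes inside the opening tag.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the stripped text of the first <title> element, or None if absent."""
    match = _TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()
```

Returning None instead of raising leaves it to the caller to decide how to handle a page without a title.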

Author: 随风而行
Link: http://yoursite.com/2020/02/18/爬虫-01获取网页源代码及选择网页内容/
Copyright: Unless otherwise stated, all posts on this blog are licensed under the MIT License. Please credit the source when reposting!