Preface

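I had long heard how powerful Python is for web scraping and wanted to give it a try. After a successful purchase on 并夕夕 (Pinduoduo) last time, I bought another book there: 《Python3网络爬虫开发实践》. Emmmm, it did not go well. The book is not beginner-friendly and is packed with jargon, so I turned to bilibili, the well-known learning site that even CCTV has praised, and finally got started!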

bilibili link: Strongly recommended tutorial for Python beginners: getting started with web scraping

Here are my notes from this learning session:

Code

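import requests  # requests fetches a page's HTML over HTTP
import re  # re provides the regular expressions used to extract data

url = 'http://www.ishisetianxia.com/chaojishenxiang/'  # chapter index page of the novel Yuan Zun (元尊)

output = open('YuanZun.out', 'w', buffering=1, encoding='utf-8')  # file the novel is saved to

response = requests.get(url)  # request the index page
response.encoding = 'utf-8'  # decode the response body as UTF-8
html = response.text  # .text is the decoded body, i.e. the page's HTML source

datas = re.findall(r'<dd><a href="(.*?)" target="_blank">(.*?)</a></dd>', html, re.S)
# Use a regular expression to capture each chapter's URL and title;
# findall returns a list of (href, title) tuples, stored in datas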

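for one in datas:  # iterate over every chapter
    new_url = "http://www.ishisetianxia.com%s" % one[0]  # build the full chapter URL
    new_response = requests.get(new_url)  # request the chapter page
    new_response.encoding = 'utf-8'  # decode the response body as UTF-8
    new_html = new_response.text  # HTML source of this chapter
    # The pattern must be a single string literal; a raw string cannot span lines
    new_datas = re.findall(
        r'<div id="BookText">(.*?)<script type="text/javascript" src="/tb.js"></script>',
        new_html, re.S)[0]
    # Use a regular expression to capture the chapter text
    new_datas = new_datas.replace(' ', '')
    new_datas = new_datas.replace("['", '')
    new_datas = new_datas.replace("']", '')
    new_datas = new_datas.replace("<p>", '')
    new_datas = new_datas.replace("</p>", '')
    new_datas = new_datas.replace("&nbsp;", '')
    new_datas = new_datas.replace("<!--nextpage-->", '')
    new_datas = new_datas.replace("本章完看《元尊》，就在www.ishisetianxia.com", '')
    # Use replace to strip leftover characters and markup from the text
    output.write(str(one[1]))
    output.write(str(new_datas))
    # Write the chapter title and the chapter text to the file

output.close()  # close the file once every chapter has been written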
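
The two regex steps can also be tried offline, without hitting the site. Below is a minimal sketch, not the tutorial's method: the sample HTML strings and the helper names `extract_chapters` and `extract_text` are made up for illustration, and the markup is only assumed to look like what the patterns in the script expect. It also swaps the chain of `replace` calls for `re.sub` plus `html.unescape`, and uses `urljoin` instead of string formatting to build the chapter URL.

```python
import html
import re
from urllib.parse import urljoin

# Hypothetical sample markup, shaped like what the script's patterns expect.
SAMPLE_INDEX = (
    '<dl>'
    '<dd><a href="/chaojishenxiang/1.html" target="_blank">Chapter 1</a></dd>'
    '<dd><a href="/chaojishenxiang/2.html" target="_blank">Chapter 2</a></dd>'
    '</dl>'
)
SAMPLE_CHAPTER = (
    '<div id="BookText"><p>&nbsp;&nbsp;First line.</p>'
    '<p>Second line.<!--nextpage--></p>'
    '<script type="text/javascript" src="/tb.js"></script></div>'
)

# Same patterns as in the script, compiled once.
LINK_RE = re.compile(
    r'<dd><a href="(.*?)" target="_blank">(.*?)</a></dd>', re.S)
TEXT_RE = re.compile(
    r'<div id="BookText">(.*?)'
    r'<script type="text/javascript" src="/tb.js"></script>', re.S)


def extract_chapters(index_html, base_url):
    """Return (absolute_url, title) pairs found in the index page."""
    return [(urljoin(base_url, href), title)
            for href, title in LINK_RE.findall(index_html)]


def extract_text(chapter_html):
    """Return the cleaned chapter text, or None if the pattern does not match."""
    match = TEXT_RE.search(chapter_html)
    if match is None:
        return None
    body = match.group(1)
    body = re.sub(r'<!--.*?-->', '', body, flags=re.S)  # drop HTML comments
    body = body.replace('</p>', '\n')  # end of paragraph becomes a newline
    body = re.sub(r'<[^>]+>', '', body)  # drop any remaining tags
    body = html.unescape(body)  # turn entities such as &nbsp; into characters
    body = body.replace('\xa0', '')  # remove the non-breaking spaces
    return body.strip()
```

Calling `extract_chapters(SAMPLE_INDEX, 'http://www.ishisetianxia.com/chaojishenxiang/')` gives the absolute chapter URLs with their titles, and `extract_text(SAMPLE_CHAPTER)` gives the chapter text with one paragraph per line. Returning `None` on a failed match avoids the `IndexError` that indexing `[0]` on an empty `findall` result would raise. Unlike the script, this sketch keeps ordinary spaces in the text.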

That's roughly it. Web scraping is actually quite complicated and will take a lot more study, but I'll keep at it!