【kimol君's Boring Little Inventions】: Writing an Image Downloader in Python
Tip: This article is for learning and reference only. Do not use it for any unlawful purpose~
Foreword
One quiet night, I opened my projects folder and discovered that I had written quite a lot of seemingly pointless code. A thought naturally arose: "Life is boring enough already, so why not make it a little more boring?" No sooner said than done: I'm starting a series, and we'll call it kimol君's Boring Little Inventions. Brilliant~
There are plenty of beginner crawler tutorials online, and most of them start with downloading images. What respectable person doesn't download an image now and then, right?
kimol君 is no exception. Here's a peek at the result:
1. Single-Threaded Version
Scraping this site is fairly beginner-friendly, since it involves few anti-scraping mechanisms. As far as I can tell, there are two main ones:
- The Referer header: the fix is simple. Just add this parameter to the request headers; it doesn't even need to change dynamically — a fixed value pointing at the home page is enough.
- Request rate limiting: in practice you will find that if you crawl too fast, your IP tends to get banned. Throttling the request rate appropriately, or adding a proxy pool, takes care of this.
Detailed crawler analyses of this site are a quick search away, so I'll go straight to the code:
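Both workarounds boil down to a couple of lines with requests. A minimal sketch (the one-second delay is my own guess at a safe rate; the actual request is commented out so nothing is fetched here):

```python
import time

# Fixed Referer: the site only checks that it points at the home page
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.mzitu.com',
}

for page in range(1, 4):
    url = 'https://www.mzitu.com/page/%d/' % page
    # res = requests.get(url, headers=headers)  # uncomment to actually fetch
    time.sleep(1)  # throttle: roughly one request per second keeps the IP safe
```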
```python
import re
import os
import time
import queue
import requests
from tqdm import tqdm
from termcolor import colored
from colorama import init

init(autoreset=False)

class spider_Mzidu():
    def __init__(self):
        self.url_page = 'https://www.mzitu.com/page/%d/'  # list-page URL template
        self.url_taotu = 'https://www.mzitu.com/%s'       # album-page URL template
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
                        'Accept': '*/*',
                        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
                        'Accept-Encoding': 'gzip, deflate, br',
                        'X-Requested-With': 'XMLHttpRequest',
                        'Connection': 'keep-alive',
                        'Referer': 'https://www.mzitu.com',  # fixed Referer gets past the hotlink check
                        }
        self.p_id = r'<span><a href="https://www.mzitu.com/(\d*?)" target="_blank">(.*?)</a></span>'
        self.p_imgurl = r'<img class="blur" src="(.*?)"'
        self.p_page = r'…</span>.*?<span>(\d*?)</span>'
        self.queue_id = queue.Queue()  # (album ID, album name) pairs waiting to be downloaded

    def getPages(self):
        # Total number of list pages currently on the site
        res = requests.get(self.url_page % 1, headers=self.headers)
        html = res.text
        N = re.findall(r'class="page-numbers dots">[\s\S]*?>(\d*?)</a>[\s\S]*?"next page-numbers"', html)[0]
        return int(N)

    def getID(self):
        # Collect album IDs and names from the chosen range of list pages
        page_range = input('Enter the pages to crawl (e.g. 1-10): ')
        p_s = int(page_range.split('-')[0])
        p_e = int(page_range.split('-')[1])
        time.sleep(0.5)
        print(colored('Fetching album IDs'.center(50, '-'), 'green'))
        bar = tqdm(range(p_s, p_e + 1), ncols=60)
        for p in bar:
            res = requests.get(self.url_page % p, headers=self.headers)
            html = res.text
            ids = re.findall(self.p_id, html)
            for i in ids:
                self.queue_id.put(i)
            bar.set_description('page %d' % p)

    def downloadImg(self, imgurl):
        res = requests.get(imgurl, headers=self.headers)
        return res.content

    def parseTaotu(self, taotuID):
        # Parse an album page: number of images plus a URL template for them
        res = requests.get(self.url_taotu % taotuID, headers=self.headers)
        html = res.text
        page = int(re.findall(self.p_page, html)[0])
        imgurl = re.findall(self.p_imgurl, html)[0]
        imgurl = imgurl[:-6] + '%s' + imgurl[-4:]  # e.g. ...27a01.jpg -> ...27a%s.jpg
        return (imgurl, page)

    def downloadTaotu(self):
        while not self.queue_id.empty():
            taotu = self.queue_id.get()
            taotuID = taotu[0]
            taotuName = taotu[1]
            try:
                imgurl, page = self.parseTaotu(taotuID)
                path = '[P%d]' % page + taotuName
                if not os.path.exists(path):
                    os.mkdir(path)
                bar = tqdm(range(1, page + 1), ncols=50)
                for i in bar:
                    url = imgurl % (str(i).zfill(2))
                    img = self.downloadImg(url)
                    with open('./%s/%d.jpg' % (path, i), 'wb') as f:
                        f.write(img)
                print('Album "' + colored(taotuName, 'red') + '" finished')
            except Exception:
                # On any failure, wait a bit and put the album back in the queue
                time.sleep(3)
                self.queue_id.put(taotu)

    def run(self):
        os.system('cls')  # clears the console (Windows; use 'clear' elsewhere)
        print('*' * 35)
        print('*' + 'Welcome to the Mzitu downloader'.center(33) + '*')
        print('*' * 35)
        N = self.getPages()
        print(('Mzitu currently has %s pages!' % colored(N, 'red')).center(30))
        print('\n')
        self.getID()
        print('\n' + colored('Crawling albums'.center(50, '-'), 'green'))
        self.downloadTaotu()

spider = spider_Mzidu()
spider.run()
```
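One detail worth highlighting: parseTaotu() only scrapes the URL of the album's first image, then turns it into a template by slicing off the two-digit index before the extension. A toy illustration of that slicing trick, with a made-up URL:

```python
# Hypothetical first-image URL in the site's format (made-up example)
first = 'https://i.example.com/2020/04/27a01.jpg'

# Drop '01.jpg' (the last 6 chars), insert a placeholder, re-append '.jpg'
template = first[:-6] + '%s' + first[-4:]
print(template)         # https://i.example.com/2020/04/27a%s.jpg
print(template % '03')  # https://i.example.com/2020/04/27a03.jpg
```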
2. Multi-Threaded Version
Some of you are probably asking: "Single-threaded and this slow? You must be joking — I'll die of boredom waiting!"
Right this way, then. Try the multi-threaded version:
```python
import re
import os
import time
import queue
import requests
import threading
from tqdm import tqdm
from termcolor import colored
from colorama import init

init(autoreset=False)

def Get_proxy():
    # Fetch one 'ip:port' string from a proxy-pool API (URL deliberately elided)
    res = requests.get('xxxxxxxxxxxxxxxxxxx')
    html = res.text
    return html

class spider_Mzidu():
    def __init__(self):
        self.url_page = 'https://www.mzitu.com/page/%d/'  # list-page URL template
        self.url_taotu = 'https://www.mzitu.com/%s'       # album-page URL template
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
                        'Accept': '*/*',
                        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
                        'Accept-Encoding': 'gzip, deflate, br',
                        'X-Requested-With': 'XMLHttpRequest',
                        'Connection': 'keep-alive',
                        'Referer': 'https://www.mzitu.com',  # fixed Referer gets past the hotlink check
                        }
        self.p_id = r'<span><a href="https://www.mzitu.com/(\d*?)" target="_blank">(.*?)</a></span>'
        self.p_imgurl = r'<img class="blur" src="(.*?)"'
        self.p_page = r'…</span>.*?<span>(\d*?)</span>'
        self.queue_id = queue.Queue()  # shared work queue drained by all threads
        proxy = Get_proxy()
        self.proxies = {'http': 'http://' + proxy,
                        'https': 'https://' + proxy}

    def getPages(self):
        # Total number of list pages currently on the site
        res = requests.get(self.url_page % 1, headers=self.headers, proxies=self.proxies, timeout=10)
        html = res.text
        N = re.findall(r'class="page-numbers dots">[\s\S]*?>(\d*?)</a>[\s\S]*?"next page-numbers"', html)[0]
        return int(N)

    def getID(self):
        # Collect album IDs and names from the chosen range of list pages
        page_range = input('Enter the pages to crawl (e.g. 1-10): ')
        p_s = int(page_range.split('-')[0])
        p_e = int(page_range.split('-')[1])
        time.sleep(0.5)
        print(colored('Fetching album IDs'.center(50, '-'), 'green'))
        bar = tqdm(range(p_s, p_e + 1), ncols=60)
        for p in bar:
            res = requests.get(self.url_page % p, headers=self.headers, proxies=self.proxies, timeout=10)
            html = res.text
            ids = re.findall(self.p_id, html)
            for i in ids:
                self.queue_id.put(i)
            bar.set_description('page %d' % p)

    def downloadImg(self, imgurl, proxies):
        res = requests.get(imgurl, headers=self.headers, proxies=proxies, timeout=10)
        return res.content

    def parseTaotu(self, taotuID, proxies):
        # Parse an album page: number of images plus a URL template for them
        res = requests.get(self.url_taotu % taotuID, headers=self.headers, proxies=proxies, timeout=10)
        html = res.text
        page = int(re.findall(self.p_page, html)[0])
        imgurl = re.findall(self.p_imgurl, html)[0]
        imgurl = imgurl[:-6] + '%s' + imgurl[-4:]  # e.g. ...27a01.jpg -> ...27a%s.jpg
        return (imgurl, page)

    def downloadTaotu(self):
        # Worker: each thread gets its own proxy so the threads don't share one IP
        proxy = Get_proxy()
        proxies = {'http': 'http://' + proxy,
                   'https': 'https://' + proxy}
        while not self.queue_id.empty():
            taotu = self.queue_id.get()
            taotuID = taotu[0]
            taotuName = taotu[1]
            try:
                imgurl, page = self.parseTaotu(taotuID, proxies)
                path = '[P%d]' % page + taotuName
                if not os.path.exists(path):
                    os.mkdir(path)
                bar = tqdm(range(1, page + 1), ncols=50)
                for i in bar:
                    url = imgurl % (str(i).zfill(2))
                    img = self.downloadImg(url, proxies)
                    with open('./%s/%d.jpg' % (path, i), 'wb') as f:
                        f.write(img)
                print('Album "' + colored(taotuName, 'red') + '" finished')
            except Exception:
                # On failure: back off, switch to a fresh proxy, requeue the album
                time.sleep(3)
                proxy = Get_proxy()
                proxies = {'http': 'http://' + proxy,
                           'https': 'https://' + proxy}
                self.queue_id.put(taotu)

    def changeProxy(self):
        # Swap the proxy used for list-page requests (kept for completeness)
        proxy = Get_proxy()
        self.proxies = {'http': 'http://' + proxy,
                        'https': 'https://' + proxy}

    def run(self):
        os.system('cls')  # clears the console (Windows; use 'clear' elsewhere)
        print('*' * 35)
        print('*' + 'Welcome to the Mzitu downloader'.center(33) + '*')
        print('*' * 35)
        N = self.getPages()
        print(('Mzitu currently has %s pages!' % colored(N, 'red')).center(30))
        print('\n')
        self.getID()
        print('\n' + colored('Crawling albums'.center(50, '-'), 'green'))
        N_thread = 3  # number of download threads
        thread_list = []
        for i in range(N_thread):
            thread_list.append(threading.Thread(target=self.downloadTaotu))
        for t in thread_list:
            t.start()
        for t in thread_list:
            t.join()

spider = spider_Mzidu()
spider.run()
```
As the careful reader will have noticed, the multi-threaded version is structurally almost identical to the single-threaded one (which also demonstrates a pattern that makes it quick and painless to convert existing code to multi-threading later). The main differences are two:
- When calling the downloadTaotu() function, the threading module is used to start several threads, each running it concurrently against the shared queue.
- An HTTP proxy module was added. Whether to keep it is up to you, but in my tests, if you run multiple threads you really should use proxies; otherwise your IP is very likely to get banned.
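The conversion pattern (several threads draining one shared queue.Queue until it is empty) boils down to a few lines. A minimal sketch, with dummy arithmetic standing in for downloadTaotu()'s real work:

```python
import queue
import threading

def worker(q, results, lock):
    # Each thread pulls items off the shared queue until it is empty
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        with lock:                    # guard the shared list of results
            results.append(item * 2)  # stand-in for the real download work

q = queue.Queue()
for i in range(10):
    q.put(i)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all workers, just like run() does

print(sorted(results))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```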
Final Words
If you're interested in the progress bar or the colored text output used in the code, and want to make your own scripts' output a little flashier, you can refer to my other post (Python fancy colored output and progress-bar printing).
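As a taste, here is a minimal sketch using the same tqdm and termcolor calls that appear throughout the downloader (the loop just sleeps to simulate work):

```python
import time
from tqdm import tqdm
from termcolor import colored

# A 60-column progress bar over a short fake work loop
for i in tqdm(range(5), ncols=60, desc='demo'):
    time.sleep(0.05)

# A colored, centered banner like the ones run() prints
print(colored('done'.center(50, '-'), 'green'))
```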
If this article falls short anywhere, corrections and criticism are welcome!
Finally, thanks to everyone for reading patiently~
Hold on, stay a moment... move those adorable hands and leave a like before you go (๑◕ܫ←๑)