```python
import os

# Every file in gs_output is named after a crawled domain, with the dots
# replaced by underscores, so reversing that gives the finished domains.
domains_ok = []
for filename in os.listdir('gs_output'):
    domain_ok = filename.replace('_', '.')
    domains_ok.append(domain_ok)

# Collect the alive URLs whose domain has no output file yet; those still
# need to be crawled.
domains_not_ok = set()
with open('sub_alive.txt', 'r') as f:
    urls = f.read().split()
for url in urls:
    flag = True
    for domain_ok in domains_ok:
        if domain_ok in url:
            flag = False
            break
    if flag:
        domains_not_ok.add(url)

with open('gs_continue.txt', 'w') as f:
    f.write('\n'.join(domains_not_ok))
```
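One caveat with the substring check above: `domain_ok in url` can match unintended hosts (for example, `example.com` would also match `https://notexample.com/`). A minimal sketch of a stricter check, assuming the URLs in sub_alive.txt carry a scheme, could look like this; `needs_crawl` is a hypothetical helper, not part of the original script:

```python
from urllib.parse import urlparse

def needs_crawl(url, done_domains):
    """Return True if no finished domain matches this URL's hostname."""
    host = urlparse(url).hostname or ''  # hostname is None without a scheme
    # Count a URL as done only on an exact hostname match, or when the
    # host is a subdomain of a finished domain.
    return not any(host == d or host.endswith('.' + d) for d in done_domains)
```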
I want to crawl about 2000+ URLs, so I ended up writing this script to produce a file of the URLs that have not been crawled yet, because gospider keeps getting killed.

Could this be improved so it stops getting killed by the system? Rerunning it this way, I have restarted roughly 10 times already and these 2000 URLs are still not finished.

For now I am working around it with systemd: the service auto-restarts, and the list of uncrawled domains is refreshed on each restart (see the sketch below). Looking forward to a code update.
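For reference, the systemd workaround could look roughly like the unit below. This is only a sketch under assumptions: the unit name, the paths, and the wrapper script `make_continue.py` (the Python script above saved to a file) are all hypothetical; `-S` (sites list file) and `-o` (output folder) are gospider's documented flags.

```ini
# /etc/systemd/system/gospider-crawl.service  (hypothetical name and paths)
[Unit]
Description=Resumable gospider crawl over the remaining URLs

[Service]
WorkingDirectory=/opt/crawl
# Rebuild gs_continue.txt from gs_output before every (re)start, so each
# restart only crawls what is still missing.
ExecStartPre=/usr/bin/python3 /opt/crawl/make_continue.py
ExecStart=/usr/bin/gospider -S gs_continue.txt -o gs_output
# Restart whenever gospider exits abnormally, e.g. after an OOM kill.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```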