
Chapter 6 hands-on project: basic crawler #94

Open
nunu969316192 opened this issue Apr 12, 2018 · 12 comments

Comments

@nunu969316192

It seems Baidu Baike can no longer be crawled with the code from the book.

@nunu969316192
Author

I checked the code several times and found no errors, but it reports "crawl failed".
Even crawling the Baidu Baike page for '爬虫' (web crawler), the HTML comes back empty.

@ZMRWEGO

ZMRWEGO commented Apr 13, 2018

You need to analyze Baike's front-end code — it has changed since the book was written. You can refer to the code I wrote: https://gitee.com/zmrwego/webCrawler

@nunu969316192
Author

403

@ZMRWEGO

ZMRWEGO commented Apr 24, 2018

https://gitee.com/zmrwego/a_simple_reptile — this one should work now.

@chujiangke

Just migrate the code with the 2to3.py tool.
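This refers to the 2to3 script bundled with CPython (available up to 3.12), which rewrites Python 2 syntax to Python 3 in place. A minimal sketch, assuming 2to3 is on PATH; legacy_spider.py is a hypothetical file name standing in for the book's sources:

```shell
# Write a tiny Python 2 file, then let 2to3 rewrite it in place.
# -w applies the fixes to the file (a .bak backup is kept by default).
printf 'print "hello"\n' > legacy_spider.py
2to3 -w legacy_spider.py
cat legacy_spider.py
```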

@ghost

ghost commented May 16, 2018

When I open the final output, only half the data is there — e.g. I crawl 100 entries but the HTML only contains 50. If I delete the line self.datas.remove(data), all 100 show up in the HTML. I haven't figured out why. (Python rookie)
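The halved output is the classic remove-while-iterating pitfall: deleting the current element from a list inside a `for` loop shifts the remaining elements left, so the iterator skips every other item. A minimal sketch:

```python
# Removing the current element from a list while iterating over it
# shifts later elements left, so the loop visits only every other item.
datas = list(range(10))
visited = []
for data in datas:
    visited.append(data)
    datas.remove(data)

print(visited)  # [0, 2, 4, 6, 8] -- only half the items were written
print(datas)    # [1, 3, 5, 7, 9] -- the skipped half stays in the buffer
```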

@chujiangke

It removed the duplicates.

@ghost

ghost commented May 17, 2018

I compared the output HTML against what's in memory — it is not removing duplicates.

@chujiangke

Do you mean the file dataoutput.py? There is no self.datas.remove(data) in it.

@ghost

ghost commented May 21, 2018

Yes, in dataoutput.py — it's the last statement inside the for data in self.datas loop.

@chujiangke

chujiangke commented May 21, 2018

Look for yourself — where is that line? Pull the latest from Git:

# coding:utf-8
import codecs
import time


class DataOutput(object):
    def __init__(self):
        # Timestamped output file, e.g. baike_2018_05_21_10_30_00.html
        self.filepath = 'baike_%s.html' % (time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime()))
        self.output_head(self.filepath)
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)
        # Flush to disk in batches to limit file I/O
        if len(self.datas) > 10:
            self.output_html(self.filepath)

    def output_head(self, path):
        '''
        Write the HTML header
        :return:
        '''
        fout = codecs.open(path, 'w', encoding='utf-8')
        fout.write("<html>")
        fout.write(r'''<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />''')
        fout.write("<body>")
        fout.write("<table>")
        fout.close()

    def output_html(self, path):
        '''
        Write the buffered data into the HTML file
        :param path: file path
        :return:
        '''
        fout = codecs.open(path, 'a', encoding='utf-8')
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'])
            fout.write("<td>%s</td>" % data['summary'])
            fout.write("</tr>")
        # Clear the buffer after the loop, instead of remove()-ing inside it
        self.datas = []
        fout.close()

    def ouput_end(self, path):
        '''
        Write the closing HTML tags
        :param path: file path
        :return:
        '''
        fout = codecs.open(path, 'a', encoding='utf-8')
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()

@ZMRWEGO

ZMRWEGO commented May 22, 2018

datas is flushed to disk once it reaches 10 items, to reduce the I/O load; the entries that had been written were then removed with datas.remove(data) — without that, the same first 10 items would be written over and over. Adding time.sleep(3) before exiting lets the data finish writing before the process closes. See here for details: https://gitee.com/zmrwego/a_simple_reptile
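The flush-then-clear alternative — the fix shown in the updated dataoutput.py above — can be sketched as follows. BufferedWriter and the batch size of 10 mirror the DataOutput pattern but are otherwise illustrative:

```python
# Illustrative sketch of the batched-write pattern: buffer records,
# flush to disk every 10, then clear the whole buffer at once
# instead of remove()-ing entries while iterating over them.
import codecs
import os
import tempfile

class BufferedWriter:
    def __init__(self, path):
        self.path = path
        self.datas = []

    def store(self, data):
        self.datas.append(data)
        if len(self.datas) >= 10:  # flush in batches to reduce file I/O
            self.flush()

    def flush(self):
        with codecs.open(self.path, 'a', encoding='utf-8') as fout:
            for data in self.datas:
                fout.write('%s\n' % data)
        self.datas = []  # safe: clear after the loop, not during it

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
w = BufferedWriter(path)
for i in range(25):
    w.store('row %d' % i)
w.flush()  # write the final partial batch before exiting

with codecs.open(path, encoding='utf-8') as f:
    n_rows = sum(1 for _ in f)
print(n_rows)  # 25 -- nothing lost, nothing duplicated
```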
