
Chapter 6 hands-on project: basic crawler #94

Open
nunu969316192 opened this issue Apr 12, 2018 · 12 comments

Comments

@nunu969316192

It seems Baidu Baike can no longer be crawled with the code from the book.

@nunu969316192
Author

I checked the code several times and found no errors, but it reports "crawl failed".
Even crawling the Baidu Baike page for '爬虫' (web crawler), the HTML comes back empty.

@ZMRWEGO

ZMRWEGO commented Apr 13, 2018

You need to analyze Baike's front-end code — it has changed since the book was written. You can refer to the code I wrote: https://gitee.com/zmrwego/webCrawler

@nunu969316192
Author

403

@ZMRWEGO

ZMRWEGO commented Apr 24, 2018

https://gitee.com/zmrwego/a_simple_reptile — this one should work now.

@chujiangke

Just migrate the code with the 2to3.py tool.
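This refers to the 2to3 script bundled with CPython (available up to 3.12), which rewrites Python 2 syntax to Python 3 in place. A minimal sketch, assuming 2to3 is on PATH; legacy_spider.py is a hypothetical file name standing in for the book's sources:

```shell
# Write a tiny Python 2 file, then let 2to3 rewrite it in place.
# -w applies the fixes to the file (a .bak backup is kept by default).
printf 'print "hello"\n' > legacy_spider.py
2to3 -w legacy_spider.py
cat legacy_spider.py
```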

@ghost

ghost commented May 16, 2018

When I open the final output, only half the data is there — e.g. I crawl 100 entries but the HTML only contains 50. If I delete the line self.datas.remove(data), all 100 show up in the HTML. I haven't figured out why. (Python rookie)
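The halved output is the classic remove-while-iterating pitfall: deleting the current element from a list inside a `for` loop shifts the remaining elements left, so the iterator skips every other item. A minimal sketch:

```python
# Removing the current element from a list while iterating over it
# shifts later elements left, so the loop visits only every other item.
datas = list(range(10))
visited = []
for data in datas:
    visited.append(data)
    datas.remove(data)

print(visited)  # [0, 2, 4, 6, 8] -- only half the items were written
print(datas)    # [1, 3, 5, 7, 9] -- the skipped half stays in the buffer
```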

@chujiangke

It removed the duplicates.

@ghost

ghost commented May 17, 2018

I compared the output HTML against what's in memory — it is not removing duplicates.

@chujiangke

Do you mean the file dataoutput.py? There is no self.datas.remove(data) in it.

@ghost

ghost commented May 21, 2018

Yes, in dataoutput.py — it's the last statement inside the for data in self.datas loop.

@chujiangke

chujiangke commented May 21, 2018

Look for yourself — where is that line? Pull the latest from Git:

# coding:utf-8
import codecs
import time


class DataOutput(object):
    def __init__(self):
        # Timestamped output file, e.g. baike_2018_05_21_10_30_00.html
        self.filepath = 'baike_%s.html' % (time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime()))
        self.output_head(self.filepath)
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)
        # Flush to disk in batches to limit file I/O
        if len(self.datas) > 10:
            self.output_html(self.filepath)

    def output_head(self, path):
        '''
        Write the HTML header
        :return:
        '''
        fout = codecs.open(path, 'w', encoding='utf-8')
        fout.write("<html>")
        fout.write(r'''<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />''')
        fout.write("<body>")
        fout.write("<table>")
        fout.close()

    def output_html(self, path):
        '''
        Write the buffered data into the HTML file
        :param path: file path
        :return:
        '''
        fout = codecs.open(path, 'a', encoding='utf-8')
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'])
            fout.write("<td>%s</td>" % data['summary'])
            fout.write("</tr>")
        # Clear the buffer after the loop, instead of remove()-ing inside it
        self.datas = []
        fout.close()

    def ouput_end(self, path):
        '''
        Write the closing HTML tags
        :param path: file path
        :return:
        '''
        fout = codecs.open(path, 'a', encoding='utf-8')
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()

@ZMRWEGO

ZMRWEGO commented May 22, 2018

datas is flushed to disk once it reaches 10 items, to reduce the I/O load; the entries that had been written were then removed with datas.remove(data) — without that, the same first 10 items would be written over and over. Adding time.sleep(3) before exiting lets the data finish writing before the process closes. See here for details: https://gitee.com/zmrwego/a_simple_reptile
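The flush-then-clear alternative — the fix shown in the updated dataoutput.py above — can be sketched as follows. BufferedWriter and the batch size of 10 mirror the DataOutput pattern but are otherwise illustrative:

```python
# Illustrative sketch of the batched-write pattern: buffer records,
# flush to disk every 10, then clear the whole buffer at once
# instead of remove()-ing entries while iterating over them.
import codecs
import os
import tempfile

class BufferedWriter:
    def __init__(self, path):
        self.path = path
        self.datas = []

    def store(self, data):
        self.datas.append(data)
        if len(self.datas) >= 10:  # flush in batches to reduce file I/O
            self.flush()

    def flush(self):
        with codecs.open(self.path, 'a', encoding='utf-8') as fout:
            for data in self.datas:
                fout.write('%s\n' % data)
        self.datas = []  # safe: clear after the loop, not during it

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
w = BufferedWriter(path)
for i in range(25):
    w.store('row %d' % i)
w.flush()  # write the final partial batch before exiting

with codecs.open(path, encoding='utf-8') as f:
    n_rows = sum(1 for _ in f)
print(n_rows)  # 25 -- nothing lost, nothing duplicated
```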
