Skip to content

1 第一个简单的爬虫

Lewis Zou edited this page May 19, 2020 · 9 revisions

注意,文档可能落后于代码,所有会有微小的差异,可以参考源码中的 Sample 项目一起学习使用。使用的 nuget 包也可能不是正式包。

  1. .NET Core 的 Console 项目。

  2. 添加 DotnetSpider 库

  • Visual Studio

    + 右键解决方案并启动 Manage NuGet Packages(管理NuGet包)
    + 搜索 DotnetSpider,从结果列表中选中 DotnetSpider
    + 安装到项目
    
  • Package Manager

    Install-Package DotnetSpider -Version 5.0.0-beta1
    
  • 命令行

    dotnet add package DotnetSpider --version 5.0.0-beta1
    
  1. 添加 Serilog 日志组件

    <PackageReference Include="Serilog.AspNetCore" Version="3.2.0"/>
    <PackageReference Include="Serilog.Sinks.Console" Version="3.1.1"/>
    <PackageReference Include="Serilog.Sinks.RollingFile" Version="3.3.0"/>
    <PackageReference Include="Serilog.Sinks.PeriodicBatching" Version="2.3.0"/>
  2. 创建 GithubSpider 类

        public class GithubSpider : Spider
        {
            public GithubSpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
                options, services, logger)
            {
            }
    
            protected override async Task InitializeAsync(CancellationToken stoppingToken)
            {
                // 添加自定义解析
                AddDataFlow(new Parser());
                // 使用控制台存储器
                AddDataFlow(new ConsoleStorage());
                // 添加采集请求
                await AddRequestsAsync("https://github.com/zlzforever");
            }
    
            protected override (string Id, string Name) GetIdAndName()
            {
                return (Guid.NewGuid().ToString("N"), "Github");
            }
    
            class Parser : DataParser
            {
                protected override Task Parse(DataContext context)
                {
                    var selectable = context.Selectable;
                    // 解析数据
                    var author = selectable.XPath("//span[@class='p-name vcard-fullname d-block overflow-hidden']")
                        ?.Value;
                    var name = selectable.XPath("//span[@class='p-nickname vcard-username d-block']")
                        ?.Value;
                    context.AddData("author", author);
                    context.AddData("username", name);
                    return Task.CompletedTask;
                }
            }
        }
  3. 在 Main 方法中添加如下代码

            static async Task Main(string[] args)
            {
                Log.Logger = new LoggerConfiguration()
                    .MinimumLevel.Information()
                    .MinimumLevel.Override("Microsoft.Hosting.Lifetime", LogEventLevel.Warning)
                    .MinimumLevel.Override("Microsoft", LogEventLevel.Warning)
                    .MinimumLevel.Override("System", LogEventLevel.Warning)
                    .MinimumLevel.Override("Microsoft.AspNetCore.Authentication", LogEventLevel.Warning)
                    .Enrich.FromLogContext()
                    .WriteTo.Console().WriteTo.RollingFile("logs/spiders.log")
                    .CreateLogger();
    
                var builder = Builder.CreateDefaultBuilder<GithubSpider>(options =>
                {
                    // 每秒 1 个请求
                    options.Speed = 1;
                    // 请求超时
                    options.RequestTimeout = 10;
                });
                builder.UseSerilog();
                builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
                await builder.Build().RunAsync();
                Environment.Exit(0);
            }
  4. 运行程序

    [17:36:53 INF] Argument: RequestedQueueCount, 100
    [17:36:53 INF] Argument: Depth, 0
    [17:36:53 INF] Argument: RequestTimeout, 10
    [17:36:53 INF] Argument: RetriedTimes, 3
    [17:36:53 INF] Argument: EmptySleepTime, 10
    [17:36:53 INF] Argument: Speed, 1
    [17:36:53 INF] Argument: ProxyTestUri, http://www.baidu.com
    [17:36:53 INF] Argument: ProxySupplierUri, 
    [17:36:53 INF] Argument: UseProxy, False
    [17:36:53 INF] Argument: RemoveOutboundLinks, False
    [17:36:53 INF] Argument: StorageConnectionString, 
    [17:36:53 INF] Argument: Storage, 
    [17:36:53 INF] Argument: ConnectionString, 
    [17:36:53 INF] Argument: Database, dotnetspider
    [17:36:53 INF] Argument: StorageMode, InsertIgnoreDuplicate
    [17:36:53 INF] Argument: MySqlFileType, LoadFile
    [17:36:53 INF] Argument: SqlServerVersion, V2000
    [17:36:53 INF] Argument: HBaseRestServer, 
    [17:36:53 INF] None proxy supplier
    [17:36:53 INF] Statistics service starting
    [17:36:53 INF] Agent register service starting
    [17:36:53 INF] Statistics service started
    [17:36:53 INF] Agent register service started
    [17:36:53 INF] Agent starting
    [17:36:54 INF] Initialize d9531ecc28a5492ab58e9d8b47a6bf05, Github
    [17:36:54 INF] Agent started
    [17:36:54 INF] d9531ecc28a5492ab58e9d8b47a6bf05, Github DataFlows: Parser -> ConsoleStorage
    [17:36:54 INF] Register topic DOTNET_SPIDER_D9531ECC28A5492AB58E9D8B47A6BF05
    [17:36:54 INF] d9531ecc28a5492ab58e9d8b47a6bf05, Github started
    [17:36:56 INF] https://github.com/zlzforever download success
    [{"Key":"username","Value":"zlzforever"},{"Key":"author","Value":"Lewis Zou"}]
    [17:36:58 INF] d9531ecc28a5492ab58e9d8b47a6bf05 total 1, success 1, failed 0, left 0
    [17:37:03 INF] d9531ecc28a5492ab58e9d8b47a6bf05 total 1, success 1, failed 0, left 0
    [17:37:05 INF] d9531ecc28a5492ab58e9d8b47a6bf05, Github stopping
    [17:37:05 INF] d9531ecc28a5492ab58e9d8b47a6bf05, Github stopped
    [17:37:05 INF] Agent stopping
    [17:37:05 INF] Agent stopped
    [17:37:05 INF] Agent register service stopping
    [17:37:05 INF] Agent register service stopped
    [17:37:05 INF] Statistics service stopping
    [17:37:05 INF] Statistics service stopped
    
Clone this wiki locally