
File descriptions

JanMeritus edited this page Mar 2, 2017 · 3 revisions

### harvest/harvest.warc.gz

One interesting detail here is the header WARC-Record-ID: urn:uuid:3d1ee065-1ae5-469f-bc02-1a8b3bcbe6b2 in the warcinfo record, which works as a unique identifier of the WARC file. The old ARC files probably have nothing comparable.

[root@war 13]# zcat  serials/Serials-2013-07-1M_ArchiveIt/Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz  |head -n 100
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2013-07-22T15:00:02Z
WARC-Filename: Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz
WARC-Record-ID: <urn:uuid:3d1ee065-1ae5-469f-bc02-1a8b3bcbe6b2>
Content-Type: application/warc-fields
Content-Length: 704

software: Heritrix/3.1.2-SNAPSHOT-20130207.001528 http://crawler.archive.org
ip: 127.0.0.2
hostname: crawler00.webarchiv.cz
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
operator: Rudolf Kreibich
publisher: National Library of the Czech Republic - WebArchiv.cz
audience: WebArchiv.cz Users
isPartOf: Serials 2013-07-1M_ArchiveIt
description: Pravidelná sklize? semínek s m?sí?ní frekvencí a archivace semínek s nízkou frekvencí p?idaných v kv?tnu.
robots: ignore
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/3.1.2-SNAPSHOT-20130207.001528 +http://webarchiv.cz/kontakty/)
http-header-from: [email protected]



WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:botany.cz
WARC-Date: 2013-07-22T15:00:01Z
WARC-IP-Address: 195.113.132.45
WARC-Record-ID: <urn:uuid:3341dff8-1253-47b7-8b59-c4a67916a8d5>
Content-Type: text/dns
Content-Length: 50

20130722150001
botany.cz.              1800    IN      A       81.2.225.176


WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:apatykar.info
WARC-Date: 2013-07-22T15:00:01Z
WARC-IP-Address: 195.113.132.45
WARC-Record-ID: <urn:uuid:97a2ea9c-7fcc-4643-ac8a-f4222c2ae65e>
Content-Type: text/dns
Content-Length: 56

20130722150001
apatykar.info.          3600    IN      A       217.198.114.12
...
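
The identifier mentioned above can be read without any WARC library, because the warcinfo record is the first record in the file and its headers are plain text. The following is only a minimal sketch under that assumption; the function names `read_first_warc_headers` and `warc_file_id` are made up for illustration, and the in-memory sample is a shortened copy of the headers shown in the dump.

```python
import gzip
import io


def read_first_warc_headers(stream):
    """Return the named headers of the first WARC record as a dict."""
    version = stream.readline().decode("utf-8", "replace").strip()
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)
    headers = {}
    for raw in stream:
        line = raw.decode("utf-8", "replace").rstrip("\r\n")
        if not line:  # a blank line ends the header block
            break
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    return headers


def warc_file_id(path_or_fileobj):
    """Return the warcinfo WARC-Record-ID without angle brackets, or None."""
    with gzip.open(path_or_fileobj, "rb") as stream:
        headers = read_first_warc_headers(stream)
    if headers.get("WARC-Type") != "warcinfo":
        return None  # assumption: a missing warcinfo record means no file-level ID
    return headers.get("WARC-Record-ID", "").strip("<>") or None


# Small in-memory demonstration (hypothetical, shortened record).
sample = gzip.compress(
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Date: 2013-07-22T15:00:02Z\r\n"
    b"WARC-Record-ID: <urn:uuid:3d1ee065-1ae5-469f-bc02-1a8b3bcbe6b2>\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
print(warc_file_id(io.BytesIO(sample)))
```

For a real file, the same function can be called with the path to the `.warc.gz` instead of the in-memory buffer.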

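### harvest/logs/harvest_harvest.xml

Contains metadata for the individual files, including MD5 checksums. The catch is that we already have the MD5 checksums in storage, in the linuxfixity files, but this METS file also carries a CREATED timestamp next to each MD5 (in the sample below it matches the timestamp in the WARC file name).

Extremely important for the LTP storage.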

[root@war 13]# cat serials/Serials-2013-07-1M_ArchiveIt/logs/Serials-2013-07-1M_ArchiveItharvest.xml
<?xml version="1.0" encoding="UTF-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/XMLSchema.xsd http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <mets:metsHdr ROLE="CREATOR" TYPE="ORGANIZATION">
    <mets:name>ABA001</mets:name>
  </mets:metsHdr>
  <mets:dmdSec ID="DCMD_CRAWL_0001">
    <mets:mdWrap DMTYPE="DC" MIMETYPE="text/xml">
      <mets:xmlData xmlns:dc="http://purl.org/dc/elements/1.1/">
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
          <dc:title>Serials-2013-07-1M_ArchiveIt</dc:title>
          <dc:type>Crawl</dc:type>
          <dc:audience>ABA 001</dc:audience>
          <dc:identifier>f5nf6u6f-wmc7-l08s-qg4q-bkuzoclgbf9j</dc:identifier>
        </oai_dc:dc>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp ID="WARCSGROUP" Use="Warcs">
      <mets:file CHECKSUM="7da1eef75c626e6601ab5eb8083f1bf9" CHECKSUMTYPE="MD5" CREATED="2013-07-24T15:28:02+00:00" ID="&amp;lt;urn:uuid:fbae3970-2063-4f51-a28b-b16bcf6048bb&amp;gt;" MIMETYPE="application/X-warc" SEQ="1" SIZE="355891615">
        <mets:FLocat LOCTYPE="URL" xlink:href="Serials-2013-07-1M_ArchiveIt-20130724152802065-00014-5644~crawler00.webarchiv.cz~7778.warc.gz"/>
      </mets:file>
      <mets:file CHECKSUM="c1676c6d9c6d3bc0355d1805e31e2f5a" CHECKSUMTYPE="MD5" CREATED="2013-07-23T04:08:07+00:00" ID="&amp;lt;urn:uuid:40f98543-0775-4cef-9c85-7a01944aa83a&amp;gt;" MIMETYPE="application/X-warc" SEQ="2" SIZE="1000071094">
        <mets:FLocat LOCTYPE="URL" xlink:href="Serials-2013-07-1M_ArchiveIt-20130723040807094-00010-5637~crawler02.webarchiv.cz~7778.warc.gz"/>
      </mets:file>
...
      <mets:file CHECKSUM="6b2236f33a843ccf78b8c2567d5efba8" CHECKSUMTYPE="MD5" CREATED="2013-07-23T22:34:39+00:00" ID="&amp;lt;urn:uuid:13b06fc6-c7a1-4aaa-a287-c7e4253617f8&amp;gt;" MIMETYPE="application/X-warc" SEQ="44" SIZE="701420231">
        <mets:FLocat LOCTYPE="URL" xlink:href="Serials-2013-07-1M_ArchiveIt-20130723223439571-00013-5644~crawler00.webarchiv.cz~7778.warc.gz"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
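
Since the METS file lists a checksum, size and timestamp for every WARC, it can drive a simple fixity check. The sketch below is only an illustration under the assumption that the WARC files sit in one directory and that `xlink:href` holds a bare file name, as in the sample; `list_mets_files`, `md5_of` and `verify_against_mets` are hypothetical helper names, not part of any existing tool.

```python
import hashlib
import os
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def list_mets_files(mets_xml):
    """Return one dict per mets:file with name, md5, created and size."""
    root = ET.fromstring(mets_xml)
    entries = []
    for node in root.iterfind(".//mets:file", NS):
        flocat = node.find("mets:FLocat", NS)
        entries.append({
            "name": flocat.get(XLINK_HREF),
            "md5": node.get("CHECKSUM"),
            "created": node.get("CREATED"),
            "size": int(node.get("SIZE")),
        })
    return entries


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks, so large WARCs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_mets(entries, warc_dir):
    """Return (file name, problem) pairs; an empty list means all files match."""
    problems = []
    for entry in entries:
        path = os.path.join(warc_dir, entry["name"])
        if not os.path.isfile(path):
            problems.append((entry["name"], "missing"))
        elif os.path.getsize(path) != entry["size"]:
            problems.append((entry["name"], "size mismatch"))
        elif md5_of(path) != entry["md5"]:
            problems.append((entry["name"], "md5 mismatch"))
    return problems
```

The size is compared before the MD5 so that an obviously truncated file is reported without hashing hundreds of megabytes.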

### harvest/logs/crawl/harvest_crawler-hostname.tar.gz

There is no content dump here because the file is compressed; it should essentially contain the directory structure of the Heritrix job logs. That structure changed a lot over time and does not exist for the historical harvests (i.e. only from sometime around 2014/5). It might be worth making it searchable whether a harvest has logs and, if so, what they contain at the level of file names. One use case is finding out whether harvest X has a crawl-report, which is not generated for crashed harvests but was sometimes produced manually afterwards.

[root@war 13]# tar -ztvf serials/Serials-2013-07-1M_ArchiveIt/logs/crawl/Serials-2013-07-1M_ArchiveIt-crawler00.tar.gz
drwxr-xr-x heritrix/users    0 2013-08-05 11:14 20130722145132/
-rw-r--r-- heritrix/users 55854 2013-07-22 16:47 20130722145132/crawler-beans.cxml
-rw-r--r-- heritrix/users  2113 2013-07-22 16:51 20130722145132/negative-surts.dump
drwxr-xr-x heritrix/users     0 2013-07-22 16:51 20130722145132/actions-done/
-rw-r--r-- heritrix/users  1625 2013-03-28 14:38 20130722145132/negative-surts.txt
-rw-r--r-- heritrix/users  7502 2013-07-22 16:44 20130722145132/seeds.txt
-rw-r--r-- heritrix/users   660 2013-07-25 22:26 20130722145132/job.log
-rw-r--r-- heritrix/users  8735 2013-07-22 16:51 20130722145132/surts.dump
-rw-r--r-- heritrix/users   209 2013-03-28 14:23 20130722145132/surts.txt
drwxr-xr-x heritrix/users     0 2013-07-25 22:26 20130722145132/reports/
-rw-r--r-- heritrix/users    13 2013-07-25 22:26 20130722145132/reports/threads-report.txt
-rw-r--r-- heritrix/users 15043 2013-07-25 22:26 20130722145132/reports/seeds-report.txt
-rw-r--r-- heritrix/users   458 2013-07-25 22:26 20130722145132/reports/crawl-report.txt
-rw-r--r-- heritrix/users 47433 2013-07-25 22:26 20130722145132/reports/frontier-summary-report.txt
-rw-r--r-- heritrix/users  2261 2013-07-25 22:26 20130722145132/reports/processors-report.txt
-rw-r--r-- heritrix/users  3375 2013-07-25 22:26 20130722145132/reports/mimetype-report.txt
-rw-r--r-- heritrix/users 133780 2013-07-25 22:26 20130722145132/reports/hosts-report.txt
-rw-r--r-- heritrix/users    160 2013-07-25 22:26 20130722145132/reports/responsecode-report.txt
-rw-r--r-- heritrix/users     58 2013-07-25 22:26 20130722145132/reports/source-report.txt
drwxr-xr-x heritrix/users      0 2013-08-05 11:14 20130722145132/logs/
-rw-r--r-- heritrix/users 91129110 2013-07-25 22:26 20130722145132/logs/frontier.recover.gz
-rw-r--r-- heritrix/users  1570510 2013-07-24 05:03 20130722145132/logs/uri-errors.log
-rw-r--r-- heritrix/users  2486623 2013-07-25 22:26 20130722145132/logs/progress-statistics.log
-rw-r--r-- heritrix/users     1063 2013-07-22 17:00 20130722145132/logs/alerts.log
-rw-r--r-- heritrix/users     1420 2013-07-22 17:00 20130722145132/logs/runtime-errors.log
-rw-r--r-- heritrix/users 10781093 2013-07-25 21:28 20130722145132/logs/nonfatal-errors.log
-rw-r--r-- heritrix/users 321991016 2013-07-25 22:26 20130722145132/logs/crawl.log
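
The use case described above, recording which log files a harvest has and whether a crawl-report is among them, needs only the member names of the archive, not its contents. A minimal sketch, assuming the report is always named `crawl-report.txt` as in the listing; the function name `summarize_log_archive` and the shape of the returned dict are hypothetical.

```python
import posixpath
import tarfile


def summarize_log_archive(path):
    """List regular files in a log tarball and flag the crawl report."""
    with tarfile.open(path, mode="r:gz") as archive:
        files = sorted(m.name for m in archive.getmembers() if m.isfile())
    base_names = {posixpath.basename(name) for name in files}
    return {
        "files": files,
        "has_crawl_report": "crawl-report.txt" in base_names,
    }
```

Only the tar headers are inspected, so even archives that contain a large `crawl.log` are not unpacked to disk, although the gzip stream still has to be read through once.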

### harvest/logs/dmdsec/Mets_FQDN.xml

There is probably nothing here that needs to be extracted for Grainer's purposes.

[root@war 13]# cat serials/Serials-2013-07-1M_ArchiveIt/logs/dmdsec/Mets_aerofilms.cz.xml
<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.4" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">
  <subject authority="Konspekt">
    <topic>Film. Cirkus. Lidová zábava</topic>
  </subject>
  <classification authority="Konspekt">791</classification>
  <subject authority="Conspectus">
    <topic>Public Performances</topic>
  </subject>
  <classification authority="Conspectus">791</classification>
  <titleInfo>
    <title>Aerofilms</title>
    <subTitle>filmová distribuční společnost</subTitle>
  </titleInfo>
  <titleInfo type="alternative">
    <title>Aerofilmscz</title>
  </titleInfo>
  <name type="corporate">
    <namePart>Aerofilms (firma)</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
  <genre authority="marcgt">web site</genre>
  <genre authority="czenas">www dokumenty</genre>
  <genre authority="eczenas">www documents</genre>
  <originInfo>
    <place>
      <placeTerm type="code" authority="marccountry">xr</placeTerm>
    </place>
    <place>
      <placeTerm type="text">Praha]</placeTerm>
    </place>
    <publisher>Aerofilms</publisher>
    <dateIssued>[2005?]-</dateIssued>
    <dateIssued encoding="marc" point="start">2005</dateIssued>
    <dateIssued encoding="marc" point="end">9999</dateIssued>
    <issuance>integrating resource</issuance>
    <frequency authority="marcfrequency">Vychází nepravidelně</frequency>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">cze</languageTerm>
  </language>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
  <physicalDescription>
    <form authority="marcform">electronic</form>
    <form authority="gmd">elektronický zdroj</form>
    <form authority="marccategory">electronic resource</form>
    <form authority="marcsmd">remote</form>
    <internetMediaType>text/html</internetMediaType>
  </physicalDescription>
  <abstract>Stránky české filmové distribuční společnosti. Výroční zprávy a detailní informace o distribuovaných českých a zahraničních filmových titulech</abstract>
  <note type="statement of responsibility" altRepGroup="00"/>
  <note>Název ze zdrojového kódu (verze z 13.10.2009)</note>
  <note type="system details">Způsob přístupu: World Wide Web</note>
  <subject authority="czenas">
    <name type="corporate">
      <namePart>Aerofilms (firma)</namePart>
    </name>
  </subject>
  <subject authority="czenas">
    <topic>filmová distribuce</topic>
  </subject>
  <subject authority="czenas">
    <topic>filmy</topic>
  </subject>
  <subject>
    <topic>film distribution</topic>
  </subject>
  <subject>
    <topic>films</topic>
  </subject>
  <subject authority="czenas">
    <geographic>Česko</geographic>
  </subject>
  <subject>
    <geographic>Czechia</geographic>
  </subject>
  <classification authority="udc" edition="MRF">791.64</classification>
  <classification authority="udc" edition="MRF">791.2</classification>
  <classification authority="udc" edition="MRF">(437.3)</classification>
  <classification authority="udc" edition="MRF">(0.034.2)004.738.12</classification>
  <location>
    <url displayLabel="electronic resource" usage="primary display">http://aerofilms.cz</url>
  </location>
  <identifier type="ccnb">cnb002006013</identifier>
  <recordInfo>
    <descriptionStandard>aacr</descriptionStandard>
    <recordContentSource authority="marcorg">ABA001</recordContentSource>
    <recordCreationDate encoding="marc">091013</recordCreationDate>
    <recordChangeDate encoding="iso8601">20131030095652.0</recordChangeDate>
    <recordIdentifier source="CZ PrNK">web20092006013</recordIdentifier>
    <recordOrigin>Converted from MARCXML to MODS version 3.4 using MARC21slim2MODS3-4.xsl
                                (Revision 1.86 2013/06/10)</recordOrigin>
    <languageOfCataloging>
      <languageTerm authority="iso639-2b" type="code">cze</languageTerm>
    </languageOfCataloging>
  </recordInfo>
</mods>

### harvest/logs/index/harvest.cdx

Every ARC/WARC should have one CDX. A CDX contains URLs and a reference to the ARC/WARC in which the data for each URL is stored. A check confirming that every ARC/WARC has its own CDX file would therefore be useful. IMHO it would also be good to add an agent field to the JSON, recording which software, in which version, created the CDX and when.

[root@war 13]# head serials/Serials-2013-07-1M_ArchiveIt/logs/index/Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz.cdx
dns:botany.cz 20130722150001 dns:botany.cz text/dns - 5Y2B6OFG7QSZRAL32JOOLZMXSMHCAJTY - 640 Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz
dns:apatykar.info 20130722150001 dns:apatykar.info text/dns - SOS477VWLODFJ5GSUS6OUH323TZVPJTD - 878 Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz
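
The proposed check and the agent field could look roughly like this. It is a sketch under two assumptions taken from the sample above: the CDX file is named after the WARC with `.cdx` appended, and all CDX files live in one index directory. The function names and the JSON field names (`agent`, `software`, `version`, `created`) are hypothetical, since the page does not define the JSON layout.

```python
import json
import os


def find_warcs_without_cdx(warc_dir, index_dir):
    """Return names of ARC/WARC files that have no '<name>.cdx' in index_dir."""
    missing = []
    for name in sorted(os.listdir(warc_dir)):
        if name.endswith((".warc.gz", ".arc.gz")):
            if not os.path.isfile(os.path.join(index_dir, name + ".cdx")):
                missing.append(name)
    return missing


def cdx_json_entry(warc_name, software, version, created):
    """Build a JSON-ready dict describing a CDX and the agent that made it."""
    return {
        "warc": warc_name,
        "cdx": warc_name + ".cdx",
        "agent": {  # hypothetical field layout
            "software": software,
            "version": version,
            "created": created,
        },
    }


if __name__ == "__main__":
    # Placeholder values only; the real indexer name and version are not known here.
    entry = cdx_json_entry("example.warc.gz", "some-indexer", "0.0", "2017-03-02T00:00:00Z")
    print(json.dumps(entry, indent=2))
```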