File descriptions
Interestingly, the warcinfo record carries e.g. WARC-Record-ID: urn:uuid:3d1ee065-1ae5-469f-bc02-1a8b3bcbe6b2, which is a unique identifier of the WARC file. The old ARC files probably do not have one.
[root@war 13]# zcat serials/Serials-2013-07-1M_ArchiveIt/Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz |head -n 100
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2013-07-22T15:00:02Z
WARC-Filename: Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz
WARC-Record-ID: <urn:uuid:3d1ee065-1ae5-469f-bc02-1a8b3bcbe6b2>
Content-Type: application/warc-fields
Content-Length: 704
software: Heritrix/3.1.2-SNAPSHOT-20130207.001528 http://crawler.archive.org
ip: 127.0.0.2
hostname: crawler00.webarchiv.cz
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
operator: Rudolf Kreibich
publisher: National Library of the Czech Republic - WebArchiv.cz
audience: WebArchiv.cz Users
isPartOf: Serials 2013-07-1M_ArchiveIt
description: Pravidelná sklizeň semínek s měsíční frekvencí a archivace semínek s nízkou frekvencí přidaných v květnu.
robots: ignore
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/3.1.2-SNAPSHOT-20130207.001528 +http://webarchiv.cz/kontakty/)
http-header-from: [email protected]
WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:botany.cz
WARC-Date: 2013-07-22T15:00:01Z
WARC-IP-Address: 195.113.132.45
WARC-Record-ID: <urn:uuid:3341dff8-1253-47b7-8b59-c4a67916a8d5>
Content-Type: text/dns
Content-Length: 50
20130722150001
botany.cz. 1800 IN A 81.2.225.176
WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:apatykar.info
WARC-Date: 2013-07-22T15:00:01Z
WARC-IP-Address: 195.113.132.45
WARC-Record-ID: <urn:uuid:97a2ea9c-7fcc-4643-ac8a-f4222c2ae65e>
Content-Type: text/dns
Content-Length: 56
20130722150001
apatykar.info. 3600 IN A 217.198.114.12
...
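The identifier mentioned above can be read without unpacking the whole file, because the warcinfo record comes first. The following is a minimal sketch using only the Python standard library; the function names `read_first_record_headers` and `warc_record_id` are illustrative and not part of any existing tool, and the sketch assumes the first record is the warcinfo record, as in the listing above.

```python
import gzip


def read_first_record_headers(stream):
    """Return the named header fields of the first WARC record as a dict.

    `stream` is a binary file-like object positioned at the start of the
    (already decompressed) WARC data.
    """
    version = stream.readline().strip()
    if not version.startswith(b"WARC/"):
        raise ValueError("stream does not start with a WARC record")
    headers = {}
    for raw in stream:
        line = raw.rstrip(b"\r\n")
        if not line:
            # An empty line ends the header block of the record.
            break
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(b":")
        headers[name.decode("utf-8").strip()] = value.decode("utf-8").strip()
    return headers


def warc_record_id(path_or_fileobj):
    """Return the WARC-Record-ID of the first record, without the <> brackets."""
    with gzip.open(path_or_fileobj, "rb") as stream:
        headers = read_first_record_headers(stream)
    return headers["WARC-Record-ID"].strip("<>")
```

Calling `warc_record_id()` on a `.warc.gz` path reads only the first few hundred bytes of decompressed data, so it stays cheap even for files close to 1 GB.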
### harvest/logs/harvest_harvest.xml
Contains metadata for the individual files, including MD5 checksums. The catch is that the storage already holds the MD5 checksums in the linuxfixity files, but the METS here also records when each MD5 was created.
This is extremely important for the LTP storage.
[root@war 13]# cat serials/Serials-2013-07-1M_ArchiveIt/logs/Serials-2013-07-1M_ArchiveItharvest.xml
<?xml version="1.0" encoding="UTF-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/XMLSchema.xsd http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
<mets:metsHdr ROLE="CREATOR" TYPE="ORGANIZATION">
<mets:name>ABA001</mets:name>
</mets:metsHdr>
<mets:dmdSec ID="DCMD_CRAWL_0001">
<mets:mdWrap DMTYPE="DC" MIMETYPE="text/xml">
<mets:xmlData xmlns:dc="http://purl.org/dc/elements/1.1/">
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
<dc:title>Serials-2013-07-1M_ArchiveIt</dc:title>
<dc:type>Crawl</dc:type>
<dc:audience>ABA 001</dc:audience>
<dc:identifier>f5nf6u6f-wmc7-l08s-qg4q-bkuzoclgbf9j</dc:identifier>
</oai_dc:dc>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:fileSec>
<mets:fileGrp ID="WARCSGROUP" Use="Warcs">
<mets:file CHECKSUM="7da1eef75c626e6601ab5eb8083f1bf9" CHECKSUMTYPE="MD5" CREATED="2013-07-24T15:28:02+00:00" ID="&lt;urn:uuid:fbae3970-2063-4f51-a28b-b16bcf6048bb&gt;" MIMETYPE="application/X-warc" SEQ="1" SIZE="355891615">
<mets:FLocat LOCTYPE="URL" xlink:href="Serials-2013-07-1M_ArchiveIt-20130724152802065-00014-5644~crawler00.webarchiv.cz~7778.warc.gz"/>
</mets:file>
<mets:file CHECKSUM="c1676c6d9c6d3bc0355d1805e31e2f5a" CHECKSUMTYPE="MD5" CREATED="2013-07-23T04:08:07+00:00" ID="&lt;urn:uuid:40f98543-0775-4cef-9c85-7a01944aa83a&gt;" MIMETYPE="application/X-warc" SEQ="2" SIZE="1000071094">
<mets:FLocat LOCTYPE="URL" xlink:href="Serials-2013-07-1M_ArchiveIt-20130723040807094-00010-5637~crawler02.webarchiv.cz~7778.warc.gz"/>
</mets:file>
...
<mets:file CHECKSUM="6b2236f33a843ccf78b8c2567d5efba8" CHECKSUMTYPE="MD5" CREATED="2013-07-23T22:34:39+00:00" ID="&lt;urn:uuid:13b06fc6-c7a1-4aaa-a287-c7e4253617f8&gt;" MIMETYPE="application/X-warc" SEQ="44" SIZE="701420231">
<mets:FLocat LOCTYPE="URL" xlink:href="Serials-2013-07-1M_ArchiveIt-20130723223439571-00013-5644~crawler00.webarchiv.cz~7778.warc.gz"/>
</mets:file>
</mets:fileGrp>
</mets:fileSec>
</mets:mets>
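To make the per-file metadata usable (checksum, checksum creation time, size), the `mets:file` elements can be read with the standard library XML parser. This is a minimal sketch, not an existing extraction tool: the function name `mets_file_entries` and the keys of the returned dictionaries are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def mets_file_entries(xml_bytes):
    """Return one dict per <mets:file> element found in the METS document."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        size = file_el.get("SIZE")
        entries.append({
            # File name as stored in xlink:href of the FLocat child.
            "name": flocat.get(XLINK_HREF) if flocat is not None else None,
            "checksum": file_el.get("CHECKSUM"),
            "checksum_type": file_el.get("CHECKSUMTYPE"),
            # Timestamp from the CREATED attribute, kept as a string.
            "created": file_el.get("CREATED"),
            "size": int(size) if size is not None else None,
            # The ID attribute is wrapped in escaped <> in the source file.
            "id": file_el.get("ID", "").strip("<>"),
        })
    return entries
```

The `created` value is left as a string on purpose, so the caller can decide how to parse or compare the timestamps against the linuxfixity files.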
There is no direct listing because the file is compressed; in effect it should contain the directory structure of the Heritrix job logs. That structure changed a lot over time, and for the historical harvests it does not exist (i.e. not until sometime around 2014/5). It might be worth making it possible to look up whether a harvest has logs and, if so, what they contain at the level of file names. One use case is checking whether harvest X has a crawl-report, which is not generated for harvests that crashed but was sometimes added manually afterwards.
[root@war 13]# tar -ztvf serials/Serials-2013-07-1M_ArchiveIt/logs/crawl/Serials-2013-07-1M_ArchiveIt-crawler00.tar.gz
drwxr-xr-x heritrix/users 0 2013-08-05 11:14 20130722145132/
-rw-r--r-- heritrix/users 55854 2013-07-22 16:47 20130722145132/crawler-beans.cxml
-rw-r--r-- heritrix/users 2113 2013-07-22 16:51 20130722145132/negative-surts.dump
drwxr-xr-x heritrix/users 0 2013-07-22 16:51 20130722145132/actions-done/
-rw-r--r-- heritrix/users 1625 2013-03-28 14:38 20130722145132/negative-surts.txt
-rw-r--r-- heritrix/users 7502 2013-07-22 16:44 20130722145132/seeds.txt
-rw-r--r-- heritrix/users 660 2013-07-25 22:26 20130722145132/job.log
-rw-r--r-- heritrix/users 8735 2013-07-22 16:51 20130722145132/surts.dump
-rw-r--r-- heritrix/users 209 2013-03-28 14:23 20130722145132/surts.txt
drwxr-xr-x heritrix/users 0 2013-07-25 22:26 20130722145132/reports/
-rw-r--r-- heritrix/users 13 2013-07-25 22:26 20130722145132/reports/threads-report.txt
-rw-r--r-- heritrix/users 15043 2013-07-25 22:26 20130722145132/reports/seeds-report.txt
-rw-r--r-- heritrix/users 458 2013-07-25 22:26 20130722145132/reports/crawl-report.txt
-rw-r--r-- heritrix/users 47433 2013-07-25 22:26 20130722145132/reports/frontier-summary-report.txt
-rw-r--r-- heritrix/users 2261 2013-07-25 22:26 20130722145132/reports/processors-report.txt
-rw-r--r-- heritrix/users 3375 2013-07-25 22:26 20130722145132/reports/mimetype-report.txt
-rw-r--r-- heritrix/users 133780 2013-07-25 22:26 20130722145132/reports/hosts-report.txt
-rw-r--r-- heritrix/users 160 2013-07-25 22:26 20130722145132/reports/responsecode-report.txt
-rw-r--r-- heritrix/users 58 2013-07-25 22:26 20130722145132/reports/source-report.txt
drwxr-xr-x heritrix/users 0 2013-08-05 11:14 20130722145132/logs/
-rw-r--r-- heritrix/users 91129110 2013-07-25 22:26 20130722145132/logs/frontier.recover.gz
-rw-r--r-- heritrix/users 1570510 2013-07-24 05:03 20130722145132/logs/uri-errors.log
-rw-r--r-- heritrix/users 2486623 2013-07-25 22:26 20130722145132/logs/progress-statistics.log
-rw-r--r-- heritrix/users 1063 2013-07-22 17:00 20130722145132/logs/alerts.log
-rw-r--r-- heritrix/users 1420 2013-07-22 17:00 20130722145132/logs/runtime-errors.log
-rw-r--r-- heritrix/users 10781093 2013-07-25 21:28 20130722145132/logs/nonfatal-errors.log
-rw-r--r-- heritrix/users 321991016 2013-07-25 22:26 20130722145132/logs/crawl.log
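The use case described above, finding out whether a harvest's log archive contains a crawl-report, can be answered from the tar member names alone. Below is a minimal sketch with the standard `tarfile` module; the function names `list_log_members` and `has_report` are hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def list_log_members(path=None, fileobj=None):
    """Return the names of all regular files in a gzip-compressed tar archive."""
    with tarfile.open(name=path, fileobj=fileobj, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def has_report(member_names, report="crawl-report.txt"):
    """Tell whether any member has the given file name, in any directory."""
    return any(PurePosixPath(name).name == report for name in member_names)
```

For the archive listed above, `has_report(list_log_members("…-crawler00.tar.gz"))` would be expected to return `True`, since `reports/crawl-report.txt` is present. Storing the full list of member names alongside the harvest would also cover the more general question of what the logs contain.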
There is probably nothing here that needs to be extracted for Grainery's purposes.
[root@war 13]# cat serials/Serials-2013-07-1M_ArchiveIt/logs/dmdsec/Mets_aerofilms.cz.xml
<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.4" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">
<subject authority="Konspekt">
<topic>Film. Cirkus. Lidová zábava</topic>
</subject>
<classification authority="Konspekt">791</classification>
<subject authority="Conspectus">
<topic>Public Performances</topic>
</subject>
<classification authority="Conspectus">791</classification>
<titleInfo>
<title>Aerofilms</title>
<subTitle>filmová distribuční společnost</subTitle>
</titleInfo>
<titleInfo type="alternative">
<title>Aerofilmscz</title>
</titleInfo>
<name type="corporate">
<namePart>Aerofilms (firma)</namePart>
</name>
<typeOfResource>text</typeOfResource>
<genre authority="marcgt">web site</genre>
<genre authority="czenas">www dokumenty</genre>
<genre authority="eczenas">www documents</genre>
<originInfo>
<place>
<placeTerm type="code" authority="marccountry">xr</placeTerm>
</place>
<place>
<placeTerm type="text">Praha]</placeTerm>
</place>
<publisher>Aerofilms</publisher>
<dateIssued>[2005?]-</dateIssued>
<dateIssued encoding="marc" point="start">2005</dateIssued>
<dateIssued encoding="marc" point="end">9999</dateIssued>
<issuance>integrating resource</issuance>
<frequency authority="marcfrequency">Vychází nepravidelně</frequency>
</originInfo>
<language>
<languageTerm authority="iso639-2b" type="code">cze</languageTerm>
</language>
<language>
<languageTerm authority="iso639-2b" type="code">eng</languageTerm>
</language>
<physicalDescription>
<form authority="marcform">electronic</form>
<form authority="gmd">elektronický zdroj</form>
<form authority="marccategory">electronic resource</form>
<form authority="marcsmd">remote</form>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract>Stránky české filmové distribuční společnosti. Výroční zprávy a detailní informace o distribuovaných českých a zahraničních filmových titulech</abstract>
<note type="statement of responsibility" altRepGroup="00"/>
<note>Název ze zdrojového kódu (verze z 13.10.2009)</note>
<note type="system details">Způsob přístupu: World Wide Web</note>
<subject authority="czenas">
<name type="corporate">
<namePart>Aerofilms (firma)</namePart>
</name>
</subject>
<subject authority="czenas">
<topic>filmová distribuce</topic>
</subject>
<subject authority="czenas">
<topic>filmy</topic>
</subject>
<subject>
<topic>film distribution</topic>
</subject>
<subject>
<topic>films</topic>
</subject>
<subject authority="czenas">
<geographic>Česko</geographic>
</subject>
<subject>
<geographic>Czechia</geographic>
</subject>
<classification authority="udc" edition="MRF">791.64</classification>
<classification authority="udc" edition="MRF">791.2</classification>
<classification authority="udc" edition="MRF">(437.3)</classification>
<classification authority="udc" edition="MRF">(0.034.2)004.738.12</classification>
<location>
<url displayLabel="electronic resource" usage="primary display">http://aerofilms.cz</url>
</location>
<identifier type="ccnb">cnb002006013</identifier>
<recordInfo>
<descriptionStandard>aacr</descriptionStandard>
<recordContentSource authority="marcorg">ABA001</recordContentSource>
<recordCreationDate encoding="marc">091013</recordCreationDate>
<recordChangeDate encoding="iso8601">20131030095652.0</recordChangeDate>
<recordIdentifier source="CZ PrNK">web20092006013</recordIdentifier>
<recordOrigin>Converted from MARCXML to MODS version 3.4 using MARC21slim2MODS3-4.xsl
(Revision 1.86 2013/06/10)</recordOrigin>
<languageOfCataloging>
<languageTerm authority="iso639-2b" type="code">cze</languageTerm>
</languageOfCataloging>
</recordInfo>
</mods>
Every ARC/WARC should have one CDX. A CDX contains URLs and a reference to the ARC/WARC in which the data for each URL is stored. A check confirming that every ARC/WARC has its own CDX file would therefore be useful. IMHO it would also be good to add an agent field to the JSON, recording which software, in which version, created the CDX and when.
[root@war 13]# head serials/Serials-2013-07-1M_ArchiveIt/logs/index/Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz.cdx
dns:botany.cz 20130722150001 dns:botany.cz text/dns - 5Y2B6OFG7QSZRAL32JOOLZMXSMHCAJTY - 640 Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz
dns:apatykar.info 20130722150001 dns:apatykar.info text/dns - SOS477VWLODFJ5GSUS6OUH323TZVPJTD - 878 Serials-2013-07-1M_ArchiveIt-20130722150002470-00000-5644~crawler00.webarchiv.cz~7778.warc.gz
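The check suggested above, that every ARC/WARC has its own CDX, can be sketched as a comparison of file names. The sketch relies on the naming seen in this harvest, where the CDX is the WARC file name plus `.cdx` and lives in `logs/index/`; the function names and the `.arc.gz` suffix for old ARC files are assumptions.

```python
from pathlib import Path

CDX_SUFFIX = ".cdx"


def warcs_without_cdx(warc_names, cdx_names):
    """Return the sorted ARC/WARC file names that have no matching CDX file."""
    # "X.warc.gz.cdx" indexes "X.warc.gz": drop the trailing ".cdx".
    indexed = {name[:-len(CDX_SUFFIX)] for name in cdx_names if name.endswith(CDX_SUFFIX)}
    return sorted(name for name in warc_names if name not in indexed)


def check_harvest(harvest_dir):
    """Run the check for one harvest directory laid out as in the examples above."""
    root = Path(harvest_dir)
    warcs = [p.name for p in root.iterdir() if p.name.endswith((".warc.gz", ".arc.gz"))]
    cdxs = [p.name for p in (root / "logs" / "index").glob("*" + CDX_SUFFIX)]
    return warcs_without_cdx(warcs, cdxs)
```

An empty result from `check_harvest("serials/Serials-2013-07-1M_ArchiveIt")` would mean every archive file in that harvest is indexed. This compares names only; it does not verify that a CDX actually points into its WARC.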
Procedure for extracting operational metadata
Grainery frontend