Scrapy only when keyword found
Simon Hardy committed Feb 2, 2018
1 parent 1fd5382 commit b2e4e40
Showing 5 changed files with 22 additions and 7 deletions.
4 changes: 4 additions & 0 deletions keyword.xml
@@ -0,0 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><Partners><value>STIFTELSEN SINTEF</value><value>IWW RHEINISCH WESTFALISCHES INSTITUT FUR WASSERFORSCHUNG GEMEINNUTZIGE GMBH</value><value>CETAQUA, CENTRO TECNOLOGICO DEL AGUA, FUNDACION PRIVADA</value><value>KWR WATER B.V.</value><value>FUNDACIO EURECAT</value><value>TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY</value><value>ATOS SPAIN SA</value><value>MEKOROT WATER COMPANY LIMITED</value><value>AIGUES DE BARCELONA, EMPRESA METROPOLITANA DE GESTIO DEL CICLE INTEGRAL DE L'AIGUA SA</value><value>HESSENWASSER GMBH &amp; CO. KG</value><value>OSLO KOMMUNE</value><value>INSTITUTE OF COMMUNICATION AND COMPUTER SYSTEMS</value><value>BERGEN KOMMUNE</value><value>BERLINER WASSERBETRIEBE</value><value>EUROPEAN WATER SUPPLY AND SANITATION TECHNOLOGY PLATFORM</value><value>PNO INNOVATION</value><value>BEIT TOCHNA APLICATZIA LTD</value><value>EMPRESA MUNICIPAL DE ABASTECIMIENTO Y SANEAMIENTO DE GRANADA SA</value><value>WORLDSENSING SL</value><value>RISA SICHERHEITSANALYSEN GMBH</value><value>MNEMONIC AS</value><value>VLAAMSE MAATSCHAPPIJ VOORWATERVOORZIENING CVBA</value></Partners><EU_Contribution><value>EUR 8 255 319,50</value></EU_Contribution><Project_Title><value>Strategic, Tactical, Operational Protection of water Infrastructure against cyber-physical Threats</value></Project_Title><Total_Cost><value>EUR 9 616 525,18</value></Total_Cost><Country><value>Norway</value><value>Germany</value><value>Spain</value><value>Netherlands</value><value>Spain</value><value>Israel</value><value>Spain</value><value>Israel</value><value>Spain</value><value>Germany</value><value>Norway</value><value>Greece</value><value>Norway</value><value>Germany</value><value>Belgium</value><value>Belgium</value><value>Israel</value><value>Spain</value><value>Spain</value><value>Germany</value><value>Norway</value><value>Belgium</value></Country><Activity><value>Research Organisations</value><value>Research Organisations</value><value>Research Organisations</value><value>Private for-profit entities 
(excluding Higher or Secondary Education Establishments)</value><value>Research Organisations</value><value>Higher or Secondary Education Establishments</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Public bodies (excluding Research Organisations and Secondary or Higher Education Establishments)</value><value>Research Organisations</value><value>Public bodies (excluding Research Organisations and Secondary or Higher Education Establishments)</value><value>Public bodies (excluding Research Organisations and Secondary or Higher Education Establishments)</value><value>Other</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value><value>Private for-profit entities (excluding Higher or Secondary Education Establishments)</value></Activity><Technology_Description>Water critical infrastructures (CIs) are essential for human society, life and health and they can be endangered by physical/cyber threats with severe societal consequences. To address this, STOP-IT assembles a team of major Water Utilities, industrial technology developers, high tech SMEs and top EU R&amp;D providers. 
It organizes communities of practice for water systems protection to identify current and future risk landscapes and to co-develop an all-hazards risk management framework for the physical and cyber protection of water CIs. Prevention, Detection, Response and Mitigation of relevant risks at strategic, tactical and operational levels of planning will be taken into account to generate modular solutions (technologies, tools and guidelines) and an integrated software platform. STOP-IT solutions are based on: a) mature technologies improved via their combination and embedment (incl. public warning systems, smart locks) and b) novel technologies whose TRL will be increased (incl. cyber threat incident services, secure wireless sensor communications modules, context-aware anomaly detection technologies; fault-tolerant control strategies for SCADA integrated sensors, high-volume real-time sensor data protection via blockchain schemes; authorization engines; irregular human detection using new computer vision methods and WiFi and efficient water contamination detection algorithms). STOP-IT solutions are demonstrated through a front-runner/follower approach where 4 advanced utilities, Aigües de Barcelona (ES), Berliner Wasserbetriebe (DE), MEKOROT (IL) and Oslo VAV (NO) are twinned with 4 less advanced, but ambitious ones, to stimulate mutual learning, transfer and uptake. 
Building on this solid basis STOP-IT delivers high impact through the creation of hands-on training, best practice guidelines, support for certification and standardization as well as by fostering market opportunities, also leveraging the EU water technology platform's multi-stakeholder network.</Technology_Description><To><value>2021-05-31, ongoing project</value></To><From><value>2017-06-01</value></From><Meta><value>&lt;meta name="WT.cg_s" content="H2020-EU.3.7.4., H2020-EU.3.7.2."&gt;</value></Meta><Topic_s><value>CIP-01-2016-2017 - Prevention, detection, response and mitigation of the combination of physical and cyber threats to the critical infrastructure of Europe.</value></Topic_s><Funding_scheme><value>IA - Innovation action</value></Funding_scheme><Coordinated_in><value>Norway</value></Coordinated_in><Project_ACR><value>STOP-IT</value></Project_ACR><Call_for_Proposal><value>CIP-2016-2017-1</value></Call_for_Proposal><Project_ID><value>740610</value></Project_ID></item>
</items>
7 changes: 6 additions & 1 deletion spiders/back-up.txt
@@ -2,13 +2,18 @@
import scrapy
from scrapy.loader import ItemLoader
from CORDIS.items import CordisItem
from scrapy.spider import BaseSpider

class CordisSpider(scrapy.Spider):
    name = 'cordis'
    # f = open("urls.txt")
    # start_urls = [url.strip() for url in f.readlines()]
    # f.close()
    allowed_domains = ['cordis.europa.eu']
    start_urls = ['http://cordis.europa.eu/project/rcn/%d_en.html' %(n) for n in range(210216, 210217)]
    # Max EU CORDIS 213445

    # def parse_keywordpage(self, response):
    # if water in response.xpath('//*[@id="ica:content"]'):
    def parse(self, response):
        # Misconfiguration to check - eu in response.xpath not needed
        #for eu in response.xpath('//*[@id="container-pack"]'):
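Note on the `start_urls` line in this file: `range(210216, 210217)` produces a single record number, because the upper bound of `range` is exclusive. A minimal sketch of building the URL list over an inclusive range is given here. The names `URL_TEMPLATE` and `project_urls` are hypothetical and are not part of this commit.

```python
# Hypothetical helper, not part of the commit: builds CORDIS project URLs
# for an inclusive range of record numbers.
URL_TEMPLATE = 'http://cordis.europa.eu/project/rcn/%d_en.html'


def project_urls(first_rcn, last_rcn):
    """Return one URL per record number from first_rcn to last_rcn, inclusive."""
    return [URL_TEMPLATE % n for n in range(first_rcn, last_rcn + 1)]


if __name__ == "__main__":
    # Same single page the spider requests with range(210216, 210217).
    for url in project_urls(210216, 210216):
        print(url)
```

Assigning `start_urls = project_urls(210216, 210216)` in the spider class would request the same single page as the committed line.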
15 changes: 9 additions & 6 deletions spiders/cordis_spider.py
@@ -6,15 +6,18 @@

class CordisSpider(scrapy.Spider):
    name = 'cordis'
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
    # allowed_domains = ['cordis.europa.eu']
    # start_urls = ['http://cordis.europa.eu/project/rcn/%d_en.html' %(n) for n in range(210216, 210217)]
    # f = open("urls.txt")
    # start_urls = [url.strip() for url in f.readlines()]
    # f.close()
    allowed_domains = ['cordis.europa.eu']
    start_urls = ['http://cordis.europa.eu/project/rcn/%d_en.html' %(n) for n in range(210216, 210217)]

    # def parse_keywordpage(self, response):
    # if water in response.xpath('//*[@id="ica:content"]'):
    def parse(self, response):
        # Misconfiguration to check - eu in response.xpath not needed
        #for eu in response.xpath('//*[@id="container-pack"]'):
        if response.xpath('//*[@id="ica:content"][contains(.,"water")]'):
            item = CordisItem()
            item['Meta'] = response.xpath('/html/head/meta[23]').extract()
            item['Project_ACR'] = response.xpath('//*[@id="dynamiccontent"]/div[1]/h1/text()').extract()
@@ -34,4 +37,4 @@ def parse(self, response):

            #for eu in response.css('div.objective'):
            item['Technology_Description'] = response.css('p::text').extract_first()
            yield item
            yield item
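Note on the `contains(.,"water")` predicate added in `parse`: string comparison in XPath 1.0 is case-sensitive, so a page whose text contains only "Water" would not match. A common workaround is to lowercase the node text with `translate()` before testing it. The sketch below only builds such an expression. `keyword_xpath` is a hypothetical helper, not part of this commit, and it assumes ASCII keywords that contain no double quotes.

```python
import string


def keyword_xpath(element_id, keyword):
    """Build an XPath 1.0 expression that selects the element with the given
    id only if its text contains `keyword`, ignoring ASCII letter case.

    Hypothetical helper: assumes `keyword` and `element_id` contain no
    double quotes, since no escaping is done.
    """
    upper = string.ascii_uppercase
    lower = string.ascii_lowercase
    return (
        f'//*[@id="{element_id}"]'
        f'[contains(translate(., "{upper}", "{lower}"), "{keyword.lower()}")]'
    )


if __name__ == "__main__":
    print(keyword_xpath("ica:content", "Water"))
```

In the spider, the condition could then read `if response.xpath(keyword_xpath("ica:content", "water")):`.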
Binary file modified spiders/cordis_spider.pyc
Binary file not shown.
3 changes: 3 additions & 0 deletions spiders/keyword-draft.txt
@@ -0,0 +1,3 @@
def parse_keywordpage(self, response):
    if keyword in response.body:
        #do something
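Note on the draft check `keyword in response.body`: in Scrapy, `response.body` is a bytes object, so under Python 3 testing a `str` keyword against it raises a `TypeError`. A minimal sketch of a check that decodes the body first and ignores case follows. The function name `body_contains_keyword` and the default encoding are assumptions made for illustration and are not part of this commit.

```python
def body_contains_keyword(body, keyword, encoding="utf-8"):
    """Return True if `keyword` occurs in the decoded page body, ignoring case.

    Hypothetical helper: `body` is the raw bytes of a response, and the
    UTF-8 default is an assumption about the page encoding.
    """
    text = body.decode(encoding, errors="replace")
    return keyword.casefold() in text.casefold()


if __name__ == "__main__":
    page = b"<p>Water critical infrastructures (CIs) are essential</p>"
    print(body_contains_keyword(page, "water"))
```

Inside a Scrapy callback on an HTML page, `response.text` already holds the decoded string, so `keyword.casefold() in response.text.casefold()` would be the equivalent test.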
