CrossOver is a web scraping toolset, initiated in 2021 with funding from the European Union’s programme on the financing of Pilot Projects and Preparatory Actions in the field of “Communications Networks, Content and Technology” under the grant agreement LC-01682253, as well as from the Mozilla Technology Fund. The project was led by EU DisinfoLab in collaboration with CheckFirst, Apache, and Savoir-Devenir. For more detailed information, please visit the project website.
Here is some of the key media coverage that provides insight into the impact and relevance of the CrossOver project:
- Nieuws in de Klas: This article discusses the role of CrossOver in combating disinformation in the digital age.
- Politico: This piece highlights the geopolitical implications of the CrossOver project.
- OpenFacto: This article delves into how Google's autocomplete function can inadvertently spread disinformation, and how tools like CrossOver can help mitigate this.
- Daar Daar: This piece discusses the role of CrossOver in tracking and analyzing Russian propaganda in Belgium.
- Science Media Hub: This article reviews the technological advancements of CrossOver in the field of data science and media.
To install CrossOver, run the following command in your terminal: `pip install .`
For YouTube search emulation, you will need to install youtube-dl, a required dependency. Use the following command: `pip install -e git+https://github.com/ytdl-org/youtube-dl#egg=youtube_dl`
Please note that these commands may vary depending on your setup environment.
Use `crossover -g -i queries.csv` to load an input file containing queries. For each entry, it returns the suggested autocomplete searches provided by the Google Autocomplete API. The responses are printed to stdout in a machine-readable format, along with a screenshot.
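For context, the sketch below shows roughly what such an autocomplete lookup involves. It is a minimal illustration, not CrossOver's actual implementation: it assumes the public `suggestqueries.google.com` suggest endpoint and a `queries.csv` file holding one search term per line.

```python
# Minimal sketch: fetch Google Autocomplete suggestions for each query.
# Assumes the public suggest endpoint and one query per line in queries.csv;
# CrossOver's own implementation may differ.
import csv
import json
import requests

SUGGEST_URL = "https://suggestqueries.google.com/complete/search"

with open("queries.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if not row:
            continue
        query = row[0]
        resp = requests.get(
            SUGGEST_URL,
            params={"client": "firefox", "q": query, "hl": "en"},
            timeout=10,
        )
        resp.raise_for_status()
        # With the "firefox" client the response is JSON: [query, [suggestion, ...]]
        suggestions = resp.json()[1]
        print(json.dumps({"query": query, "suggestions": suggestions}))
```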
Use `crossover -y -i queries.csv` to load an input file containing queries. It simulates web browsing to the YouTube search page, parses all results, and collects their metadata. It then collects metadata from the recommended videos for each search result. All metadata are printed to stdout in a machine-readable format.
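As an illustration of what collecting YouTube search metadata with youtube-dl can look like, here is a minimal sketch. It uses youtube-dl's `ytsearch` pseudo-URL rather than CrossOver's browsing emulation, and the printed fields are assumptions for the example, not the tool's exact output schema.

```python
# Minimal sketch: collect metadata for YouTube search results with youtube-dl.
# Uses the ytsearch pseudo-URL; CrossOver emulates browsing instead, so this
# only approximates the kind of metadata involved.
import json
import youtube_dl

ydl_opts = {"quiet": True, "extract_flat": True}  # metadata only, no downloads

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    # "ytsearch5:" requests the first five results for the query.
    info = ydl.extract_info("ytsearch5:media literacy", download=False)
    for entry in info.get("entries", []):
        print(json.dumps({
            "id": entry.get("id"),
            "title": entry.get("title"),
            "url": entry.get("url"),
        }))
```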
Use `crossover -r -i queries.csv` to load an input file containing queries. It returns the trending posts of the corresponding subreddit. The returned metadata fields include ID, author, title, number of likes, number of comments, URL to the post, timestamp of content scraping, and timestamp shift (estimation of the post's age at the time of scraping).
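For reference, the fields listed above can be approximated with Reddit's public JSON listing, as in the sketch below. This is a hypothetical illustration built on the `/r/<subreddit>/hot.json` endpoint, not necessarily the method CrossOver itself uses.

```python
# Minimal sketch: fetch trending ("hot") posts of a subreddit and emit fields
# similar to those listed above. Uses Reddit's public JSON listing; CrossOver
# may collect these fields differently.
import json
import time
import requests

def trending_posts(subreddit, limit=10):
    resp = requests.get(
        f"https://www.reddit.com/r/{subreddit}/hot.json",
        params={"limit": limit},
        headers={"User-Agent": "crossover-example/0.1"},
        timeout=10,
    )
    resp.raise_for_status()
    scraped_at = int(time.time())
    for child in resp.json()["data"]["children"]:
        post = child["data"]
        yield {
            "id": post["id"],
            "author": post["author"],
            "title": post["title"],
            "likes": post["score"],
            "comments": post["num_comments"],
            "url": "https://www.reddit.com" + post["permalink"],
            "scraped_at": scraped_at,
            # Estimated age of the post at scraping time, in seconds.
            "timestamp_shift": scraped_at - int(post["created_utc"]),
        }

for post in trending_posts("europe"):
    print(json.dumps(post))
```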
As of July 2023, due to changes applied by Twitter, the Twitter feature is no longer maintained.
Please refer to the LICENSE file for licensing information.
© 2021 Check First OY