An approximate nearest-neighbor search for text reuse.

senderle/fandom-search

The Archive of Our Own script ao3.py can be used to scrape and analyze fanworks and to prepare the results for visualization in JavaScript. Searching the fanworks for n-gram matches requires a markup version of the script of the original work.

The basic workflow is below. It assumes you have a scripts folder, a fanworks folder, and a results folder, laid out in a particular structure that can be inferred from the example commands. (Sorry, very busy!) Take sw-all to be a stand-in for a folder of fanworks, sw-new-hope.txt to be a stand-in for a correctly formatted script, and sw-new-hope (without the .txt) to be a stand-in for the results folder for the given movie.
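
The folder layout can be sketched as below. This is only an inference from the paths that appear in the example commands, not a documented requirement of ao3.py, and the helper name `make_layout` is made up for illustration.

```python
from pathlib import Path


def make_layout(root, fandom="sw-all", movie="sw-new-hope"):
    """Create the folders that the example commands read from and write to.

    The structure is inferred from the example paths; ao3.py itself may
    create some of these folders on its own.
    """
    root = Path(root)
    folders = [
        root / "scripts",                          # e.g. scripts/sw-new-hope.txt
        root / "fanworks" / fandom / "html",       # target of `scrape`
        root / "fanworks" / fandom / "plaintext",  # target of `clean`
        root / "results" / movie,                  # target of `format` and `vis`
    ]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```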

A todo for this repo is to create options for where to save error and log files, and search results.

Another todo for this repo is to create more thorough documentation, especially of the script format, which is idiosyncratic but effective.

  • Scrape AO3 (Ooops! Currently broken!)

    python ao3.py scrape \
        -t "Star Wars - All Media Types" \
        -o fanworks/sw-all/html
    

The scrape command will save log and error files; check to see that the scrape went OK, and then move the (generically named) error file to fanworks/sw-all/sw-all-errors.txt.

  • Clean the HTML

    python ao3.py clean \
        fanworks/sw-all/html/ \
        -o fanworks/sw-all/plaintext/
    

The clean command will save an error file; check to see that the cleaning process went OK, and then rename the error file (this time in the root dir) from clean-html-errors.txt to sw-all-clean-errors.txt.

  • Perform the reuse search

    python ao3.py search \
        fanworks/sw-all/ \
        scripts/sw-new-hope.txt
    

The search command will create several separate CSV files (in some cases many, even hundreds). Each one contains the results for 500 fanworks. The script aggregates them automatically at the end of the process, but the separate files are also saved so that the results are still usable if the search is interrupted.
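
The batch-and-aggregate pattern described here can be sketched as follows. This is a generic illustration, not the code in ao3.py: the function names, the `batch-NNNN.csv` file naming, and the `search_one` callback are all hypothetical.

```python
import csv
from pathlib import Path

BATCH_SIZE = 500  # works per intermediate CSV, as described above


def search_in_batches(works, search_one, out_dir, fieldnames,
                      batch_size=BATCH_SIZE):
    """Write one CSV per batch so an interrupted run keeps finished batches.

    `search_one(work)` is a hypothetical callback returning a list of row dicts.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batch_files = []
    for start in range(0, len(works), batch_size):
        path = out_dir / f"batch-{start // batch_size:04d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for work in works[start:start + batch_size]:
                writer.writerows(search_one(work))
        batch_files.append(path)
    return batch_files


def aggregate(batch_files, out_path, fieldnames):
    """Concatenate the batch CSVs into one file with a single header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for path in batch_files:
            with open(path, newline="", encoding="utf-8") as f:
                writer.writerows(csv.DictReader(f))
```

If a run stops partway through, every batch file already written is complete, and `aggregate` can be called on those files alone.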

If the search completes without any errors, the final aggregated data will be in a file whose name carries a date stamp in YYYYMMDD format, something like match-6gram-20190604. Create a new folder results/sw-new-hope/20190604/, and move all the CSV files into that folder.
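
Moving the CSV files into a dated folder can be scripted along these lines. This is a minimal sketch: `collect_results` is a hypothetical helper, and since this README does not say which directory the search writes its CSV files to, `src_dir` is left for the caller to supply.

```python
import shutil
from datetime import date
from pathlib import Path


def collect_results(src_dir, results_root, when=None):
    """Move every CSV in `src_dir` into results_root/YYYYMMDD/.

    `when` defaults to today's date; pass the date of the search run if the
    files were produced on an earlier day.
    """
    stamp = (when or date.today()).strftime("%Y%m%d")  # e.g. 20190604
    target = Path(results_root) / stamp
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for csv_file in sorted(Path(src_dir).glob("*.csv")):
        dest = shutil.move(str(csv_file), str(target / csv_file.name))
        moved.append(Path(dest))
    return moved
```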

  • Aggregate the results over the script (i.e. "format" the results)

    python ao3.py format \
        results/sw-new-hope/20190604/match-6gram-20190604.csv \
        scripts/sw-new-hope.txt \
        -o results/sw-new-hope/fandom-data-new-hope.csv
    
  • Create a Bokeh visualization of the aggregated results

    python ao3.py vis \
        results/sw-new-hope/fandom-data-new-hope.csv \
        -o results/sw-new-hope/new_hope_reuse.html
    

This is not a perfect workflow and needs to be tidied up in several ways. I will get around to that someday.

usage: ao3.py [-h] {scrape,clean,getmeta,search,matrix,format} ...

process fanworks scraped from Archive of Our Own.

positional arguments:
  {scrape,clean,getmeta,search,matrix,format}
                        scrape, clean, getmeta, search, matrix, or format
    scrape              find and scrape fanfiction works from Archive of Our
                        Own
    clean               takes a directory of html files and yields a new
                        directory of text files
    getmeta             takes a directory of html files and yields a csv file
                        containing metadata
    search              compare fanworks with the original script
    matrix              deduplicates and builds matrix for best n-gram matches
    format              takes a script and outputs a csv with sentiment
                        information for each word formatted for javascript
                        visualization

optional arguments:
  -h, --help            show this help message and exit

There are three scraping options for Archive of Our Own: (1) Use the '-s' option to provide a search term and see a list of possible tags. (2) Use the '-t' option to scrape fanworks from a tag. (3) Use the '-u' option to scrape fanworks from a URL. The URL should be to the /works page, e.g. https://archiveofourown.org/tags/Rogue%20One:%20A%20Star%20Wars%20Story%20(2016)/works
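
For the '-u' option, a /works URL of the form shown above can be built from a tag name by percent-encoding it. The sketch below is an illustration only: `works_url` is a hypothetical helper, and it assumes a tag with no characters that AO3 might encode in its own way (slashes or ampersands, for example).

```python
from urllib.parse import quote

AO3_TAGS = "https://archiveofourown.org/tags/"


def works_url(tag):
    """Build the /works URL for a tag by percent-encoding the tag name.

    Assumption: the tag is a simple one. Spaces are escaped as %20, while
    colons and parentheses are left as-is to match the example URL.
    """
    return AO3_TAGS + quote(tag, safe=":()") + "/works"


print(works_url("Rogue One: A Star Wars Story (2016)"))
```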

usage: ao3.py scrape [-h] [-s SEARCH | -t TAG | -u URL] [-o OUT]
                     [-p STARTPAGE]

optional arguments:
  -h, --help            show this help message and exit
  -s SEARCH, --search SEARCH
                        search term to search for a tag to scrape
  -t TAG, --tag TAG     the tag to be scraped
  -u URL, --url URL     the full URL of first page to be scraped
  -o OUT, --out OUT     target directory for scraped html files
  -p STARTPAGE, --startpage STARTPAGE
                        page on which to begin downloading (to resume a
                        previous job)
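
The `[-s SEARCH | -t TAG | -u URL]` part of the usage line means the three options are mutually exclusive. A minimal argparse sketch of such an interface is below. It is reconstructed from the help text above, not taken from ao3.py, and the `type=int` on `--startpage` is an assumption.

```python
import argparse


def build_scrape_parser():
    """Rebuild the `scrape` interface shown in the usage text above."""
    parser = argparse.ArgumentParser(prog="ao3.py scrape")
    source = parser.add_mutually_exclusive_group()  # [-s | -t | -u]
    source.add_argument("-s", "--search",
                        help="search term to search for a tag to scrape")
    source.add_argument("-t", "--tag", help="the tag to be scraped")
    source.add_argument("-u", "--url",
                        help="the full URL of first page to be scraped")
    parser.add_argument("-o", "--out",
                        help="target directory for scraped html files")
    # Assumption: the start page is parsed as an integer.
    parser.add_argument("-p", "--startpage", type=int,
                        help="page on which to begin downloading")
    return parser


args = build_scrape_parser().parse_args(
    ["-t", "Star Wars - All Media Types", "-o", "fanworks/sw-all/html"])
print(args.tag, args.out)
```

With a parser built this way, passing both `-s` and `-t` makes argparse exit with an error instead of silently choosing one.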

Clean and convert the scraped html files into plain text files.

usage: ao3.py clean [-h] [-o O] i

positional arguments:
  i           directory of input html files to clean

optional arguments:
  -h, --help  show this help message and exit
  -o O        target directory for output txt files

Extract Archive of Our Own metadata from the scraped html files.

usage: ao3.py getmeta [-h] [-o O] i

positional arguments:
  i           directory of input html files to process

optional arguments:
  -h, --help  show this help message and exit
  -o O        filename for metadata csv file

The search process compares fanworks with the original work script and is based on 6-gram matches.

usage: ao3.py search [-h] d s

positional arguments:
  d           directory of fanwork text files
  s           filename for markup version of script

optional arguments:
  -h, --help  show this help message and exit
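
To make the idea of a 6-gram match concrete, here is a minimal sketch that finds word 6-grams shared verbatim by a script and a fanwork. It is a simplification, not the search in ao3.py: the repository describes itself as an approximate nearest-neighbor search, whereas this example matches exactly, and the tokenization (lowercased runs of letters, digits, and apostrophes) is an assumption.

```python
import re
from collections import defaultdict


def ngrams(text, n=6):
    """Return (position, n-gram) pairs over lowercase word tokens."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [(i, tuple(words[i:i + n])) for i in range(len(words) - n + 1)]


def exact_matches(script_text, fan_text, n=6):
    """Find n-grams that occur verbatim in both texts.

    Returns (script_position, fan_position, n-gram) triples.
    """
    index = defaultdict(list)
    for pos, gram in ngrams(script_text, n):
        index[gram].append(pos)
    return [(s_pos, f_pos, gram)
            for f_pos, gram in ngrams(fan_text, n)
            for s_pos in index.get(gram, [])]


script = "Help me, Obi-Wan Kenobi. You're my only hope."
fan = "She whispered: help me Obi-Wan Kenobi, you're my only chance."
for s_pos, f_pos, gram in exact_matches(script, fan):
    print(s_pos, f_pos, " ".join(gram))
```

Indexing the script's n-grams once and looking up each fanwork n-gram keeps the comparison roughly linear in the length of the two texts.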

The n-gram search results can be used to create a matrix.

usage: ao3.py matrix [-h] [-n N] i m

positional arguments:
  i           input csv file
  m           fandom/movie name for output file prefix

optional arguments:
  -h, --help  show this help message and exit
  -n N        n-gram size, default is 6-grams

The n-gram search results can be prepared for JavaScript visualization.

usage: ao3.py format [-h] [-o O] s

positional arguments:
  s           filename for markup version of script

optional arguments:
  -h, --help  show this help message and exit
  -o O        filename for csv output file of data formatted for visualization
