
Personal-Identity-in-the-Treatise

Overview of Files

Usable Data

The data_out folder contains the data generated by the scripts in this folder.

The two main data sets generated are blank_master_score_sheet and citation_data; both are CSV files. A third, paragraph_level_citation_data, is derived from citation_data (see below).

The blank_master_score_sheet.csv is a CSV that can be used as a template for scoring any set of citations with respect to which paragraphs of the Treatise those citations mention.

The citation_data.csv is a CSV that holds the current, complete scoring of every citation isolated from the search texts so far. Each row represents a citation. The columns are: the filename (originally of the downloaded PDF, but now tied to the .txt file searched in search_texts); the author of the publication; the publication title; the publication year; the actual text of the citation captured by the search; and a list of pairs, where the first entry in each pair is a paragraph to be credited on the basis of the citation and the second entry is the proportion of the citation's credit due to that paragraph.
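As a minimal sketch of how this file might be loaded, assuming the column names used here ('filename', 'pairs'), which are placeholders rather than the file's actual header:

```python
import ast
import csv

# Read citation_data.csv; the final column is assumed to hold a
# Python-style list of (paragraph, credit) pairs, e.g.
# "[('1.4.6.3', 0.5), ('1.4.6.4', 0.5)]".
with open('data_out/citation_data.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        for paragraph, credit in ast.literal_eval(row['pairs']):
            print(row['filename'], paragraph, credit)
```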

The paragraph_level_citation_data is an expanded version of citation_data.csv, expanded to make it a bit easier (for me, at least) to use with Tableau for visualizations. The original citation_data.csv ties each citation to a list of (paragraph, credit) pairs. This new data set instead creates one entry per credited paragraph: it keeps all of the original bibliographic information from the citation but replicates it for each paragraph the citation credits.

script_to_generate_paragraph_level_citation_data.py is the script that generates the above data. It reads the citation_data.csv file as input and outputs the paragraph_level_citation_data file described above.
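A sketch of the fan-out idea behind the script (not the script itself; the 'pairs' column name is an assumption):

```python
import ast
import csv

# Fan each citation row out into one row per credited paragraph,
# copying the bibliographic columns unchanged.
with open('data_out/citation_data.csv', newline='', encoding='utf-8') as src, \
     open('data_out/paragraph_level_citation_data.csv', 'w', newline='', encoding='utf-8') as dst:
    reader = csv.DictReader(src)
    fields = [f for f in reader.fieldnames if f != 'pairs'] + ['paragraph', 'credit']
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        for paragraph, credit in ast.literal_eval(row.pop('pairs')):
            writer.writerow({**row, 'paragraph': paragraph, 'credit': credit})
```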

Publication Texts to Search

The search_texts folder contains searchable text versions of all of the articles from the PhilPapers-Hume-Personal Identity leaf that have been acquired so far. They were originally acquired in PDF format from the journals or books in which they were published and have been converted into .txt documents with all whitespace removed to make searches more consistent.
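A one-function sketch of the whitespace stripping described above (the actual conversion pipeline may differ):

```python
import re

def normalize_for_search(raw: str) -> str:
    """Strip all whitespace so that searches are not broken up by line
    wraps or PDF-extraction artifacts."""
    return re.sub(r'\s+', '', raw)
```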

An overview of the entire PhilPapers-Hume-Personal Identity leaf can be found in the biblio_info.csv file. That spreadsheet lists every entry in the leaf along with its bibliographic information, a few columns recording whether the article has been acquired, and a brief note on the style the article uses to cite the Treatise.

Python and text files related to generating a representation of the Treatise to be used in scoring citations

[Full Treatise in HTML.txt](./Full Treatise in HTML.txt) is a full copy of the Treatise in .txt format, acquired from the davidhume.org website. This file was used to generate a list of all the paragraphs in the Treatise. It does not, however, contain the text of the Abstract of the Treatise; that is a separate file I need to add from my computer.

[Norton to SBN Dictionary.txt](./'Norton to SBN Dictionary.txt') is a .txt file in which each line is a pair separated by ' : '. The first part of the pair is a paragraph number from the Treatise; the second part is a list, separated by ',', of the page numbers from the SBN edition of the Treatise on which the paragraph appears. The file can be imported to create a paragraph-to-page-number dictionary. These connections are generated from the Full Treatise in HTML.txt file, as each paragraph there is separated by a line that lists the paragraph number followed by the range of pages the paragraph spans.
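A minimal sketch of a loader for this format (the repository's own loader lives in treatise_reference_data.py):

```python
def load_norton_to_sbn(path='Norton to SBN Dictionary.txt'):
    """Parse the ' : '-separated lines described above into a dictionary
    mapping each Treatise paragraph to its list of SBN pages."""
    norton_to_sbn = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            if ' : ' not in line:
                continue
            paragraph, pages = line.strip().split(' : ', 1)
            norton_to_sbn[paragraph] = [p.strip() for p in pages.split(',')]
    return norton_to_sbn
```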

[SBN to Norton Dictionary.txt](./'SBN to Norton Dictionary.txt') is a .txt file in which each line is a pair separated by ' : '. The first part of the pair is a page number from the SBN edition of the Treatise; the second member of the pair is a list of the paragraphs that appear on that page. (This file is generated from Norton to SBN Dictionary.txt as a quick lookup to be used in scoring.)
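A sketch of the inversion that produces this lookup, building on the loader sketched above:

```python
def invert_norton_to_sbn(norton_to_sbn):
    """Invert the paragraph-to-pages mapping into a page-to-paragraphs
    lookup, mirroring how SBN to Norton Dictionary.txt is generated."""
    sbn_to_norton = {}
    for paragraph, pages in norton_to_sbn.items():
        for page in pages:
            sbn_to_norton.setdefault(page, []).append(paragraph)
    return sbn_to_norton
```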

treatise_reference_data.py This Python file imports the data from Norton to SBN Dictionary.txt and SBN to Norton Dictionary.txt to create usable dictionaries from those data sets, a list of every paragraph in the Treatise, a unique list of all SBN pages in the Treatise, and a master score sheet: a dictionary linking each paragraph in the Treatise to a number representing how frequently it is cited in the scholarly literature.
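A sketch of those derived structures, using the loaders sketched earlier (variable names here are assumptions, not the module's actual names):

```python
norton_to_sbn = load_norton_to_sbn()
sbn_to_norton = invert_norton_to_sbn(norton_to_sbn)
all_paragraphs = list(norton_to_sbn)                  # every paragraph in the Treatise
all_sbn_pages = sorted(set(sbn_to_norton))            # unique list of SBN pages
master_score_sheet = {p: 0 for p in all_paragraphs}   # citation counts, initially zero
```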

master_score_sheet_script.py When executed, this file draws on treatise_reference_data.py to generate a blank master_score_sheet in the data_out folder.
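A sketch of what that generation might amount to, assuming the master_score_sheet dictionary from the previous sketch (the output header is an assumption):

```python
import csv

# Write the blank score sheet: one row per paragraph, each with a zero score.
with open('data_out/blank_master_score_sheet.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['paragraph', 'score'])
    for paragraph, score in master_score_sheet.items():
        writer.writerow([paragraph, score])
```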

Python files related to searching articles for citations

Article_and_Citation_Classes.py This Python file contains the two data structures that are used for extracting citation data out of articles; a minimal sketch of both appears after the list below.

  • First, we have a class that represents individual articles or publications, called the Paper class. To make instances easy to look up by filename, this file also creates a dictionary linking the filename of each article (extracted from biblio_info.csv) to the Paper instance created for it. The Paper object keeps track of the bibliographic information of the article as well as all of the citations extracted from it. It contains four different methods for searching the article and recording different types of citations, as well as an open-ended search that takes a search string as input. To make the Paper class fully functional, the file also creates a file-to-path dictionary for each instance created from the biblio_info.csv spreadsheet; for current purposes, the path for each file is simply set to /search_texts. The search strings used to extract citations are located in this class.

  • Second, the Citation class is a data structure that stores each citation. It tracks the paper it came from, the order in which the search discovered it, the start and end points of the text discovered, a 'cleaned citation' version of that text, and a method for generating the list of paragraphs mentioned in the citation along with the credit each mentioned paragraph is due. It also has methods for extracting the text surrounding each citation, specified by the number of characters to pull on either side.
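A minimal sketch of both classes, under the caveat that the attribute and method names here are assumptions rather than the repository's exact API:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Citation:
    """One captured citation."""
    source_file: str
    order: int         # order in which the search discovered it
    start: int         # start offset of the matched text
    end: int           # end offset of the matched text
    cleaned_text: str

    def surrounding_text(self, full_text: str, chars: int = 200) -> str:
        # Pull the requested number of characters on either side of the hit.
        return full_text[max(0, self.start - chars):self.end + chars]

@dataclass
class Paper:
    """One publication plus the citations extracted from it."""
    filename: str
    author: str
    title: str
    year: str
    citations: list = field(default_factory=list)

    def open_ended_search(self, text: str, pattern: str) -> None:
        # Record every match of `pattern` in the article text as a Citation.
        for i, m in enumerate(re.finditer(pattern, text)):
            self.citations.append(
                Citation(self.filename, i, m.start(), m.end(), m.group()))
```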

master_function_list.py This file contains all of the functions related to parsing, processing, and scoring citations in either the SBN (i.e., page-number) or Norton (i.e., paragraph-number) format. In sum, these functions can take SBN page references as a single page, a list of pages, or a list of pages and page ranges, and return a list of the paragraphs that occur on all of those pages; the page numbers can be given in Arabic or Roman numerals. The functions have the same functionality for lists and ranges of paragraphs, and they can also handle chapter-level inputs rather than inputs that go all the way down to the grain of paragraphs. All of this draws on treatise_reference_data. These functions feed into two top-level functions, UltimateParser and UltimateScorer: the parser takes a citation and returns a list of (paragraph, score) pairs; the scorer computes those scores and then updates a master_score_sheet.
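A simplified illustration of the kind of page parsing described above, limited to Arabic numerals and simple ranges (the repository's functions also handle Roman numerals, paragraph ranges, and chapter-level inputs):

```python
def expand_sbn_pages(spec: str) -> list:
    """Expand an SBN page spec such as '253, 255-257' into
    [253, 255, 256, 257]."""
    pages = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = (int(p) for p in part.split('-', 1))
            pages.extend(range(lo, hi + 1))
        else:
            pages.append(int(part))
    return pages

# The paragraphs on those pages could then be gathered with the
# page-to-paragraph lookup sketched earlier:
# paragraphs = [p for page in expand_sbn_pages('253, 255-257')
#               for p in sbn_to_norton.get(str(page), [])]
```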

data_generating_script.py. This script, when executed, draws on the treatise reference data and the article and citation classes. It generates three sets of citations: the first is based on Norton-style citations; the second on SBN-style citations; the third pulls easily identifiable citations to the Treatise contained inside parentheses (easily identifiable because they start with 'T' and end with just a page number). The script then aggregates these lists, scores them on a master_score_sheet, and finally outputs the citation_data.csv file in the data_out folder.
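A hedged illustration of that third search; the script's actual patterns are likely more elaborate:

```python
import re

# Parenthesized citations that start with 'T' and end with a bare page
# number, e.g. '(T 253)'.
T_PAGE_PATTERN = re.compile(r'\(T[\s.,]*\d+\)')

sample = "Hume denies any simple, continued self (T 253)."
print(T_PAGE_PATTERN.findall(sample))  # ['(T 253)']
```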

In a way, this script is all you need to generate the data used in the analysis.