A Python application that reads a legal case document in text format and highlights the citations in the text for easier navigation and impact analysis.

Legal citation is the practice of crediting and referring to authoritative documents and sources. The most common sources of authority cited are court decisions (cases), statutes, regulations, treaties, and scholarly writing. Typically, a proper legal citation informs the reader about a source's authority, how strongly the source supports the writer's proposition, its age, and other relevant information. This is an example citation to a United States Supreme Court case:
Griswold v. Connecticut, 381 U.S. 479, 480 (1965).
However, in very long documents, searching for citations is a time-consuming process. Identifying the patterns of citation text and predicting them is also a difficult NLP task. We have used a spaCy deep learning model for this purpose.
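To illustrate why fixed patterns alone fall short, here is a minimal sketch of a naive regular expression that covers only the "volume U.S. page, pin cite (year)" shape of the Griswold example above. It is not part of this project, which relies on a trained spaCy model; the names `US_REPORTS` and `find_us_reports_citations` are hypothetical.

```python
import re

# Hypothetical baseline: matches only "<volume> U.S. <page>[, <pin cite>] (<year>)".
# Other reporters, statutes, and short-form citations are not covered.
US_REPORTS = re.compile(
    r"\b(\d{1,3}) U\.S\. (\d{1,4})(?:, (\d{1,4}))? \((\d{4})\)"
)


def find_us_reports_citations(text):
    """Return every substring of `text` that matches the U.S. Reports shape."""
    return [match.group(0) for match in US_REPORTS.finditer(text)]


if __name__ == "__main__":
    sample = "Griswold v. Connecticut, 381 U.S. 479, 480 (1965)."
    print(find_us_reports_citations(sample))
```

A pattern like this misses the case name, other reporters, and short forms such as "Id. at 485", which is the kind of variation that motivates a learned model.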

approach

Here's a quick guide to each of the files and information:

  • input folder: Keep the document in txt format in this folder
  • outputs folder: Contains the generated output files. The highlighted text is available as an HTML file and a Word document. The txt file here lists the predicted citations.

Final command: python main.py input/<filename>.txt
Example: python main.py input/sample_input.txt

Use the following command to install and satisfy all requirements: pip install --user --requirement requirements.txt

How each file works:

  • spacyNL24Sep.py : This is the code for building the spaCy model, kept separately in the model_building_and_training folder. To build the model yourself, put the files train_data_and_labels.csv and test_data.csv in the same folder and adjust the Python script to match your dataframe.

  • predict_citation.py : This module takes a single raw text file as input and passes it through the model for prediction. The output is generated in CSV format: result_citation.csv (the result contains the filename and the citation text).

  • json_making.py : This module builds the JSON from the CSV file of predicted citations. A citation is considered valid if startid != -1 and length < 150. If duplicate citations are present, the positional indexes of all occurrences are checked and the records are kept accordingly.

  • Coref_jsoncreation.py : This file takes in the initial JSON data and adds anaphoric information (whether a short citation refers to some other citation, and the details of that citation).

  • Json_to_text_doc.py : This file takes in the final JSON data and the raw text file as input. It generates the output as a text file, and also generates a docx file that contains all the highlighted citations.

  • Citationhtmlutils.py : This takes in the raw text file and the text generated by Json_to_text_doc.py as input, and generates the HTML text with highlighted citations.
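The validity check and the highlighting step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in json_making.py or Citationhtmlutils.py: it assumes startid is the index returned by `str.find`, that duplicates are resolved by recording every occurrence, and that highlighting uses `<mark>` tags. The function names and record fields are hypothetical.

```python
import html

MAX_CITATION_LENGTH = 150  # threshold quoted in the json_making.py description


def locate_citations(text, predicted):
    """Return one record per occurrence of each valid predicted citation.

    Assumption: a citation is valid when it is found in the text
    (start index != -1) and is shorter than MAX_CITATION_LENGTH.
    """
    records = []
    seen = set()
    for citation in predicted:
        if not citation or citation in seen:
            continue  # skip empty strings and repeated predictions
        seen.add(citation)
        start = text.find(citation)
        if start == -1 or len(citation) >= MAX_CITATION_LENGTH:
            continue  # invalid under the rule above
        # Assumed duplicate handling: record every position of the string.
        while start != -1:
            end = start + len(citation)
            records.append({"text": citation, "start": start, "end": end})
            start = text.find(citation, end)
    return sorted(records, key=lambda rec: rec["start"])


def highlight_html(text, records):
    """Wrap each located citation in <mark> tags and escape all text."""
    parts = []
    cursor = 0
    for rec in records:
        if rec["start"] < cursor:
            continue  # skip spans overlapping one already highlighted
        parts.append(html.escape(text[cursor:rec["start"]]))
        span = html.escape(text[rec["start"]:rec["end"]])
        parts.append("<mark>" + span + "</mark>")
        cursor = rec["end"]
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


if __name__ == "__main__":
    doc = "See Griswold v. Connecticut, 381 U.S. 479 (1965) at length."
    found = locate_citations(doc, ["Griswold v. Connecticut, 381 U.S. 479 (1965)"])
    print(highlight_html(doc, found))
```

Keeping character offsets in each record means the same data can drive the text, docx, and HTML outputs without searching the document again.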

Sample input

input

Generated Outputs

Command line output:

cmd_output

Predicted citations in json:

json_output

Highlighted text:

html_output

Future work:

Precision and recall are currently not good enough, and many citations still go undetected. Other techniques, such as LSTMs or BERT, should be tried to improve the results.