This is a library to scrape and reconcile all payments made by a hierarchy of NHS institutions over time (c. 2010 to 2020). It is the last of three projects on public procurement data (the first two being centgovspend and TSRC-NCVO-CSDP). Code for an interactive dashboard, built with the help of Ian M. Knowles, is found at src/dashboard. Links to open-access (OSF) versions of the two headline academic papers which use this dataset ("The Role of Non-Profits in Public Health Service Provision: Evidence from 25,338 heterogeneous procurement datasets" with John Mohan and "Is outsourcing healthcare services to the private sector associated with higher mortality rates? An observational analysis of privatisation in England's NHS, 2013-2020" by Ben Goodair and Aaron Reeves) can be found here and here. Please cite the former of these two papers as:
Rahal, C. and Mohan, J. (2024), 'The role of the third sector in public health service provision: Evidence from 25,338 heterogeneous procurement datasets', Journal of the Royal Statistical Society: Series A, 0, pp. 1-22.
A full, build-passing notebook for the first of these two papers can be found here. Kindly note that there is a minor typographical error in the caption of Table 2; it should read:
"Top 10 institutions by procurement value mapped to the Charity Commission, ordered by the cumulative value of all contracts they receive (£Mn). Count refers to the number of payments made in our datasets. 'Income' (£Mn) and 'Rank' refer to their total income for the full years in the Charity Commission during our sample window."
If you would like to collaborate on related work, please don't hesitate to get in touch! Two spin-off repositories, specifically for pdf-parsing and institutional data curation, can be found here and here respectively. Thanks again to Ian M. Knowles for his help with them. One of our next projects involves making a database of all third sector organisations. You can find the GitHub organisational repository for that here.
NHSSpend tries to minimize the number of pre-requisite installations outside of the standard library, and we recommend an Anaconda installation to provide a comprehensive set of basic tools. However, a few additional modules are necessary due to the magnitude of the undertaking; these are listed in the requirements.txt file (generated by pipreqs). The pdfparser is based on a version of the pdftableparser library, and the Charity Commission data is extracted using the charity-commission-extract library from NCVO. The Elasticsearch functionality is a custom implementation.
The data originates from one of two lists of recognised NHS institutions (Trusts and CCGs) and the main NHS England data provision page. These lists (data/data_support/ccg_list.xlsx and data/data_support/trust_list.xlsx) are used to create mappings to websites and to record the status of the data, with a number of different parameters fed into the scraper (src/NHSscraper.py). The data curation exercise stopped in April 2020 in order to focus on the analysis of the data (with the compressed datasets found in the data/merged/* subdirectory of this repository). This is also partly due to the Covid-19 pandemic and the restructuring of Clinical Commissioning Groups more generally (where 18 mergers took the number of CCGs from 191 to 136). However, please do raise issues on here if you think any of those institutions are mislabelled or outdated. If you want to update this list (and the subsequent scrapers), please do raise an issue or get in touch (this is an ongoing work in progress until there is a centrally convened resource provided by the Government Data Service).
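To make the scraping step more concrete, here is a minimal sketch of how links to spend files might be discovered on an institution's web page. It is not the repository's scraper: the class `SpendLinkParser`, the function `find_spend_links`, the `SPEND_EXTENSIONS` tuple and the example URLs are all hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical set of file types that a spend-data scraper might look for.
SPEND_EXTENSIONS = (".csv", ".xls", ".xlsx", ".pdf")


class SpendLinkParser(HTMLParser):
    """Collect absolute URLs of anchors that point at likely spend files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string and compare the extension case-insensitively.
        if href and href.lower().split("?")[0].endswith(SPEND_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


def find_spend_links(html, base_url):
    """Return every candidate spend-file URL found in an HTML string."""
    parser = SpendLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Illustrative page content only; a real run would first download the page
# for each institution listed in the support spreadsheets.
example_html = """
<a href="/files/spend-over-25k-jan-2019.csv">January 2019</a>
<a href="about.html">About us</a>
<a href="https://example.org/docs/Spend_Feb_2019.PDF">February 2019</a>
"""
print(find_spend_links(example_html, "https://example.org/publications/"))
```

Relative links are resolved against the page URL with `urljoin`, so the same function works whether an institution publishes relative or absolute paths.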
The procurement data itself is provided under an Open Government Licence (OGL). Guidance for publishing spend over £25,000 is published by HM Treasury.
The es_configure.md file describes the reconciliation approach. These reconciliations are then manually verified and merged back into the procurement data.
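As a rough illustration of what reconciling supplier names against a register involves, the sketch below uses `difflib` from the standard library as a stand-in for the Elasticsearch-based approach; it is not the method documented in es_configure.md. The functions `normalise` and `reconcile`, the suffix list, the 0.85 cutoff and the example names are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical list of legal-form suffixes to strip before matching.
LEGAL_SUFFIXES = re.compile(r"\b(limited|ltd|plc|llp)\b")


def normalise(name):
    """Lower-case a name, drop punctuation and common legal suffixes."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    name = LEGAL_SUFFIXES.sub(" ", name)
    return " ".join(name.split())


def reconcile(suppliers, register, cutoff=0.85):
    """Map each raw supplier name to its best register match, or None.

    Unmatched names (None) and borderline matches are the ones that
    would be passed on for manual verification.
    """
    lookup = {normalise(entry): entry for entry in register}
    results = {}
    for supplier in suppliers:
        candidates = difflib.get_close_matches(
            normalise(supplier), list(lookup), n=1, cutoff=cutoff
        )
        results[supplier] = lookup[candidates[0]] if candidates else None
    return results


# Illustrative names only, not drawn from the dataset.
register = ["British Red Cross Society", "Marie Curie", "Capita PLC"]
suppliers = ["BRITISH RED CROSS SOCIETY LTD", "Marie Curie.", "Unknown Supplier 123"]
for raw, match in reconcile(suppliers, register).items():
    print(f"{raw!r} -> {match!r}")
```

Keeping the cutoff high favours precision over recall, which suits a workflow in which every proposed match is checked by hand before being merged back into the payments data.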
It is possible that you are reading this because you are mostly interested in a copy of the output data! The scraped, parsed, cleaned and reconciled data can be found at NHSSpend/data/data_final. Please see the readme.md in that subdirectory for information on each of the fields.
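For readers who only want to work with the output data, the following is a minimal sketch of aggregating payments by supplier. It assumes a CSV export with columns named `supplier` and `amount`; both names, the function `total_by_supplier` and the sample rows are hypothetical, so check the readme.md in the data subdirectory for the actual field names and file format.

```python
import csv
import io
from collections import defaultdict


def total_by_supplier(rows, supplier_field="supplier", amount_field="amount"):
    """Sum payment amounts per supplier from an iterable of dict rows.

    The default field names are placeholders, not the dataset's real ones.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[supplier_field]] += float(row[amount_field])
    return dict(totals)


# Illustrative rows only; in practice, open a file from data/data_final
# and pass csv.DictReader(file_handle) instead.
sample = io.StringIO(
    "supplier,amount\n"
    "Example Charity,30000.00\n"
    "Example Company Ltd,45000.50\n"
    "Example Charity,26000.00\n"
)
totals = total_by_supplier(csv.DictReader(sample))
for supplier, amount in sorted(totals.items()):
    print(f"{supplier}: £{amount:,.2f}")
```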
The repo structure below is based on the output of the tree utility.
├ readme.md
├ es_configure.md
├ requirements.txt
├ src
│ ├ analysis
│ │ ├ charity_analysis_notebook.ipynb
│ │ ├ general_analysis_functions.py
│ │ ├ helper_functions.py
│ │ └ charity_analysis_functions.py
│ ├ scrape_and_parse_ccgs.py
│ ├ scrape_and_parse_trusts.py
│ ├ scraping_tools.py
│ ├ generate_output.py
│ ├ ingest_everything.py
│ ├ merge_and_evaluate_tools.py
│ ├ NHSSpend.py
│ ├ parsing_tools.py
│ ├ pdf_table_parser.py
│ └ preconciliation.py
├ dashboard
├ data
│ ├ data_support/*
│ ├ data_cc/*
│ ├ data_ch/*
│ ├ data_dashboard/*
│ ├ data_final/*
│ ├ data_masteringest/*
│ ├ data_merge/*
│ ├ data_nhsccgs/*
│ ├ data_nhsdigital/*
│ ├ data_nhsengland/*
│ ├ data_nhstrusts/*
│ ├ data_reconciled/*
│ ├ data_shapefiles/*
│ └ data_summary/*
├ papers
│ ├ corporate_networks
│ ├ figures
│ ├ tables
│ └ third_sector
├ logging
│ ├ nhsspend.log
│ └ eval_logs
└ tokens
The authors are grateful for comments on earlier versions of the work from [Mark Exworthy](https://www.birmingham.ac.uk/staff/profiles/social-policy/exworthy-mark), David Stuckler, Martin McKee, Lucy Reynolds and James Rees. Technical research assistance was provided by Ian M. Knowles. This work originates from a scoping and prototyping exercise funded by the ESRC (grant numbers ES/M010392/1 and latterly ES/X000524/1), with majority funding latterly and gratefully acknowledged from the British Academy and the Leverhulme Trust (Grant RC-2018-003), the Leverhulme Centre for Demographic Science (LCDS), and Nuffield College. Insightful comments were gratefully received from participants at the International Conference for Administrative Data Research, the Economic Insights team at the Office for National Statistics, the Spatial Unit at the Department for Levelling Up, the Government Data Science Community Meetup, two editors, and two anonymous referees. Additional thanks are due to Max Hattersly, Ben Goodair and Yu Pei for all of their work on data verification.
This code is made available under the GNU General Public License v3.0.
Last updated: 2024-10-14