diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/404.html b/404.html new file mode 100644 index 00000000..d994b310 --- /dev/null +++ b/404.html @@ -0,0 +1,4126 @@ + + + +
+ + + + + + + + + + + + + + +Our internships are aimed at current PhD students looking for an industrial placement of around five months with the right to work in the UK. The projects are focussed on innovation, in particular around getting the most value out of NHS data.
+The projects often have a focus on emerging data science techniques, so we advertise mainly to data science programmes. However, previous interns have come from other disciplines such as clinical, mathematics, computer science and bioinformatics, and have added huge value through the range of approaches and knowledge they bring.
+For more information and details on how to apply see the Scheme Overview page on the microsite
+For details on open projects see the Projects page on the microsite
+Available outputs from previous projects can also be seen at Previous Projects on the microsite
+Currently our interns are working on the following projects in two waves. These are the original briefs they applied to and their work and outputs will be available on our organisation GitHub.
+Wave 6 | February - July 2024 |
---|---|
 | NHS Language Corpus Extension |
 | Understanding Fairness and Explainability in Multi-modal Approaches within Healthcare |
Wave 7 | July - December 2024 |
 | Evaluating NER-focussed models and LLMs for identifying key entities in histopathology reports – working with GOSH DRIVE |
 | Investigating Privacy Concerns and Mitigations for Healthcare Language and Foundation Models |
We are the NHS England Data Science Team.
+We are passionate about getting the most value out of the data collected by NHS England and the wider NHS through applying innovative techniques in appropriate and well-considered ways.
+Our vision is:
+++ + +Embed ambitious yet accessible data science in health and care to help people live healthier longer lives
+
In NHSE data scientists are concentrated in the central team but also embedded across a number of other areas.
+Data Linkage
+The Data Linkage Hub aims to provide a unified, high-quality solution to the data linkage needs of NHS England. Data science is central to achieving this objective, and it covers many aspects, from the mathematical models of entity resolution and record linkage, to identifying and correcting linkage errors, assessing their impact on downstream applications, and ensuring quality.
+ +Central Data Science Team
+We develop and deploy data science products to make a positive impact on NHS patients and workforce. We investigate applying novel techniques that increase the insight we get from health-related data. We prioritise code-first ways of working, transparency and promoting best practice. We champion quality, safety and ethics in the application of methods and use of data. We have the remit to be open and collaborative and have the aim of sharing products with the wider healthcare community.
+See our Projects
+National SDE Team
+Working with customer researchers and analysts to identify how they can do their research, overcome and rectify data issues and use the platform and data to its fullest. There is also work to create products and tools that facilitate research in the environment such as data quality and completeness visualisations, example analysis and machine learning code as well as continuous improvement and increasing automation of the processes to get data both into the SDE and out through output checking.
+See SDE website.
+Other Embedded Data Scientists
+Across the organisation, individual data scientists are embedded within specific teams (including Workforce, Training and Education (WT&E); Medicines; Patient Safety; AI Lab; and Digital Channels).
+We come together through the data science assembly to align our professional development and standards.
+To support knowledge share of data science in healthcare we've put together a monthly newsletter with valuable insights, training opportunities and events.
+Note
+The newsletter is targeted towards members of the NHS England Data Science team, so some links may only be accessible to those with the necessary login credentials. However, the newsletter and its archive are available to all at the link above.
+We also support the NHS Data Science Community hosted in AnalystX, which is the home of spreading data science knowledge within the NHS. You can also learn a lot about data science from the other communities we support:
+ +Name | Role | Team | GitHub |
---|---|---|---|
Sarah Culkin | Deputy Director | Central Data Science Team | SCulkin-code |
Rupert Chaplin | Assistant Director | Central Data Science Team | rupchap |
Jonathan Hope | Data Science Lead | Central Data Science Team | JonathanHope42 |
Jonathan Pearson | Data Science Lead | Central Data Science Team | JRPearson500 |
Achut Manandhar | Data Science Lead | Central Data Science Team | achutman |
Jennifer Hall | Data Science Lead | Data Linking Hub | |
Simone Chung | Principal Data Scientist (Section Head) | Central Data Science Team | simonechung |
Efrosini Serakis | Principal Data Scientist (Section Head) | Central Data Science Team | efrosini-s |
Sam Hollings | Principal Data Scientist | Central Data Science Team | SamHollings |
Daniel Schofield | Principal Data Scientist (Section Head) | Central Data Science Team | danjscho |
Eladia Valles Carrera | Principal Data Scientist | Central Data Science Team | lilianavalles |
Paul Carroll | Principal Data Scientist (Section Head) | Central Data Science Team | pauldcarroll |
Elizabeth Johnstone | Principal Data Scientist (Section Head) | Central Data Science Team | LiziJohnstone |
Nicholas Groves-Kirkby | Principal Data Scientist (Section Head) | Central Data Science Team | ngk009 |
Divya Balasubramanian | Principal Data Scientist (Section Head) | Central Data Science Team | divyabala09 |
Giulia Mantovani | Principal Data Scientist (Section Head) | Data Linking Hub | GiuliaMantovani1 |
Angeliki Antonarou | Principal Data Scientist | National SDE Data Science Team | AnelikiA |
Kevin Fasusi | Principal Data Scientist | National SDE Data Science Team | KevinFasusi |
Jonny Laidler | Senior Data Scientist | Central Data Science Team | JonathanLaidler |
Mia Noonan | Senior Data Scientist | Central Data Science Team | amelianoonan1-nhs |
Sean Aller | Senior Data Scientist | Central Data Science Team | seanaller |
Hadi Modarres | Senior Data Scientist | Central Data Science Team | hadimodarres1 |
Michael Spence | Senior Data Scientist | Central Data Science Team | mspence-nhs |
Harriet Sands | Senior Data Scientist | Central Data Science Team | harrietrs |
Alice Tapper | Senior Data Scientist | Central Data Science Team | alicetapper1 |
Ben Wallace | Senior Data Scientist | Central Data Science Team | |
Jane Kirkpatrick | Senior Data Scientist | Central Data Science Team | |
Kenneth Quan | Senior Data Scientist | Central Data Science Team | quan14 |
Daniel Goldwater | Senior Data Scientist | Central Data Science Team | DanGoldwater1 |
Shoaib Ali Ajaib | Senior Data Scientist | National SDE Team | |
Marek Salamon | Senior Data Scientist | National SDE Team | |
Adam Hollings | Senior Data Scientist | National SDE Team | AdamHollings |
Oluwadamiloju Makinde | Senior Data Scientist | National SDE Team | |
Joseph Wilson | Senior Data Scientist | National SDE Team | josephwilson8-nhs |
Alistair Jones | Senior Data Scientist | National SDE Team | alistair-jones |
Nickie Wareing | Senior Data Scientist | National SDE Team | nickiewareing |
Helen Richardson | Senior Data Scientist | National SDE Team | helrich |
Humaira Hussein | Senior Data Scientist | National SDE Team | humairahussein1 |
Jake Kasan | Senior Data Wrangler (contract) | National SDE Team | |
Lucy Harris | Senior Data Scientist | Meds Team | |
Vithursan Vijayachandrah | Senior Data Scientist | Workforce, Training & Education Team | VithurshanVijayachandranNHSE |
Warren Davies | Data Scientist | Central Data Science Team | warren-davies4 |
Sami Sultan | Data Scientist | Workforce, Training & Education Team | SamiSultanNHSE |
Chaeyoon Kim | Data Scientist | Workforce, Training & Education Team | ChaeyoonKimNHSE |
Ilja Lomkov | Data Scientist | Workforce, Training & Education Team | IljaLomkovNHSE |
Thomas Bouchard | Data Science Officer | Central Data Science Team | tom-bouchard |
Catherine Sadler | Data Science Officer | Central Data Science Team | CatherineSadler |
William Poulett | Data Science Officer | Central Data Science Team | willpoulett |
Amaia Imaz Blanco | Data Science Officer | Central Data Science Team | amaiaita |
Xiyao Zhuang | Data Science Officer | Central Data Science Team | xiyaozhuang |
Scarlett Kynoch | Data Science Officer | Central Data Science Team | scarlett-k-nhs |
Jennifer Struthers | Data Science Officer | Central Data Science Team | jenniferstruthers1-nhs |
Matthew Taylor | Data Science Officer | Central Data Science Team | mtaylor57 |
Elizabeth Kelly | Data Science Officer | National SDE Team | ejkcode |
++Reproducible analytical pipelines (RAP) help ensure all published statistics meet the highest standards of transparency and reproducibility. Sam Hollings and Alistair Bullward share their insights on adopting RAP and give advice to those starting out.
+
Reproducible analytical pipelines (RAP) are automated statistical and analytical processes that apply to data analysis. RAP is a key part of national strategy and is widely used in the civil service.
+Over the past year, we’ve been going through a change programme and adopting RAP in our Data Services directorate. We’re still in the early stages of our journey, but already we’ve accomplished a lot and had some hard-learnt lessons.
+ + + +This is about analytics and data, but knowledge of RAP isn’t just for those cutting code day-to-day. It’s crucial that senior colleagues understand the levels and benefits of RAP and get involved in promoting this new way of working and planning how we implement it.
+This improves the lives of our data analysts and the quality of our work.
+ + + + + + + + + + + + + + + + + + + +++ + +Over recent years, larger, more data-intensive Language Models (LMs) with greatly enhanced performance have been developed. The enhanced functionality has driven widespread interest in adoption of LMs in Healthcare, owing to the large amounts of unstructured text data generated within healthcare pathways.
+However, with this heightened interest, it becomes critical to comprehend the inherent privacy risks associated with these LMs, given the sensitive nature of Healthcare data. This PhD Internship project sought to understand more about the Privacy-Risk Landscape for healthcare LMs through a literature review and exploration of some technical applications.
+
Studies have shown that LMs can inadvertently memorise and disclose information verbatim from their training data when prompted in certain ways, a phenomenon referred to as training data leakage. This leakage can violate the privacy assumptions under which datasets were collected and can make diverse information more easily searchable.
+As LMs have grown, their ability to memorize training data has increased, leading to substantial privacy concerns. The amount of duplicated text in the training data also correlates with memorization in LMs. This is especially relevant in healthcare due to the highly duplicated text in Electronic Healthcare Records (EHRs).
+If LMs have been trained on private data and are subsequently accessible to users who lack direct access to the original training data, the model could leak this sensitive information. This is a concern even if the user has no malicious intent.
+A malicious user can stage a privacy attack on an LM to extract information about the training data purposely. Researchers can also use these attacks to measure memorization in LMs. There are several different attack types with distinct attacker objectives.
+One of the best-known attack types is the membership inference attack (MIA). MIAs determine whether a data point was included in the training data of the targeted model. Such attacks can result in various privacy breaches; for instance, discerning that a text sequence generated by a clinical LM (trained on EHRs) originates from the training data can disclose sensitive patient information.
+At the simplest level, MIAs use the confidence of the target model on a target data instance to predict membership. A threshold is set against the confidence of the model to ascertain membership status. For a specific example, if the confidence is greater than the threshold then the attacker assumes the target is a member of the training data, as the model is "unsurprised" to see this example, indicating it has likely seen this example before during training. Currently, the most successful MIAs use reference models. This refers to a second model trained on a dataset similar to the training data of the target model. The reference model filters out uninteresting common examples, which will also be "unsurprising" to the reference model.
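The two attack styles described above can be sketched in a few lines. This is a minimal illustration, not the attack code used in the project: the `confidence` measure (mean log-probability per token), the function names, the thresholds and the per-token probabilities are all assumptions chosen to show the idea.

```python
import math


def confidence(token_probs):
    """Mean log-probability the model assigns to the tokens of a sequence.

    Values closer to 0 mean the model is less "surprised" by the text.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def threshold_mia(target_probs, threshold):
    """Simplest attack: predict 'member' if target confidence exceeds a threshold."""
    return confidence(target_probs) > threshold


def reference_mia(target_probs, reference_probs, threshold):
    """Reference-calibrated attack: predict 'member' only if the target model is
    markedly more confident than a reference model trained on similar data."""
    return confidence(target_probs) - confidence(reference_probs) > threshold


# Hypothetical per-token probabilities (illustrative numbers, not real results).
examples = {
    "common phrase": {"target": [0.9, 0.9, 0.9], "reference": [0.9, 0.9, 0.9]},
    "memorised note": {"target": [0.8, 0.9, 0.85], "reference": [0.1, 0.2, 0.1]},
    "unseen note": {"target": [0.1, 0.2, 0.1], "reference": [0.1, 0.2, 0.1]},
}

for name, probs in examples.items():
    simple = threshold_mia(probs["target"], threshold=-0.5)
    calibrated = reference_mia(probs["target"], probs["reference"], threshold=0.5)
    print(f"{name}: simple={simple}, reference-calibrated={calibrated}")
```

In this toy setup the simple threshold flags the common phrase as a member because both models find it unsurprising, whereas the reference-calibrated score filters it out and keeps only the sequence that the target model alone is confident about.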
+There are three primary approaches to mitigate privacy risks in LMs:
+In this project, we sought to understand more about the Privacy-Risk Landscape for Healthcare LMs and conduct a practical investigation of some existing privacy attacks and defensive methods.
+Initially, we conducted a thorough literature search to understand the privacy risk landscape. Our first applied work package explored data deduplication before model training as a mitigation to reduce memorization and evaluated the approach with Membership Inference Attacks. We showed that RoBERTa models trained on patient notes are highly vulnerable to MIAs, even when only trained for a single epoch. We investigated data deduplication as a mitigation strategy but found that these models were just as vulnerable to MIAs. Further investigation of models trained for multiple epochs is needed to confirm these results. In the future, semantic deduplication could be a promising avenue for medical notes.
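As a rough illustration of deduplication before training, the sketch below drops notes that are exact duplicates after light normalisation. The granularity (whole notes), the normalisation rules and the function names are assumptions made for this example; the text above does not specify how the project deduplicated its data, and semantic deduplication would need a different approach.

```python
import hashlib
import re


def normalise(note):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", note).strip().lower()


def deduplicate(notes):
    """Keep the first occurrence of each note, comparing hashes of the normalised text."""
    seen = set()
    unique = []
    for note in notes:
        digest = hashlib.sha256(normalise(note).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(note)
    return unique


# Made-up notes for illustration only.
notes = [
    "Patient reviewed. No acute concerns.",
    "patient reviewed.   No acute concerns.",
    "Patient discharged with follow-up in six weeks.",
]
training_notes = deduplicate(notes)
```

Hashing keeps memory use low on large corpora, since only a fixed-size digest is stored for each distinct note.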
+Our second applied work package explored editing/unlearning approaches for healthcare LMs. Unlearning in LMs is poised to become increasingly relevant, especially in light of the growing awareness surrounding training data leakage and the 'Right to be Forgotten'. We found that many repositories for performing such approaches were not adapted for all LM types, and some are still not mature enough to be easy to use as packages. Exploring a Locate-then-Edit approach to Knowledge Neurons, we found this was not well suited to the erasure of information we needed in medical notes. Our findings suggest that the focus from a privacy perspective on these methods should be on those which allow the erasure of specific training data instances instead of relational facts.
+This work primarily explored privacy in pre-trained Masked Language Models. The growing adoption of generative LMs underscores the importance of expanding this work to Decoder and Encoder-Decoder models like the GPT family and T5. Also, due to the common practice of freezing parameters and tuning the last layer of an LM on a private dataset, it is critical to expand investigations of privacy risks to LMs fine-tuned on healthcare data.
+Within the scope of this exploration, the field of Machine Unlearning/Editing applied to LMs was in its infancy, but it is gaining momentum. As this field matures, comparing the efficacy of different methods becomes crucial. Furthermore, it is important to explore the effect of removing the influence of a set of data points. A holistic examination of the effectiveness, privacy implications, and broader impacts of Machine Unlearning/Editing methods on healthcare LMs is essential to inform the development of robust and privacy-conscious LMs in the NHS.
+Explainability of models often involves generating explanations or counterfactuals alongside the decisions made by the LM. However, integrating explanations into the output of LMs can introduce vulnerabilities related to training data leakage and privacy attacks. Additionally, efforts to enhance privacy, such as employing privacy-preserving training techniques, can inadvertently impact fairness, particularly in datasets lacking diversity. In healthcare, all three elements are paramount, so investigating the privacy-explainability-fairness trade-off is crucial for developing private, robust and ethically sound LMs.
+Finally, privacy concerns in several emerging trends for LMs need to be understood in Healthcare scenarios. Incorporating external Knowledge Bases to enhance LMs, known as retrieval augmentation, could make LMs more likely to leak private information. Further, Multimodal Large Language Models (MLLM), referring to LM-based models that can take in and reason over multimodal information common in healthcare, could be susceptible to leakage from one input modality through another output modality.
+ + + + + + + + + + + + + + + + + + +++ + +We have been building a proof-of-concept tool that scores the privacy risk of free text healthcare data. To use our tool effectively, users need a basic understanding of the entities within their dataset which may contribute to privacy risk.
+There are various tools for annotating and exploring free text data. The author explores some of these tools and discusses his experiences.
+
We have been building a proof-of-concept tool, called Privacy Fingerprint, that scores the privacy risk of free text healthcare data.
+Named Entity Recognition (NER) is a particularly important part of our pipeline. It is the task of identifying, categorizing and labelling specific pieces of information, known as entities, within a given piece of text. These entities can include the names of people, dates of birth, or even unique identifiers like NHS Numbers.
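To make the shape of NER output concrete, here is a deliberately simplified, rule-based stand-in. The models in the pipeline are learned models rather than regular expressions; the patterns, labels and example note below are hypothetical and only show what "identifying, categorising and labelling" spans looks like in practice.

```python
import re

# Hypothetical entity list: label -> pattern. A real pipeline would use a trained model.
ENTITY_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "NHS_NUMBER": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
}


def find_entities(text):
    """Return labelled spans with character offsets, ordered by position in the text."""
    entities = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "label": label,
                    "text": match.group(),
                }
            )
    return sorted(entities, key=lambda entity: entity["start"])


# Made-up note; the number is an obvious placeholder, not a real NHS Number.
note = "Patient seen on 03/02/2021. NHS Number: 123 456 7890."
found = find_entities(note)
```

Each result carries a label and start/end offsets, which is the information an annotation or scoring step needs downstream.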
+As of the time of writing, there are two NER models fully integrated within the Privacy Fingerprint pipeline used to identify entities which may contribute towards a privacy risk. These are:
+Both NER models in our pipeline need to be fed a list of entities to extract. This is true for many NER models, although some like Stanza from Stanford NLP Group and BERT token classifiers do not need an initial entity list for extraction. For our privacy tool to be effective, we want our list of entities to be representative of the real entities in the data, and not miss any important information.
+ +Let's consider a new user who wants to investigate the privacy risk of a large unstructured dataset. Maybe they want to use this data to train a new generative healthcare model and don’t want any identifiable information to leak into the training data. Or maybe this dataset is a large list of outputs from a similar model and they want to ensure that no identifiable information has found its way into the data. They may ask:
+What does my data look like?
+What entities within my data have a high privacy risk?
+Wait a second, what even is an entity?
+We want to offer an easy and interactive starting point for new users of our tool, where they can easily explore their data, understand the role of NER and identify what risks lie in their data. If they miss certain entities, this could have large implications on the scoring aspect of our pipeline.
+Of course, we want people to use our tool efficiently and effectively! So we asked:
+How can a new user efficiently explore their data to understand what entities exist within the data, and in particular, which entities may contribute to a privacy risk?
+Interactive annotation tools offer a solution to the above problem. If we can include a tool which allows a user to manually label their dataset, alongside live feedback from the NER model, it would allow a user to very quickly understand the entities in their data.
+Further to this, some NER models can be surprisingly affected by the wording of entities. The entity titled 'name' may extract both the name of an individual and the name of a hospital. The entity 'person' might only extract the name of the person. We have found that changing the entity 'person' to 'name' in UniversalNER reduced how often names were picked up by the model. If a user gets live feedback from a model whilst labelling, this will help them both fine-tune which entity names work best and pick out which entities to use at all.
+We want a tool that:
+There were two approaches we took to develop an annotation tool.
+First, we used displaCy, ipywidgets, and a NER model of choice to generate an interactive tool that works inside Jupyter notebooks. displaCy is a visualiser integrated into the spaCy library which allows you to easily visualise labels. Alongside ipywidgets, a tool that allows you to create interactive widgets such as buttons, we created an interface which allowed a user to go through reviews and add new entities.
+One of the main advantages of this method is that everything is inside a Jupyter notebook. The entity names you want to extract come straight from the experiment parameters, so if you used this in the same notebook as the rest of your pipeline the entity names could be updated automatically from the labelling tool. This would allow easy integration into a user workflow.
+There is also a button which allows for live feedback from the NER model, which is useful given our previous comment on different entity names having different effects on the NER model.
+This approach was simple and resulted in a fully working example. However, highlighting entities manually was not possible, and this meant it was hard to correct predictions that the model got wrong. You are fully reliant on the labels given by the model, and can't add your own.
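One way the visualisation step of such a notebook tool could be wired up is sketched below. It is not the project's code: the helper names, the example review and the spans are hypothetical. The only library behaviour relied on is displaCy's manual mode, which accepts a dictionary of `text`, `ents` and `title` instead of a spaCy `Doc`.

```python
def to_displacy_manual(text, entities, labels=None):
    """Convert (start, end, label) spans into displaCy's manual 'ent' input format.

    `labels`, if given, restricts the output to entity types the user wants to see.
    """
    ents = [
        {"start": start, "end": end, "label": label}
        for start, end, label in sorted(entities)
        if labels is None or label in labels
    ]
    return {"text": text, "ents": ents, "title": None}


def show(data):
    """Render highlighted entities in a notebook (requires spaCy to be installed)."""
    from spacy import displacy  # imported lazily so the helper above has no dependencies

    return displacy.render(data, style="ent", manual=True, jupyter=True)


# Hypothetical review text and spans, as a NER model might return them.
review = "Jane Doe attended clinic on 03/02/2021."
date_start = review.index("03/02/2021")
spans = [(date_start, date_start + len("03/02/2021"), "date"), (0, len("Jane Doe"), "person")]
data = to_displacy_manual(review, spans)
```

Calling `show(data)` from a button callback would redraw the highlighted text each time the entity list changes, which is one way to provide the live feedback described above.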
+We explored a second option using Streamlit. Streamlit is a Python framework that allows you to build simple web apps. We can use it alongside a package called Streamlit Annotation Tools to generate a more interactive user interface. For example, a user can now use their cursor to highlight particular words and assign them an entity type, which is more hands-on and engaging. Unlike our ipywidgets example, users can select which labels are displayed, which makes the tool less cluttered, and can easily navigate between reviews using a slider. Like the previous widget, there is a button which uses a NER model to label the text and give live feedback. Taken together, these features make the tool more cohesive, easier to use and more immersive than the ipywidgets alternative.
+However, there were still a few teething issues when developing the Streamlit app. Firstly, Streamlit Annotation Tools cannot display \n as a new line and instead prints \n literally, so the structure of the text is lost. This is a Streamlit issue and we haven’t yet found a way to keep the structure of the text intact. There was an easy fix in which each \n was replaced with two spaces (this means the start and end character count for each labelled entity remains consistent with the original structured text), but the structure of the text is still lost, which may cause issues for some future users.
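The offset-preserving fix can be shown with a short sketch. It assumes the text holds the literal two-character sequence backslash-n, which is why two spaces keep every character position unchanged; if the data contained real newline characters, a single space would be the length-preserving replacement instead. The function name and example text are illustrative.

```python
def flatten_newlines(text):
    """Replace each literal backslash-n sequence (two characters) with two spaces."""
    return text.replace("\\n", "  ")


# Made-up record containing the literal characters '\' and 'n'.
raw = "Name: Jane Doe\\nWard: 7B"
flat = flatten_newlines(raw)

# An entity span located in the original text still points at the same characters.
start = raw.index("Ward")
end = start + len("Ward")
```

Because the replacement has the same length as what it replaces, labels produced on the flattened text can be mapped straight back onto the original.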
Secondly, Streamlit involves a little more setup than ipywidgets. Rather than interacting with the reviews in your notebook, you run the app on a local port and access it through your browser. This also makes it harder to bring the list of entities you have labelled back into your pipeline. Whilst there is a benefit to running all your analysis in one Jupyter notebook, the Streamlit app gives a better user experience.
+Both labelling tools we have identified have key advantages. displaCy and ipywidgets fit well into your workflow, whilst Streamlit offers a nicer user experience. ipywidgets and Streamlit are both versatile tools, so users can edit the annotation tools in the future to fit their own use case.
+Following the research and development of these two tools, we believe the ability to interactively annotate, explore and extract entities from your data greatly improves the user experience when using our privacy risk scorer pipeline.
+We will publish working examples of annotation using both ipywidgets and Streamlit, such that a future user can build on them or use them to improve their workflow. The code is available on our GitHub.