-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
db78a47
commit f071170
Showing
1 changed file
with
346 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,346 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "view-in-github", | ||
"colab_type": "text" | ||
}, | ||
"source": [ | ||
"<a href=\"https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Introduction_Transformers_what_can_they_do.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "9l6f2ughnhK9" | ||
}, | ||
"source": [ | ||
"# Transformers, what can they do?\n", | ||
"\n", | ||
"This notebook has been provided by HuggingFace as part of the [learning material](https://huggingface.co/learn) and was adapted by Sarah Oberbichler for the DMGK NLP-Course at the University of Mainz." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "ar0vPO0pnhK_" | ||
}, | ||
"source": [ | ||
"Install the Transformers, Datasets, and Evaluate libraries to run this notebook." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "G6gZppnanhK_" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# @markdown ### Let's intall the requirements to run this notebook\n", | ||
"!pip install datasets evaluate transformers[sentencepiece]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"#Transformer Pipelines\n", | ||
"\n", | ||
"The most basic object in the Huggingface Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps. There are three main steps involved when you pass some text to a pipeline:\n", | ||
"\n", | ||
"* The text is preprocessed into a format the model can understand.\n", | ||
"* The preprocessed inputs are passed to the model.\n", | ||
"* The predictions of the model are post-processed, so you can make sense of\n", | ||
"\n", | ||
"\n", | ||
"\n" | ||
], | ||
"metadata": { | ||
"id": "br5cLOdsLlgZ" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# 1. Text Classification Pipelines" | ||
], | ||
"metadata": { | ||
"id": "XI9qskKoRK9A" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### 1.1 Sentiment Analysis Pipeline\n", | ||
"\n", | ||
"Sentiment analysis is an NLP technique that determines the emotional tone in text, classifying it as positive, negative, or neutral. It's used in various fields like customer feedback analysis, social media monitoring, and market research. Methods range from rule-based approaches using sentiment lexicons to machine learning models trained on labeled data. Sentiment analysis helps businesses and researchers understand attitudes in large volumes of text, though it faces challenges with sarcasm, context-dependent expressions, and domain-specific language. Advanced applications include aspect-based analysis and emotion detection. The accuracy of sentiment analysis systems is typically evaluated against human-labeled datasets." | ||
], | ||
"metadata": { | ||
"id": "x1MpLdXBQLH1" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "q9uYRXKRnhLA" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# @markdown #### **Exercise 1**: Search for the most recent news and copy one sentence or paragraph. Add it instead if \"Today is a gread day.\" Do you agree with the automated classification? Is there a model that we can use with German text? Find one model and adapt the code.\n", | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"classifier = pipeline(\"sentiment-analysis\")\n", | ||
"classifier(\"Today is a gread day.\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "VBm5ydoQnhLB" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"classifier(\n", | ||
" [\"I've been waiting for a HuggingFace course my whole life.\", \"I hate this so much!\"]\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### 2.1 Zero-Shot Classification\n", | ||
"\n", | ||
"Zero-shot classification is a machine learning technique where a model can categorize or classify items into classes it hasn't been explicitly trained on. It leverages the model's understanding of language or concepts to make predictions about new, unseen categories." | ||
], | ||
"metadata": { | ||
"id": "vG5oR_zaREFv" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "QYjwIu7pnhLC" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# @markdown #### **Exercise 2**: Change the candidate_labels into \"technical\", \"non-technical\", see what happens when we delete the part: \"in the humanities\".\n", | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"classifier = pipeline(\"zero-shot-classification\")\n", | ||
"classifier(\n", | ||
" \"This is a course about Natural Language Processing in the humanities\",\n", | ||
" candidate_labels=['technical', 'non-technical'],\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# 2. Text Generation Pipeline\n", | ||
"\n", | ||
"Text generation with transformers leverages pre-trained models and deep learning frameworks to produce human-like text. The process typically involves:\n", | ||
"\n", | ||
"\n", | ||
"* Next-token prediction: The basic mechanism where the model predicts the most likely next word or token.\n", | ||
"* Advanced generation techniques: Including Greedy Search, Beam Search, and Top-K sampling, which offer different trade-offs between text quality and diversity.\n", | ||
"* Prompt engineering: Carefully designed inputs can elicit specific types of responses, revealing the model's learned patterns and biases.\n", | ||
"* Model insights: Generated text can provide glimpses into the norms, biases, and knowledge encoded in the model's training data.\n", | ||
"* Temporal context: The outputs reflect not just the training data, but also the historical and cultural context of when the data was collected and the model trained.\n", | ||
"* Accessibility: With user-friendly interfaces like chatbots, text generation often serves as many people's first interaction with large language models.\n", | ||
"* Observable inference: Text generation allows users to witness the model's decision-making process in real-time, offering a tangible way to understand how these complex systems operate.\n", | ||
"\n", | ||
"\n" | ||
], | ||
"metadata": { | ||
"id": "CapJVFq1-Rhc" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "MCFvQXnenhLC" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# @markdown #### **Exercise 3**: Try different phrases, did you get an answer that was satisfying?\n", | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"generator = pipeline(\"text-generation\")\n", | ||
"generator(\"Transformer Models are\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "rY-je7MvnhLD" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"generator = pipeline(\"text-generation\", model=\"distilgpt2\")\n", | ||
"generator(\n", | ||
" \"Transformer Models are\",\n", | ||
" max_length=30,\n", | ||
" num_return_sequences=2,\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "pdzWVjBOnhLD" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"unmasker = pipeline(\"fill-mask\")\n", | ||
"unmasker(\"This course will teach you all about <mask> models.\", top_k=2)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# 3. Text Extraction Pipeline" | ||
], | ||
"metadata": { | ||
"id": "9ekoxTyHmLOr" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### 3.1 Named Entity Recognition\n", | ||
"\n", | ||
"Named Entity Recognition (NER) is a natural language processing task that identifies and classifies named entities in text. It locates and categorizes proper nouns into predefined classes such as person names, organizations, locations, dates, monetary values, and product names. NER is widely used in various applications, including information extraction from unstructured text, improving search engine results, customer service automation, content recommendation, social media analysis, medical record processing, legal document analysis, machine translation enhancement, question answering systems, and automatic text summarization. By helping machines recognize and categorize important elements within text, NER enables more human-like understanding and processing of language, making it a crucial component in many language understanding and information extraction tasks across different domains.\n" | ||
], | ||
"metadata": { | ||
"id": "pKpvWJamXpzI" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "P51VHzsenhLE" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"ner = pipeline(\"ner\", grouped_entities=True)\n", | ||
"ner(\"My name is Hans. Yesterday, I visited the office of my girlfriend, who works at Sparkasse. Her office is not far from mine at the University, which is next to Sparkasse.\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### 3.2 Question-Answering\n", | ||
"\n", | ||
"Question answering with transformers is a natural language processing task that uses transformer-based models to automatically generate answers to questions based on a given context or knowledge base. This approach leverages the powerful language understanding and generation capabilities of transformer models like BERT, RoBERTa, or T5." | ||
], | ||
"metadata": { | ||
"id": "RLm10gsfYzxJ" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "uWEXfcEynhLE" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# @markdown #### **Exercise 4**: Try a different model. For example the \"bert-large-uncased-whole-word-masking-finetuned-squad\" model. Does it change the answer?\n", | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"question_answerer = pipeline(\"question-answering\", model=\"deepset/roberta-base-squad2\")\n", | ||
"question_answerer(\n", | ||
" question=\"Where do I work?\",\n", | ||
" context=\"My name is Hans. Yesterday, I visited the office of my girlfriend, who works at Sparkasse. Her office is not far from mine at the University, which is next to Sparkasse\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# 4. Summarization and Translation" | ||
], | ||
"metadata": { | ||
"id": "hWsXTnTHpNf2" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "ANYmaR7mnhLE" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"summarizer = pipeline(\"summarization\")\n", | ||
"summarizer(\n", | ||
" \"\"\"\n", | ||
" America has changed dramatically during recent years. Not only has the number of\n", | ||
" graduates in traditional engineering disciplines such as mechanical, civil,\n", | ||
" electrical, chemical, and aeronautical engineering declined, but in most of\n", | ||
" the premier American universities engineering curricula now concentrate on\n", | ||
" and encourage largely the study of engineering science. As a result, there\n", | ||
" are declining offerings in engineering subjects dealing with infrastructure,\n", | ||
" the environment, and related issues, and greater concentration on high\n", | ||
" technology subjects, largely supporting increasingly complex scientific\n", | ||
" developments. While the latter is important, it should not be at the expense\n", | ||
" of more traditional engineering.\n", | ||
"\n", | ||
" Rapidly developing economies such as China and India, as well as other\n", | ||
" industrial countries in Europe and Asia, continue to encourage and advance\n", | ||
" the teaching of engineering. Both China and India, respectively, graduate\n", | ||
" six and eight times as many traditional engineers as does the United States.\n", | ||
" Other industrial countries at minimum maintain their output, while America\n", | ||
" suffers an increasingly serious decline in the number of engineering graduates\n", | ||
" and a lack of well-educated engineers.\n", | ||
"\"\"\"\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "iT1OvS1JnhLF" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import pipeline\n", | ||
"\n", | ||
"translator = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-fr-en\")\n", | ||
"translator(\"Ce cours est produit par Hugging Face.\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"provenance": [], | ||
"include_colab_link": true | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
}, | ||
"kernelspec": { | ||
"name": "python3", | ||
"display_name": "Python 3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |