Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley).

Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html
Notebook | Description
---|---
1.words/EvaluateTokenizationForSentiment | Explore the impact of tokenization choices on sentiment classification
1.words/ExploreTokenization | Explore different methods for tokenizing texts (whitespace, NLTK, spaCy, regex)
1.words/TokenizePrintedBooks | Design a better tokenizer for printed books |
1.words/Text_Complexity | Implement type-token ratio and Flesch-Kincaid Grade Level scores for text |
2.compare/ChiSquare, Mann-Whitney Tests | Explore two tests for finding distinctive terms |
2.compare/Log-odds ratio with priors | Implement the log-odds ratio with an informative (and uninformative) Dirichlet prior |
3.dictionaries/DictionaryTimeSeries | Plot sentiment over time using human-defined dictionaries |
3.dictionaries/Empath | Explore using Empath dictionaries to characterize texts |
4.embeddings/DistributionalSimilarity | Explore the distributional hypothesis by building high-dimensional, sparse representations for words
4.embeddings/WordEmbeddings | Explore word embeddings using Gensim |
4.embeddings/Semaxis | Implement SemAxis for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold)
4.embeddings/BERT | Explore the basics of token representations in BERT and use it to find token nearest neighbors |
4.embeddings/SequenceEmbeddings | Use sequence embeddings to find TV episode summaries most similar to a short description
5.eda/WordSenseClustering | Infer distinct word senses using KMeans clustering over BERT representations
5.eda/Haiku KMeans | Explore the impact of text representation on clustering by grouping haiku and non-haiku poems into two distinct clusters
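As a taste of the text-complexity metrics in 1.words/Text_Complexity, here is a minimal sketch of the type-token ratio and the Flesch-Kincaid Grade Level. The syllable counter is a rough vowel-group heuristic of our own, not necessarily the one the notebook uses:

```python
import re

def type_token_ratio(tokens):
    # TTR = number of distinct types / number of tokens
    return len(set(tokens)) / len(tokens)

def count_syllables(word):
    # Rough heuristic (an assumption, not a linguistic gold standard):
    # count maximal runs of vowels; every word has at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(sentences):
    # sentences: a list of sentences, each a list of word tokens
    n_sents = len(sentences)
    words = [w for sent in sentences for w in sent]
    n_words = len(words)
    n_syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid Grade Level formula
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59
```

Note that TTR is sensitive to text length (longer texts repeat more types), so it is only comparable across texts of similar size.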
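The log-odds ratio covered in 2.compare follows Monroe et al. (2008); a minimal sketch of the z-scored version with a Dirichlet prior is below. The function name and dict-based interface are our own; the prior can be counts from a large background corpus (informative) or uniform pseudocounts (uninformative):

```python
import math

def log_odds_dirichlet(counts_a, counts_b, prior):
    """z-scored log-odds ratio with a Dirichlet prior (Monroe et al. 2008).

    counts_a, counts_b: dicts mapping word -> count in each corpus.
    prior: dict mapping word -> pseudocount (the unnormalized Dirichlet prior).
    Returns a dict mapping word -> z-score; positive favors corpus A.
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    alpha0 = sum(prior.values())
    z = {}
    for w, a_w in prior.items():
        y_a = counts_a.get(w, 0)
        y_b = counts_b.get(w, 0)
        # smoothed log-odds of w in each corpus
        l_a = math.log((y_a + a_w) / (n_a + alpha0 - y_a - a_w))
        l_b = math.log((y_b + a_w) / (n_b + alpha0 - y_b - a_w))
        delta = l_a - l_b
        # approximate variance of the log-odds difference
        var = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        z[w] = delta / math.sqrt(var)
    return z
```

The z-scoring is what distinguishes this from a raw log-odds ratio: rare words get large variance estimates, so they are downweighted rather than dominating the ranking.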
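The core of SemAxis (4.embeddings/Semaxis) fits in a few lines: build an axis vector as the difference between the mean embeddings of two antonymous pole sets, then score each word by cosine similarity to that axis. A minimal numpy sketch, with an illustrative function name of our own:

```python
import numpy as np

def semaxis_score(word_vec, pos_vecs, neg_vecs):
    """Score a word along a semantic axis (SemAxis).

    word_vec: embedding of the word to score.
    pos_vecs, neg_vecs: lists of embeddings for the positive and
    negative pole words (e.g., {"good", ...} vs. {"bad", ...}).
    Returns cosine similarity in [-1, 1]; positive leans toward the positive pole.
    """
    # axis = mean of positive-pole vectors minus mean of negative-pole vectors
    axis = np.mean(pos_vecs, axis=0) - np.mean(neg_vecs, axis=0)
    word_vec = np.asarray(word_vec, dtype=float)
    return float(np.dot(word_vec, axis) /
                 (np.linalg.norm(word_vec) * np.linalg.norm(axis)))
```

In practice the pole vectors would come from a pretrained embedding model (e.g., loaded with Gensim, as in 4.embeddings/WordEmbeddings); the toy vectors below are only for illustration.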