-
Project 1: Document Clustering by Dan Zheng
- Strength: has a lot of code
- Possible improvement: need more code
- One thing I learned: providing requirements.txt along with installation instructions for future guests was a great idea. It streamlines pip installation.
-
Project 2: some other project
-
Project 1: 2016 Election Project
- Strength: Clean code, lots of good explanations (also super detailed and organized analysis)
- Improvement: Term "professional" unclear in presentation until the end, would have liked more of the "process" in the presentation
- Thing I learned: How to modify NER trees
-
Project 2: Dogs vs. Cat Neologisms on Twitter
- Strength: Really interesting concept, fun presentation, clear code
- Improvement: Incorporate a way to read images (lots of memes out there). The project is also limited in which terms it uses.
- Thing I learned: More methods of handling Tweepy and DFs!
- Project 1: Dogs vs Cats on Twitter by Margaret Jones
- Strength: The LIVE_data_collection jupyter notebook file is a great way for people looking at her repo to get a good, hands-on way to understand her project by letting them try out code themselves.
- Improvement: Counting retweets and favorites seems more like a general statistical analysis than a linguistic analysis. Is there a way to focus more on the neologism trends? Also, since data was collected over time, and trends varied based on when data was collected, it would be interesting to see a time plot of retweet and favorite trends, instead of just individual pie charts and bar graphs.
- Thing I learned: Twitter can be really restrictive on scraping data from tweets and publishing data! Which tweets Twitter gives you access to is an important consideration when doing any kind of Twitter analysis.
- Project 2: NYT Sentiment Analysis by Christopher Lagunilla
- Strength: Really well organized/navigable! They included and clearly labeled old/abandoned code even though their final project ended up going a different direction. Past/failed attempts can be just as informative as successes! This lets visitors to the repo see what not to do, or lets them attempt an approach Christopher decided not to take.
- Improvement: The project could discuss a bit more about how VADER works, to get a better idea of what the program is looking for when classifying sentiment.
- Thing I learned: How to unpack XML files and use regular expressions with glob to find specific tags. Also, this introduced me to the sentiment analyzer VADER!
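
For anyone curious what the glob-plus-regex idea can look like, here is a minimal sketch. It is not the project's actual code: the function name `collect_tag_text`, the tag names, and the two tiny XML files are all made up for illustration, and it uses only the Python standard library.

```python
import glob
import os
import re
import tempfile
import xml.etree.ElementTree as ET


def collect_tag_text(directory, tag_pattern):
    """Return {filename: [text, ...]} for elements whose tag matches tag_pattern."""
    tag_re = re.compile(tag_pattern)
    results = {}
    # glob expands the wildcard into the matching file paths
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        root = ET.parse(path).getroot()
        texts = [
            (elem.text or "").strip()
            for elem in root.iter()          # walks elements in document order
            if tag_re.fullmatch(elem.tag)    # the regex must match the whole tag name
        ]
        results[os.path.basename(path)] = texts
    return results


def demo():
    # Build two tiny made-up XML files in a temporary directory
    with tempfile.TemporaryDirectory() as tmp:
        samples = {
            "a.xml": "<doc><headline>First</headline><p>Body one.</p></doc>",
            "b.xml": "<doc><headline>Second</headline><p>Body two.</p><p>More.</p></doc>",
        }
        for name, xml_text in samples.items():
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
                f.write(xml_text)
        return collect_tag_text(tmp, r"headline|p")


if __name__ == "__main__":
    print(demo())
```

Matching the regex against tag names (rather than against the raw file text) lets the XML parser deal with nesting and escaping, which is usually less fragile.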
-
Project 1: Dog vs. Cat Neologisms on Twitter
- What was done well? This is such a fun idea! I bet the terms are even more popular now than when this project was done, it'd be interesting to take another look. Also, a lot of comments! This is helpful when reading through the code.
- What could be improved? I think it's hard to compute things with tweepy since it is so limited. From my use of Twitter, I know favorites are more popular than retweets, but from this project, it was concluded there were more retweets. This intuitively doesn't make sense to me, so I think there is something more going on here.
- What did I learn? There's a lot of numerical analysis here! This can be helpful if I need a refresher on creating graphs and other sorts of numerical data.
-
Project 2: Discourse Analysis of the Australian Radio Talkback
- What was done well? Very clean code and pretty and intuitive graphs! I liked how everything was labeled so I didn't have to struggle to read through the code and figure out what each output corresponded to. This analysis seems very thorough.
- What could be improved? When comparing groups of people, use statistical tests so we can see if the groups are actually significantly different. It would also be cool to compare the findings to an American talk show!
- What did I learn? Australian talk shows are predominantly male, or at least the ones from this data sample. I also learned what back channels are, and thought it was really interesting how men produce more back channels when listening to men talk, and women produce more back channels when listening to women talk.
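
To make the "use statistical tests" suggestion concrete, here is a rough sketch of one option, a two-sided permutation test on the difference in group means. This is only an illustration under my own assumptions: the function name `permutation_test` is hypothetical, the numbers in the usage lines are made-up placeholders (not values from the project), and other tests (for example a chi-square test) might suit the corpus better.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: under the null hypothesis group membership is arbitrary
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Placeholder counts per speaker, invented purely for illustration
    group_one = [1, 2, 3, 4, 5]
    group_two = [10, 11, 12, 13, 14]
    print(permutation_test(group_one, group_two))
```

A small p-value would suggest the two groups really do differ, while a large one would mean the gap in the graphs could easily be noise.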
- Project 1: NYT Sentiment Analysis
- Strength: very well organized, nice visualizations
- Improvement: They said that only half was done, so maybe they could come back to this and finish it.
- Thing I learned: I learned how to process XML files.
- Project 2: Bigram Analysis of Writing from the ELI
- Strength: A lot of data is handled.
- Improvement: I would have changed the organization of the data, because it took me a while to figure out what the question being answered was.
- Thing I learned: I learned how to use glob.
-
Project 1: Document_Clustering by Dan Zheng
- Strength: Handles really huge data (12GB). Great idea to apply clustering methods to classifying wiki data. Clear explanation.
- Improvement: For Hierarchical Clustering, it is hard to read the result. And the result seems to contain a lot of words beginning with 'a'.
- Thing I learned: Real implementations of TF-IDF vectors, K-Means, and t-SNE. Clear coding style.
-
Project 2: Native_and_Non-native_English by Katherine Kairis
- Strength: Clear explanation in the final report. Clear comparison.
- Improvement: The chart in the 'trigram' part is not easy to understand.
- Thing I learned: A lot of ideas for comparing native and non-native English.
- Project 1: Document_Clustering by Dan Zheng
- Strength: Very clear exposition in presentation slides.
- Improvement: It is unclear whether either single-layer or hierarchical clustering can provide the most useful perspective on the data. For example, a human might think that a resource (bookmark) is associated meaningfully, even if not equally, with more than one topic, and might want it to be in more than one category simultaneously, where the categories may not form a hierarchy. The two models explored here don’t seem designed to assign the same thing to more than one cluster other than hierarchically. Are there alternative models?
- Thing I learned: How to approach the same core task with K-Means and Hierarchical clustering, and what the code looks like to perform the clustering.
- Project 2: 2016 Election Project by Paige Haring
- Strength: Very clear exposition in notebooks.
- Improvement: The summary at the end of the slides is very clear, and is supported by the bar charts that precede, but could the researcher take the investigation one step further by exploring relationships among these (now interim) conclusions? For example, is there a statistically meaningful relationship between how Trump refers to Clinton and how she refers to him? And insofar as Senator or Secretary are titles, and not just professional references (unlike Businessman), does the fact that Trump doesn’t really have a title influence how we should interpret the distributions? The study does not appear to quantify this sort of information; could doing so lead to a next level of discourse and rhetorical analysis?
- Thing I learned: Possibly already familiar to everyone else in our course (I’m not much of a syntactician), but I didn’t know how to incorporate and render syntax trees in a notebook until I saw it demonstrated here.
-
Project 1: NYT_Figures_Sentiment_Analysis by Christopher Lagunilla
- Strengths: Super well organized, visualizations are easy to understand and aesthetically pleasing. Sentiment analysis includes four categories: "positive", "negative", "neutral", and "composite".
- Improvements: Maybe try the scikit-learn NB classifier instead of NLTK? Also try to clean up the tokens a little bit more for the sentiment classification and training -- some of the features still included punctuation, which may have an effect on the sentiment classification. It also may have been interesting to look at coverage of one or both of the persons of interest (e.g. Barack Obama) over time, to see how sentiment changed over a longer period.
- What I Learned: How to read in and process XML files -- also learned about a sentiment analyzer (VADER).
-
Project 2: 2016 Election Project by Paige Haring
- Strengths: Organization of data is great, thorough explanations of code and thought process throughout the project.
- Improvements: It would have been interesting to see how the moderators refer to the candidates. Perhaps beyond the scope of the project at the time, but it would also be interesting to see how the media referred to candidates, or how candidates referred to each other outside of debates (like in interviews).
- What I Learned: Entity naming
-
Project 1: NYT_Figures_Sentiment Analysis
- Strengths: Jupyter notebooks are well organized. Good visualizations of the data in the presentation.
- Improvements: Could probably use more time to process data (the presentation said it was half done), and maybe more data samples for people, if possible within licensing. Also, a sample covering a longer time period would make the results more impactful, though I understand that would increase the processing time.
- What I learned: NLTK package has the code to run VADER, a sentiment analysis tool which was used in this project.
-
Project 2: Project Corbett
- Strengths: Lots of data to work with, gives appropriate data samples.
- Improvements: Could use a little more organization of the repository (Jupyter notebook files in a separate repository, for example). Presentation feels incomplete. Could also use more markdown to explain and organize the Jupyter notebook files.
- What I learned: A vague idea of how to incorporate Count Vectorizer into my own project, though nothing solid yet. Also, how to present replicable data processing when your corpus is too large to fit onto GitHub.
-
Project 1: Analyzing the Australian Radio Talkback Corpus by Speaker Role and Gender by Alicia
- Strengths: Code is very well organized - I like the tables of contents! The graphs are nice and easy to read. I like how this project connects to outside research.
- Improvements: I think more could have been done with most common male and female back channels - since the most frequent are pretty much the same across the board, maybe look at which words are only said by women or only said by men.
- Thing I learned: Australian men on radio shows are more likely to reply when speaking to other men, and women are slightly more likely to reply to other women. A lot of new things you can do with matplotlib.
-
Project 2: Analysis of bigrams from learners' written work at the Pitt English Language Institute (ELI) by Ben Naismith
- Strengths: I like the powerpoint! Having the data in a CSV is useful for anyone who would like to replicate the experiment. I like the idea of a visitor's log - I guess everyone had to do this, but Ben specifically linked to his in the readme.
- Improvements: Graph names could be more informative. Not super organized. I was confused about where to go first.
- Thing I learned: Most students spend 2 semesters studying at the ELI. I also learned about mutual information (MI), a measure of two-way collocation.
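
Since MI was new to me, here is a small sketch of how a bigram score of this kind can be computed. I am assuming the pointwise variant, log2 of observed over expected co-occurrence, which is what collocation work usually means by "MI"; whether the project computed it exactly this way is an assumption on my part, and the function name `bigram_pmi` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): pointwise mutual information in bits} for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi            # observed probability of the pair
        p_w1 = unigrams[w1] / n_uni      # probability of each word on its own
        p_w2 = unigrams[w2] / n_uni
        # Positive score: the pair occurs more often than chance would predict
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    toy = "strong tea and strong tea and weak tea".split()
    for pair, score in sorted(bigram_pmi(toy).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

On real learner data the raw score tends to favor rare pairs, so a minimum frequency cutoff is a common companion to it.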
- Project 1: Bigram Analysis of writing from ELI by Ben Naismith
- Strengths: Well organized, analysis is thorough and lucid, lots of graphics that make it easier to understand.
- Improvements: I don't know if this is Ben's fault but (at least on my computer) the formatting for the markdown isn't working properly. Headers written with hashtags don't show up at all, links to things within the page don't work, some of the lists aren't working properly, etc.
- What I learned: Try to use as many images as possible so data can be understood more easily.
- Project 2: Project Corbett by Robert Corbett
- Strengths: The analysis is well done, and the topic and results are fairly interesting. Robert's code is also pretty thoroughly commented.
- Improvement: His presentation is really poorly made: it uses black text on a background that is partly black, and it has typos. Did he not finish it? Is he storing all of the text for each subreddit in a single file, along with all of the JSON information for every single comment? I didn't read through all his code, but why would he ever need to use, say, the author or his flair? Does he even need to retain any information other than text and score? It seems like he probably could have just kept those two and reduced the memory and storage used.
- What I learned: Make sure my powerpoint is readable. Make sure I am storing my data efficiently and appropriately for my goals. Make sure to use comments (or markdown blocks in jupyter notebook) so my code is clear.
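
As a sketch of the "keep only text and score" idea, something like the following would slim each comment down before saving it. This is my own illustration, not Robert's code: the function name `slim_comments` is hypothetical, and the field names `body` and `score` are assumptions about how the comment JSON is keyed.

```python
import json


def slim_comments(json_lines, keep=("body", "score")):
    """Yield one compact JSON string per comment, keeping only the `keep` fields."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        comment = json.loads(line)
        # Drop everything (author, flair, ids, ...) except the requested fields
        yield json.dumps({key: comment[key] for key in keep if key in comment})


if __name__ == "__main__":
    raw = [
        '{"author": "u1", "author_flair_text": null, "body": "hello", "score": 3}',
        '{"body": "bye", "score": -1, "id": "x"}',
    ]
    for slim in slim_comments(raw):
        print(slim)
```

Because it is a generator, it can process a large file one line at a time instead of loading every comment into memory.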