Data Science on Unstructured Text Data
Text is ubiquitous. Humans have been storing information in written form for over 5000 years, yet for much of that time the content of that text has defied principled quantitative analysis. Skills and techniques for deriving sense and sentiment from text enable exploratory analysis and modeling of human communication. In this course we will use R packages such as ggraph, tidytext, dplyr and topicmodels to manipulate and understand a number of document collections. We will also learn to import textual data into R using twitteR, docxtractr and other packages.
By successfully completing the course, students will be able to:
● Transform text into the tidytext format for NLP
● Extract emotion and tone from text using sentiment analysis
● Understand what makes a document unique in a collection
● Understand how words and tokens relate to one another, and visualize them
● Import textual data into R and export it
● Classify documents into groups using topic modeling
● Build models which take as input textual features
● 3 Daily Quizzes (30%)
● End of Course Assignment (60%)
● Intellectual Presence (10%)
To be counted as intellectually present, you must be engaged in all classroom activities. Intellectual absence (including reading non-course-related material, playing or texting on a phone, or using a laptop for non-class-related activities) will be counted as an absence. Students who anticipate needing to be absent should be aware that this course is very compressed, and any absence will make it very challenging to complete.
Data Analysis 1a, Data Analysis 2, Data Analysis 3