This course will allow the student to develop a general understanding of knowledge discovery and gain a specific understanding of text mining. Students will become familiar with both the theoretical and practical aspects of text mining and develop a proficiency with data modeling.
Production and consumption of information have changed drastically. Today, we inherently expect websites and apps to understand what we need and provide the appropriate service. At the same time, there is a growing call for transparency about what such services provide. This course will give you theoretical and practical knowledge of how to tackle such problems. You will learn how to parse text data to gain insights, and also why certain algorithms work the way they do.
The course is divided into three modules: basics, principles, and applications (see details below). The third part of the course will focus on several applications of text mining: methods for automatically organizing textual documents for sense-making and navigation (clustering and classification), methods for detecting opinion and bias, methods for detecting and resolving specific entities in text (information extraction and resolution), and methods for learning new relations between entities (relation extraction). Throughout the course, a strong emphasis will be placed on evaluation. Students will develop a deep understanding of one particular method through a course project.
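To give a flavor of the classification methods covered in the third module, here is a minimal Naïve Bayes text classifier in pure Python, trained on a toy sentiment dataset. This is an illustrative sketch only; the data and function names are invented, not course material.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model on (text, label) pairs."""
    label_counts = Counter(label for _, label in docs)  # class frequencies (priors)
    word_counts = defaultdict(Counter)                  # per-class word frequencies
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify(text, label_counts, word_counts, vocab):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    total_docs = sum(label_counts.values())
    best_label, best_logprob = None, float("-inf")
    for label in label_counts:
        logprob = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Counter returns 0 for unseen words; smoothing keeps the probability nonzero.
            logprob += math.log((word_counts[label][word] + 1) / denom)
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label

# Toy example: two-class sentiment classification
docs = [("great movie loved it", "pos"), ("loved great acting", "pos"),
        ("terrible movie hated it", "neg"), ("hated terrible plot", "neg")]
model = train_nb(docs)
print(classify("loved the movie", *model))  # -> pos
```

The same bag-of-words representation and evaluation concerns (smoothing, unseen vocabulary) come up throughout the course, whether you use a toolkit such as Weka or LightSIDE or write code yourself.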
Students are expected to have experience with a programming language such as C/C++, Java, Python, R, SAS, or Matlab. Please see me if you are not proficient in programming; we can discuss ways to accommodate you. For example, tools such as Weka, Excel, and Tableau support text analysis without any programming background.
Time and location
[ MoTuTh 3:15PM - 5:50PM ]
Data Mining: Practical Machine Learning Tools and Techniques (Fourth Edition). Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Morgan Kaufmann. ISBN 978-0128042915. Available online.
Foundations of Statistical Natural Language Processing. C. Manning and H. Schütze. 1999.
Introduction to Information Retrieval. C. Manning, P. Raghavan, and H. Schütze. 2008.
Students are expected to be present for every class. The in-class discussions will help you with exam questions, and I don’t want you to miss them. Of course, if you have to miss 2-3 classes due to unavoidable circumstances such as travel or sickness, that is fine. If you need to miss class, please notify me beforehand.
10% of your course credit will be for participation, so please make sure that you regularly participate. Also, this is a practical course and you learn more by participating!
Please approach me if all collaboration fails!
Plagiarism and cheating
I expect every student to honor UNC’s Honor Code. In the event of a violation, the student will be punished according to the University guidelines.
Assignments are due by 11:59 pm local time. A late assignment will be penalized 10% of the total assignment score for each day it is late. If the assignment has still not been submitted after 5 days, the student will receive an automatic 0. That said, I do realize that a submission may sometimes be late due to unforeseeable circumstances. Please contact me in that case; maybe I can be of some help!
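To make the arithmetic of the late policy concrete, here is a hypothetical helper (my own sketch, not part of any course tooling):

```python
def late_score(earned, max_score, days_late):
    """Apply the late policy: subtract 10% of the total assignment score
    per day late; an automatic 0 once more than 5 days late."""
    if days_late <= 0:
        return earned
    if days_late > 5:
        return 0.0
    return max(0.0, earned - 0.10 * max_score * days_late)

print(late_score(90, 100, 2))  # a 90/100 submitted 2 days late -> 70.0
```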
Undergraduate students: A+ 97-100%, A 94-96%, A- 90-93%, B+ 87-89%, B 84-86%, B- 80-83%, C+ 77-79%, C 74-76%, C- 70-73%, D+ 67-69%, D 64-66%, D- 60-63%, F 0-59%.
Graduate students: H 95-100%, P 80-94%, L 60-79%, and F 0-59%.
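The cutoffs above can be encoded as a simple lookup. The names here are mine, for illustration only:

```python
# Lower cutoff (in percent) for each letter grade, highest first.
UNDERGRAD = [(97, "A+"), (94, "A"), (90, "A-"), (87, "B+"), (84, "B"), (80, "B-"),
             (77, "C+"), (74, "C"), (70, "C-"), (67, "D+"), (64, "D"), (60, "D-"), (0, "F")]
GRAD = [(95, "H"), (80, "P"), (60, "L"), (0, "F")]

def letter_grade(percent, scale):
    """Return the first letter whose lower cutoff the percentage meets."""
    for cutoff, letter in scale:
        if percent >= cutoff:
            return letter
    return "F"

print(letter_grade(85, UNDERGRAD))  # -> B
print(letter_grade(94, GRAD))       # -> P (just below the 95% Honors cutoff)
```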
I would like to thank Dr. Jaime Arguello for allowing me to borrow his course material.
In this course, you will work on three homework assignments (based on lectures), one midterm, and one final project.
Your final project will have three deliverables: proposal, report, and presentation. My advice: Start early, get familiar with the research questions, data, and tools you plan to use. I highly encourage you to take this course as an opportunity to learn a new ML toolkit. For more information on the deliverables please check here.
|1||5/16||Introduction and course outline||WFHP Ch. 1, Mitchell '06, Hearst '99|
|2||5/20||Predictive analysis||WFHP Ch. 2, Domingos '12|
|3||5/21||Text Representation||HW1 out||WFHP Ch. 4.2|
|4||5/23||Machine Learning: Linear Classifiers and Naïve Bayes||Mitchell Sections 1 and 2; Andrew Ng's notes|
|6||5/28||Machine Learning: Linear Classifier + ML Toolkits||HW1 Due.|
|7||5/30||LightSIDE (slides); Machine Learning: Instance Based Classification + Review and HW discussion.||HW2 out.||WFH Ch. 4.7|
|9||6/4||ML-Toolkits tutorial+ Mid Term Review.|
|10||6/6||Predictive analysis: Experimentation and evaluation - Part I||Proposal due||WFHP Ch. 5|
|11||6/10||Predictive analysis: Experimentation and evaluation - Part II||HW2 due.||Smucker et al. '07, Cross-Validation, Parameter tuning and overfitting|
|12||6/11||Machine Learning: Clustering; LDA[Code]||HW3 Out||Manning Ch.16|
|13||6/13||SVMs and CRFs|
|14||6/17||Applied Machine Learning: Sentiment analysis, Viewpoint detection, Text-based forecasting.||HW 3 due.|
|15||6/18||Class Presentations - I|
|16||6/18||Class Presentations - II; Final Project submission|