Course Objective
This course will allow the student to develop a general understanding of knowledge discovery and gain a specific understanding of text mining. Students will become familiar with both the theoretical and practical aspects of text mining and develop a proficiency with data modeling.
Description
The production and consumption of information have changed drastically. Today, we expect websites and apps to understand what we need and provide the appropriate service. At the same time, there is a growing call for better transparency about what such services do. This course will give you the theoretical and practical knowledge to tackle such problems. You will learn how to parse text data to gain insights, and also learn why certain algorithms work the way they do.
The course is divided into three modules: basics, principles, and applications (see details below). The third part of the course will focus on several applications of text mining: methods for automatically organizing textual documents for sense-making and navigation (clustering and classification), methods for detecting opinion and bias, methods for detecting and resolving specific entities in text (information extraction and resolution), and methods for learning new relations between entities (relation extraction). Throughout the course, a strong emphasis will be placed on evaluation. Students will develop a deep understanding of one particular method through a course project.
Prerequisites
Students are expected to have experience with at least one programming language, such as C/C++, Java, Python, R, SAS, or MATLAB. Please see me if you are not proficient in programming so that we can discuss ways to accommodate you. For example, tools such as Weka, Excel, and Tableau support text analysis without any programming background.
Time and location
[ MoTuTh 3:15PM - 5:50PM ]
Office Hours
By appointment
Required textbook
Data Mining: Practical Machine Learning Tools and Techniques (Fourth Edition). Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Morgan Kaufmann. ISBN 978-0128042915. Available online.
Additional resources
Foundations of Statistical Natural Language Processing. C. Manning and H. Schütze. 1999.
Introduction to Information Retrieval. C. Manning, P. Raghavan, and H. Schütze. 2008.
Attendance
Students are expected to be present for every class. The discussions in class will help you with your exam questions, and I don’t want you to miss that. Of course, if you have to miss 2-3 classes due to unavoidable circumstances such as travel or sickness, that is fine. If you need to miss a class, please notify me beforehand.
Participation
Participation counts for 10% of your course credit, so please make sure that you participate regularly. Also, this is a practical course, and you learn more by participating!
Collaboration
Please approach me if all collaboration fails!
Plagiarism and cheating
I expect every student to honor UNC’s Honor Code. In the event of a violation, the student will be punished according to the University guidelines.
Late Policy
Assignments are expected to be submitted by 11:59 pm local time. A late assignment will be penalized 10% of the total assignment score for each day it is late. If the assignment has not been submitted within 5 days of the deadline, the student will receive an automatic 0. That said, I do realize that sometimes, due to unforeseeable circumstances, a submission may be late. Please contact me in that case. Maybe I can be of some help!
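The penalty arithmetic above can be sketched in a few lines of Python. This is only an illustration, not an official grading script: the function name `late_adjusted_score` is hypothetical, and it assumes that a partial day late counts as a full day and that the 10% is taken from the assignment's maximum score rather than from the score earned.

```python
import math


def late_adjusted_score(raw_score, max_score, hours_late):
    """Hypothetical helper: apply the late policy to one assignment.

    Assumptions (not stated in the syllabus):
      - any partial day late is rounded up to a full day;
      - the 10% penalty is computed on max_score, not raw_score.
    """
    if hours_late <= 0:
        return float(raw_score)  # submitted on time, no penalty

    days_late = math.ceil(hours_late / 24)
    if days_late > 5:
        return 0.0  # more than 5 days late: automatic 0

    penalty = 0.10 * max_score * days_late
    return max(0.0, raw_score - penalty)  # never drop below zero


if __name__ == "__main__":
    # An assignment worth 100 points, graded 90, submitted 30 hours late
    # (counted as 2 days under the rounding assumption above).
    print(late_adjusted_score(90, 100, 30))
```

Under these assumptions, an assignment that is exactly 5 days late loses half of its maximum score, and anything later than that receives 0.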
Overall grading
Grade assignment
Undergraduate students: A+ 97-100%, A 94-96%, A- 90-93%, B+ 87-89%, B 84-86%, B- 80-83%, C+ 77-79%, C 74-76%, C- 70-73%, D+ 67-69%, D 64-66%, D- 60-63%, F 0-59%.
Graduate students: H 95-100%, P 80-94%, L 60-79%, F 0-59%.
Acknowledgment
I would like to thank Dr. Jaime Arguello for allowing me to borrow his course material.
In this course, you will work on three homework assignments (based on lectures), one midterm, and one final project.
Your final project will have three deliverables: a proposal, a report, and a presentation. My advice: start early and get familiar with the research questions, data, and tools you plan to use. I highly encourage you to take this course as an opportunity to learn a new ML toolkit.
For more information on the deliverables please check here.