Course Objective

This course will allow the student to develop a general understanding of knowledge discovery and gain a specific understanding of text mining. Students will become familiar with both the theoretical and practical aspects of text mining and develop a proficiency with data modeling.


Production and consumption of information have drastically changed. Today, we expect websites and apps to understand what we need and provide the appropriate service. At the same time, there is a growing call for better transparency about what such services provide. This course will give you theoretical and practical knowledge of how to tackle such problems. You will learn how to parse text data to gain insights, and also learn why certain algorithms work the way they do.

The course is divided into three modules: basics, principles, and applications (see details below). The third part of the course will focus on several applications of text mining: methods for automatically organizing textual documents for sense-making and navigation (clustering and classification), methods for detecting opinion and bias, methods for detecting and resolving specific entities in text (information extraction and resolution), and methods for learning new relations between entities (relation extraction). Throughout the course, a strong emphasis will be placed on evaluation. Students will develop a deep understanding of one particular method through a course project.


Students are expected to have experience with at least one programming language, such as C/C++, Java, Python, R, SAS, or Matlab. Please see me if you are not proficient in programming so that we can discuss ways to accommodate you. For example, tools such as Weka, Excel, and Tableau support text analysis without any programming background.

Time and location

[ MoTuTh 3:15PM - 5:50PM ]

Office Hours

By appointment

Required textbook

Data Mining: Practical Machine Learning Tools and Techniques (Fourth Edition). Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Morgan Kaufmann. ISBN 978-0128042915. Available online.

Additional resources

Foundations of Statistical Natural Language Processing. C. Manning and H. Schütze. 1999.

Introduction to Information Retrieval. C. Manning, P. Raghavan, and H. Schütze. 2008.


Students are expected to be present for every class. The discussions in class will help you with your exam questions, and I don’t want you to miss them. Of course, if you have to miss 2-3 classes due to unavoidable circumstances such as travel or sickness, that is fine. If you need to miss a class, please notify me beforehand.


10% of your course credit will be for participation, so please make sure that you regularly participate. Also, this is a practical course and you learn more by participating!


  • Discussion of ideas and of approaches to solving problems is encouraged.
  • However, any work that you submit must be your own.
  • If you do get help from another person, please write their name at the top of your submission. This is not meant to hurt you but to benefit the person who helped you. By help, I do not mean copying an answer, but rather getting assistance with how to approach a problem.
  • I realize that some of you will be approached by your peers for assistance. I highly encourage this, but please do not give away the answers. Let them learn!

Please approach me if all collaboration fails!

Plagiarism and cheating

I expect every student to honor UNC’s Honor Code. In the event of a violation, the student will be punished according to the University guidelines.

Late Policy

Assignments are expected to be submitted by 11:59 pm local time. A late assignment is penalized 10% of the total assignment score for every day it is late. If the assignment has still not been submitted after 5 days, the student will receive an automatic 0. That said, I do realize that sometimes, due to unforeseeable circumstances, a submission may be late. Please contact me in that case. Maybe I can be of some help!
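
As a rough illustration of the arithmetic in the late policy, here is a minimal Python sketch. The function name late_adjusted_score and its parameters are hypothetical, and the sketch assumes that any started day counts as a full day late, which the policy itself does not specify. Treat it as one possible reading, not the official calculation.

```python
import math


def late_adjusted_score(raw_score, max_score, days_late):
    """Apply the late penalty to one assignment (illustrative only).

    raw_score: points earned before any penalty
    max_score: total points the assignment is worth
    days_late: time past the deadline, in days (may be fractional)
    """
    if days_late <= 0:
        return float(raw_score)  # submitted on time, no penalty

    # Assumption: a partially elapsed day counts as a whole day.
    whole_days = math.ceil(days_late)

    # More than 5 days late means an automatic 0.
    if whole_days > 5:
        return 0.0

    # 10% of the total assignment score is deducted per day late.
    penalty = 0.10 * max_score * whole_days
    return max(0.0, raw_score - penalty)
```

For instance, under these assumptions an assignment worth 100 points that earns 90 and arrives two days late would be recorded as 70.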

Overall grading

  • Participation: 10%.
  • Midterm: 20%.
  • Homework: 30%.
  • Final Project: 40% (Proposal: 5%; Project report: 25%; Presentation: 10%)
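
To see how these weights combine, the following Python sketch computes an overall course percentage. The names WEIGHTS and course_percentage are hypothetical, and the sketch assumes each component is already expressed as a percentage between 0 and 100 (for example, the average of the three homework scores). The final project is split into its three listed parts.

```python
# Weights taken from the grading breakdown; they sum to 1.0.
WEIGHTS = {
    "participation": 0.10,
    "midterm": 0.20,
    "homework": 0.30,
    "proposal": 0.05,      # final project, part 1
    "report": 0.25,        # final project, part 2
    "presentation": 0.10,  # final project, part 3
}


def course_percentage(scores):
    """Return the weighted course percentage (illustrative only).

    scores: dict mapping every key in WEIGHTS to a percentage in [0, 100].
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {
    "participation": 100,
    "midterm": 80,
    "homework": 90,
    "proposal": 100,
    "report": 85,
    "presentation": 90,
}
overall = course_percentage(example)  # approximately 88.25
```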

Grade assignment

Undergraduate students: A+ 97-100%, A 94-96%, A- 90-93%, B+ 87-89%, B 84-86%, B- 80-83%, C+ 77-79%, C 74-76%, C- 70-73%, D+ 67-69%, D 64-66%, D- 60-63%, F 0-59%.

Graduate students: H 95-100%, P 80-94%, L 60-79%, and F 0-59%.
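
The grade bands can be read as a lookup on the lower bound of each range. The Python sketch below shows one way to express that. The names UNDERGRAD_BANDS, GRADUATE_BANDS, and letter_grade are hypothetical, and the sketch assumes that a fractional percentage falls into the band whose lower bound it meets, without rounding. The syllabus lists whole-number ranges only, so how fractions are handled is an assumption.

```python
# (lower bound in percent, letter), ordered from highest to lowest.
UNDERGRAD_BANDS = [
    (97, "A+"), (94, "A"), (90, "A-"),
    (87, "B+"), (84, "B"), (80, "B-"),
    (77, "C+"), (74, "C"), (70, "C-"),
    (67, "D+"), (64, "D"), (60, "D-"),
    (0, "F"),
]
GRADUATE_BANDS = [(95, "H"), (80, "P"), (60, "L"), (0, "F")]


def letter_grade(percentage, graduate=False):
    """Map a course percentage to a letter grade (illustrative only)."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    bands = GRADUATE_BANDS if graduate else UNDERGRAD_BANDS
    # Assumption: no rounding; the first lower bound met decides the grade.
    for lower_bound, letter in bands:
        if percentage >= lower_bound:
            return letter
```

Under these assumptions, a score of 88 maps to B+ for an undergraduate student and to P for a graduate student.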


I would like to thank Dr. Jaime Arguello for allowing me to borrow his course material.

[ Assignments ]

In this course, you will work on three homework assignments (based on lectures), one midterm, and one final project.

Your final project will have three deliverables: proposal, report, and presentation. My advice: start early and get familiar with the research questions, data, and tools you plan to use. I highly encourage you to take this course as an opportunity to learn a new ML toolkit. For more information on the deliverables please check here.

[ Schedule ]
Summer - I (2019)
#Lecture | Date | Topic | Events | Readings
1 | 5/16 | Introduction and course outline | | WFHP Ch. 1, Mitchell '06, Hearst '99
2 | 5/20 | Predictive analysis | | WFHP Ch. 2, Domingos '12
3 | 5/21 | Text Representation | HW1 out | WFHP Ch. 4.2
4 | 5/23 | Machine Learning: Linear Classifiers and Naïve Bayes | | Mitchell Sections 1 and 2; Andrew Ng's notes
5 | 5/27 | Memorial Day | |
6 | 5/28 | Machine Learning: Linear Classifier + ML Toolkits | HW1 due |
7 | 5/30 | LightSIDE (slides); Machine Learning: Instance Based Classification + Review and HW discussion | HW2 out | WFH Ch. 4.7
8 | 6/3 | Mid-Term | |
9 | 6/4 | ML-Toolkits tutorial + Mid-Term Review | |
10 | 6/6 | Predictive analysis: Experimentation and evaluation - Part I | Proposal due | WFHP Ch. 5
11 | 6/10 | Predictive analysis: Experimentation and evaluation - Part II | HW2 due | Smucker et al. '07, Cross-Validation, Parameter tuning and overfitting
12 | 6/11 | Machine Learning: Clustering; LDA [Code] | HW3 out | Manning Ch. 16
13 | 6/13 | SVMs and CRFs | |
14 | 6/17 | Applied Machine Learning: Sentiment analysis, View Point Detection, Text-based forecasting | HW3 due |
15 | 6/18 | Class Presentations - I | |
16 | 6/18 | Class Presentations - II | Final Project submission |