Text Mining Workshop Series

The Text Mining Workshop Series is a four-part, hands-on tutorial program that introduces social science researchers to computational text analysis, from foundational concepts to advanced modeling techniques. Led by TAMIDS’s Data Science Ambassadors, the series begins with an accessible introduction to text mining, how it complements traditional qualitative methods, and its value for social science research.

Participants will learn the full text-mining pipeline, core concepts such as corpora and document-term matrices, and essential preprocessing techniques such as tokenization, stopword removal, and lemmatization. Each session combines theory with practical Python exercises in Jupyter Notebook, ensuring participants gain both conceptual understanding and technical skills.

FOUNDATIONS & TEXT PREPROCESSING

Workshop One of Four

Friday, February 27 | 10 AM – 12 PM | Rudder Tower 601

Participants will understand what text mining is and why it’s valuable for social science research. They’ll learn how to prepare text data for analysis and perform basic text cleaning.

Key Topics:

Text mining applications in social sciences
Text mining vs. traditional qualitative methods
The text mining pipeline
Core concepts: corpus, document, token, document-term matrix
Preprocessing techniques: tokenization, lowercasing, punctuation removal, stopword removal, lemmatization/stemming
Creating word frequency distributions and visualizations
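
As a rough preview of the preprocessing topics above, the following is a minimal sketch using only the Python standard library. The two example sentences, the tiny stopword list, and the function name preprocess are assumptions made for illustration, not workshop materials. Lemmatization and stemming are left out because they normally rely on a dedicated NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on"}

def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                 # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

# A corpus is a collection of documents; here, two made-up sentences.
corpus = [
    "The committee voted on the budget.",
    "Budget cuts are a concern to the committee!",
]

tokenized = [preprocess(doc) for doc in corpus]

# Document-term matrix: one row per document, one column per vocabulary term.
vocab = sorted({t for doc in tokenized for t in doc})
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Corpus-wide word frequency distribution.
freq = Counter(t for doc in tokenized for t in doc)

print(vocab)
for row in dtm:
    print(row)
print(freq.most_common(3))
```

Each row of dtm lines up with vocab, so a column of counts shows how often one term appears in each document.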

NOTE: Please install Jupyter Notebook ahead of the workshop. If you would like help with Python, you may review the free PYTHON PRIMER on the Texas A&M Institute of Data Science’s website. Instructors will be available 30 minutes before the workshop starts to help those who are struggling with the installation.

REGISTER

EXPLORATORY ANALYSIS & VISUALIZATION

Workshop Two of Four

Friday, March 27 | 10 AM – 12 PM | Rudder Tower 601

Participants will learn techniques for exploring and describing textual datasets, identifying patterns, and communicating findings through visualization. The first hour covers theory and concepts; the second hour is hands-on practice in Python.

Key Topics:

Word frequency analysis and comparative frequency
N-grams (bigrams and trigrams) for phrase detection
TF-IDF (Term Frequency-Inverse Document Frequency) for identifying distinctive terms
Text visualization best practices and techniques
Interpreting exploratory findings in a social science context
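
To make the n-gram and TF-IDF topics above concrete, here is a small sketch in plain Python. The three pre-tokenized documents and the helper names ngrams and tf_idf are hypothetical. The weighting used is the basic form, term frequency times log(N / document frequency); common libraries apply smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up, already preprocessed documents (an assumption for illustration).
docs = [
    ["public", "health", "policy", "reform"],
    ["public", "health", "funding"],
    ["tax", "policy", "reform"],
]

# Bigram counts across the corpus: frequent pairs hint at phrases.
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))

def tf_idf(docs):
    """Basic TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)

print(bigram_counts.most_common(2))
for i, doc_scores in enumerate(scores):
    top_term = max(doc_scores, key=doc_scores.get)
    print(i, top_term, round(doc_scores[top_term], 3))
```

Terms that occur in only one document get the largest weights, which is why TF-IDF is useful for finding what makes a document distinctive.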

REGISTER

SUPERVISED TEXT CLASSIFICATION

Workshop Three of Four

Friday, April 10 | 10 AM – 12 PM | Rudder Tower 701

Participants will learn how to train models to categorize texts based on labeled examples and how to interpret evaluation metrics. The first hour covers theory and concepts; the second hour is hands-on practice in Python.

Key Topics:

Supervised learning paradigm and workflow
Applications in social science research (sentiment analysis, content categorization, frame detection)
Training data preparation and train-test splitting
Feature extraction with TF-IDF and bag-of-words
Classification algorithms: Naive Bayes and Logistic Regression
Evaluation metrics: accuracy, precision, recall, F1-score, confusion matrices
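
The supervised workflow listed above can be sketched end to end with a multinomial Naive Bayes classifier written in plain Python. Everything here is an assumption for illustration: the toy documents, the "pos"/"neg" labels, the hand-made train/test split, and the function names. In practice a machine learning library would typically handle feature extraction, training, and evaluation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect class counts and per-class word counts (bag-of-words)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labeled data, split by hand into training and test sets.
train_docs = [
    ["great", "helpful", "service"],
    ["helpful", "friendly", "staff"],
    ["slow", "rude", "service"],
    ["rude", "unhelpful", "staff"],
]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = [["friendly", "helpful"], ["slow", "rude"],
             ["great", "staff"], ["unhelpful", "service"]]
test_labels = ["pos", "neg", "pos", "neg"]

model = train_naive_bayes(train_docs, train_labels)
predictions = [predict(model, doc) for doc in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)

print(predictions)
print(accuracy, precision_recall_f1(test_labels, predictions, "pos"))
```

Accuracy on a set this small says little; the point is the shape of the workflow: fit on training data only, predict on held-out data, then compare predictions with the true labels.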

REGISTER

TOPIC MODELING & DISCOVERY

Workshop Four of Four

Friday, April 24 | 10 AM – 12 PM | Rudder Tower 510

Participants will learn how to identify latent themes in large text collections using unsupervised machine learning and how to interpret and validate these findings. The first hour covers theory and concepts; the second hour is hands-on practice in Python.

Key Topics:

Unsupervised learning and topic modeling concepts
Latent Dirichlet Allocation (LDA) intuition and implementation
Choosing the number of topics (K) using coherence metrics and interpretability
Systematic topic interpretation: words, documents, and labels
Evaluating topic quality and distinctiveness
Document-level and corpus-level topic analysis
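
For intuition about how LDA assigns words to topics, the sketch below implements a bare-bones collapsed Gibbs sampler in plain Python. It is a teaching toy under stated assumptions: the four tiny documents, the function name lda_gibbs, and the hyperparameter values (alpha, beta, number of iterations) are all hypothetical, and it does not compute coherence scores. Real analyses would normally use an established topic modeling library.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Document-level topic proportions and top words per topic.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:3]
        for k in range(n_topics)
    ]
    return theta, top_words

docs = [
    ["vote", "election", "party", "vote"],
    ["election", "party", "candidate"],
    ["school", "teacher", "student", "school"],
    ["student", "teacher", "classroom"],
]

theta, top_words = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
for row in theta:
    print([round(p, 2) for p in row])
```

Topic numbers are arbitrary and results vary with the seed and the choice of K, which is why interpretation relies on reading the top words and the documents most associated with each topic.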

REGISTER

Workshop Lead

Walid El Mansour

Senior Domain Data Science Ambassador
Department of Educational Administration & Human Resource Development

welmansour@tamu.edu

Presenters

Walid El Mansour, Department of Educational Administration & Human Resource Development
Simon Shin, Department of Educational Administration & Human Resource Development
Garam Kim, Department of Psychological & Brain Sciences
Dr. Seung Won Yoon, Department of Educational Administration & Human Resource Development