Text Mining Workshop Series

The Text Mining Workshop Series is a four-part, hands-on tutorial program that introduces social science researchers to computational text analysis, from foundational concepts to advanced modeling techniques. Led by TAMIDS’s Data Science Ambassadors, the series begins with an accessible introduction to text mining, how it complements traditional qualitative methods, and its value for social science research.

Participants will learn the full text-mining pipeline, core concepts such as corpora and document-term matrices, and essential preprocessing techniques, including tokenization, stopword removal, and lemmatization. Each session combines theory with practical Python exercises in Jupyter Notebook, so participants gain both conceptual understanding and technical skills.

FOUNDATIONS & TEXT PREPROCESSING

Workshop One of Four

Friday, February 27 | 10 AM – 12 PM | Rudder Tower 601

Participants will understand what text mining is and why it’s valuable for social science research. They’ll learn how to prepare text data for analysis and perform basic text cleaning.

Key Topics:

  • Text mining applications in social sciences
  • Text mining vs. traditional qualitative methods
  • The text mining pipeline
  • Core concepts: corpus, document, token, document-term matrix
  • Preprocessing techniques: tokenization, lowercasing, punctuation removal, stopword removal, lemmatization/stemming
  • Creating word frequency distributions and visualizations
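
As a rough preview of the preprocessing steps listed above, here is a minimal sketch using only the Python standard library. The stopword list, the sample sentences, and the function name `preprocess` are illustrative assumptions, not workshop materials. Lemmatization and stemming are left out because they usually rely on an external library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real projects use a fuller list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on"}

def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                 # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

# A two-document toy corpus (hypothetical example sentences).
corpus = [
    "The senator spoke on the economy.",
    "Voters are worried: the economy is slowing!",
]

tokenized = [preprocess(doc) for doc in corpus]

# Word frequency distribution across the whole corpus.
freq = Counter(token for doc in tokenized for token in doc)
print(freq.most_common(3))
```

The resulting `Counter` can be passed to any plotting tool to draw a simple bar chart of the most frequent words.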

NOTE: Please install Jupyter Notebook before the workshop. If you would like help with Python, you can review the free PYTHON PRIMER on the Texas A&M Institute of Data Science’s website. Instructors will be available 30 minutes before the workshop starts to help anyone having trouble with the installation.

EXPLORATORY ANALYSIS & VISUALIZATION

Workshop Two of Four

Friday, March 27 | 10 AM – 12 PM | Rudder Tower 601

Participants will learn techniques for exploring and describing textual datasets, identifying patterns, and communicating findings through visualization. The first hour covers theory and concepts; the second hour is hands-on practice in Python.

Key Topics:

  • Word frequency analysis and comparative frequency
  • N-grams (bigrams and trigrams) for phrase detection
  • TF-IDF (Term Frequency-Inverse Document Frequency) for identifying distinctive terms
  • Text visualization best practices and techniques
  • Interpreting exploratory findings in a social science context
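
To give a feel for two of the topics above, the sketch below builds n-grams and computes TF-IDF weights in plain Python. It assumes one common textbook definition (term frequency as count divided by document length, inverse document frequency as log(N / df)); library implementations often use smoothed variants, so their numbers may differ. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Already-tokenized toy documents (hypothetical).
docs = [
    ["tax", "policy", "reform"],
    ["health", "policy", "reform"],
    ["tax", "cut", "debate"],
]

bigrams = ngrams(docs[0], 2)
scores = tf_idf(docs)
print(bigrams)
print(scores[1])
```

In the second document, "health" appears nowhere else in the corpus, so it receives a higher weight than "policy", which is shared with another document. This is the sense in which TF-IDF highlights distinctive terms.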

SUPERVISED TEXT CLASSIFICATION

Workshop Three of Four

Friday, April 10 | 10 AM – 12 PM | Rudder Tower 701

Participants will learn how to train models to categorize texts based on labeled examples and how to interpret evaluation metrics. The first hour covers theory and concepts; the second hour is hands-on practice in Python.

Key Topics:

  • Supervised learning paradigm and workflow
  • Applications in social science research (sentiment analysis, content categorization, frame detection)
  • Training data preparation and train-test splitting
  • Feature extraction with TF-IDF and bag-of-words
  • Classification algorithms: Naive Bayes and Logistic Regression
  • Evaluation metrics: accuracy, precision, recall, F1-score, confusion matrices
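
The evaluation metrics named above can be computed by hand from true and predicted labels, as in this minimal sketch for a binary task. The labels are made-up illustrative data and `binary_metrics` is a hypothetical helper, not part of any library.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and standard metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

m = binary_metrics(y_true, y_pred)
print(m)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.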

TOPIC MODELING & DISCOVERY

Workshop Four of Four

Friday, April 24 | 10 AM – 12 PM | Rudder Tower 510

Participants will learn how to identify latent themes in large text collections using unsupervised machine learning and how to interpret and validate these findings. The first hour covers theory and concepts; the second hour is hands-on practice in Python.

Key Topics:

  • Unsupervised learning and topic modeling concepts
  • Latent Dirichlet Allocation (LDA) intuition and implementation
  • Choosing the number of topics (K) using coherence metrics and interpretability
  • Systematic topic interpretation: words, documents, and labels
  • Evaluating topic quality and distinctiveness
  • Document-level and corpus-level topic analysis

Workshop Lead

Walid El Mansour

Senior Domain Data Science Ambassador
Department of Educational Administration & Human Resource Development

welmansour@tamu.edu