COVID-19 Tweet Classification Using LLMs

Oscar Mejia Rodriguez

Co-Presenters: Individual Presentation

College: The Dorothy and George Hennings College of Science, Mathematics and Technology

Major: Computer Science

Faculty Research Mentor: Daehan Kwak

Abstract:

The COVID-19 pandemic sparked widespread discussion on social media platforms such as Twitter, where people shared their real-time perceptions of controversial topics such as vaccinations, public health policies, and the societal effects of the pandemic. This research tells the story of the pandemic by classifying millions of web-scraped COVID-19-related tweets into categories using Large Language Models (LLMs).

This process involved creating a pipeline that benchmarked the efficiency and accuracy of locally deployed LLMs (Mistral, Llama 3.1, Gemma, Qwen, etc.) on tasks such as topic classification. Tweets were scraped from Twitter and stored in a database, then retrieved and pre-processed so that only high-quality tweets were used. Using these models, we categorized tweets into a predefined set of topics such as masks, vaccines, and quarantining. The assigned categories were then appended to the table and stored in the database for further analysis.

To evaluate the performance of the different LLMs, we considered model accuracy and classification consistency. Zero-shot and few-shot prompting, with and without contextual examples, were explored to optimize the classifications generated by the LLMs. Throughout the research, various prompt engineering techniques, such as self-consistency prompting and chain-of-thought prompting, were used. Comparing the models showed that some were better suited to the task than others, which allowed us to identify the best model for the job.

The work accomplished during this research project underlines the potential of LLMs for labeling datasets, a task that would otherwise require significant manual work and time. The project demonstrates an efficient pipeline for extracting insights from large-scale datasets. Its findings can aid policy-making and demonstrate the utility of LLMs for real-time data classification during future emergencies.
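As an illustration only, the classification step might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the prompt wording, the "other" fallback category, and the names `build_prompt`, `parse_label`, `classify`, and `generate` are all hypothetical. `generate` stands in for whatever call sends a prompt to a locally deployed model and returns its text, ideally sampling with a nonzero temperature so repeated calls can differ. Passing no examples gives a zero-shot prompt, passing labeled examples gives a few-shot prompt, and the majority vote over several samples is a simple form of self-consistency.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# Topics named in the abstract; "other" is an assumed fallback category.
TOPICS = ("masks", "vaccines", "quarantining", "other")


def build_prompt(
    tweet: str,
    topics: Sequence[str],
    examples: Optional[Sequence[Tuple[str, str]]] = None,
) -> str:
    """Build a zero-shot prompt, or a few-shot prompt if examples are given."""
    parts = [
        "Classify the tweet into exactly one of these topics: "
        + ", ".join(topics)
        + ". Answer with the topic name only."
    ]
    for text, label in examples or ():
        parts.append(f"Tweet: {text}\nTopic: {label}")
    parts.append(f"Tweet: {tweet}\nTopic:")
    return "\n\n".join(parts)


def parse_label(response: str, topics: Sequence[str], fallback: str = "other") -> str:
    """Map free-form model output to the first known topic it mentions."""
    lowered = response.lower()
    for topic in topics:
        if topic in lowered:
            return topic
    return fallback


def classify(
    tweet: str,
    generate: Callable[[str], str],  # hypothetical wrapper around a local LLM
    topics: Sequence[str] = TOPICS,
    examples: Optional[Sequence[Tuple[str, str]]] = None,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample the model n_samples times and return (majority label, agreement)."""
    prompt = build_prompt(tweet, topics, examples)
    votes = Counter(parse_label(generate(prompt), topics) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

The agreement fraction returned alongside the label is one possible way to quantify classification consistency: a tweet whose samples all agree scores 1.0, while a split vote scores lower and could be flagged for review.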
