Leveraging Multimodal AI for Medical Image Diagnosis through LLMs and Visual Question Answering
Shazab Ali
Co-Presenters: Individual Presentation
College: The Dorothy and George Hennings College of Science, Mathematics and Technology
Major: Computer Science
Faculty Research Mentor: Daehan Kwak
Abstract:
Artificial Intelligence (AI), particularly large language models (LLMs), has significantly advanced medical diagnostics, especially in interpreting medical images. This research explores the application of the LLaVa (Large Language and Vision Assistant) model to medical visual question answering (VQA) on chest X-rays, focusing on COVID-19 diagnoses. Accurate, automated diagnostic tools are crucial in healthcare, where rapid and precise analysis is often life-saving. Chest X-rays, essential for diagnosing conditions like COVID-19, require both expertise and time to interpret. This study investigates how LLaVa can augment or partially automate this interpretation by generating accurate responses to clinical queries.

LLaVa integrates natural language processing and computer vision to process image inputs and generate contextually relevant text responses, making it suitable for multimodal tasks like medical VQA. Leveraging its ability to learn from image-text pairs, LLaVa was fine-tuned to analyze chest X-rays. We used the LLaVa 1.5 model, known for its dynamic high-resolution capabilities, which enhance its perception of intricate medical details. To improve LLaVa’s performance, we extracted mask features from the images using a convolutional neural network (CNN) and integrated these features into the model, enabling it to better capture structural details and enhance diagnostic accuracy.

A dataset of 7,146 chest X-ray images paired with detailed doctors' notes was used for model training, refining LLaVa’s ability to associate visual cues with clinical findings. Training spanned 10 epochs on two NVIDIA RTX 3090 GPUs. Hyperparameters, such as the learning rate, were tuned to improve accuracy, and a test set of 2,114 images was used to evaluate the model's performance.

Preliminary results demonstrate LLaVa’s potential in identifying pulmonary infections and addressing complex clinical questions. However, challenges remain with nuanced cases and ambiguous visual cues, and further refinement is necessary to improve robustness and accuracy. LLaVa’s integration into medical VQA systems shows promise in supporting healthcare professionals, enhancing diagnostic accuracy, and improving patient outcomes.
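The abstract does not say how answers on the test set were scored. As a minimal sketch only, one common option for closed-ended VQA questions is exact-match accuracy after light text normalization. The function names `normalize` and `exact_match_accuracy` and the example answer pairs below are hypothetical illustrations, not the study's evaluation code or data.

```python
import re
from typing import Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", answer.lower())
    return " ".join(cleaned.split())


def exact_match_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, reference) pairs that match after normalization."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no examples to score")
    hits = sum(normalize(pred) == normalize(ref) for pred, ref in pairs)
    return hits / len(pairs)


# Hypothetical (model answer, reference answer) pairs, for illustration only.
examples = [
    ("Yes.", "yes"),
    ("No", "no"),
    ("yes", "no"),
    ("Bilateral opacities", "bilateral opacities"),
]

accuracy = exact_match_accuracy(examples)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Exact match suits short yes/no or single-finding answers. Free-text answers to open-ended clinical questions would likely need a softer measure or review by a clinician.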