- Starts: 10:00 am on Wednesday, September 17, 2025
- Ends: 12:00 pm on Wednesday, September 17, 2025
Title: Data-driven approaches for improving the identification of misleading content online
Presenter: Pujan Paudel
Advisor: Professor Gianluca Stringhini
Chair: Professor Yigong Hu
Committee: Professor Gianluca Stringhini, Professor Manuel Egele, Professor Mark Crovella, Professor Engin Kirda
Google Scholar Link: https://scholar.google.com/citations?user=8K4IiBwAAAAJ&hl=en&oi=ao
Abstract: Misleading content online appears in many forms, ranging from false claims that spread rapidly on social networks to craftily designed e-commerce websites that defraud users of money and trust. This thesis aims to build data-driven systems that improve the automated identification of misleading content online while supporting human-in-the-loop content moderation systems and downstream security systems. I achieve this goal by developing and evaluating four complementary systems that together strengthen platform soft-moderation practices and enable proactive discovery of scam websites on the broader Web.
First, I introduce a claim-comprehensive soft-moderation pipeline that uses learning-to-rank and information retrieval techniques to identify posts discussing misleading claims, increasing coverage and consistency of warning labels. Second, I propose an unsupervised, context-aware stance detection framework to distinguish the propagation of a falsehood from its critique or correction, reducing contextual false positives of warning labels. Third, I extend soft moderation beyond text with an efficient reverse image retrieval system that finds visually similar instances of misleading images at scale, enabling multi-modal moderation. Finally, I present a data-driven query-mining and scoring system that allows systematic issuing of search engine queries with a higher likelihood of returning scam websites, accelerating existing security pipelines to discover scam websites earlier, and improving the resource efficiency of downstream detection systems.
Across large-scale, heterogeneous datasets capturing real-world events on social media and diverse search engine results on the web, these systems (i) expand the coverage, context, and accuracy of soft moderation of misleading content online and (ii) improve the timeliness and yield of discovering online scam websites promoting misleading content. Collectively, these contributions develop a practical toolbox of systems for human-in-the-loop moderation and security systems, demonstrating that targeted, claim-centric, context-aware, and multi-modal pipelines can help make information ecosystems more trustworthy and safe.
- Location: PHO 339