Fake News Detection

Compared classical embeddings against BERT fine-tuning on WELFake; the fine-tuned model reached 99.5% test accuracy on a known-separable benchmark.

NLP classifier

BERT vs. classical · WELFake

Overview

Group NLP project built around the WELFake dataset (~72K labeled articles). The work compared classical feature pipelines against a fine-tuned BERT classifier for fake-news detection.

My contribution

BERT fine-tuning, preprocessing/lemmatization workflow, and benchmark evaluation against classical embedding baselines.

Problem

Fake news spreads quickly on social platforms, but reliable detection needs models that generalize beyond simple keyword heuristics on noisy article text.

Approach

Built a preprocessing pipeline with lemmatization, stopword handling, and stratified train/test splits on WELFake.
Compared TF-IDF, Doc2Vec, Word2Vec, and Sentence2Vec text representations, paired with logistic regression, random forest, XGBoost, and k-NN classifiers as baselines.
Fine-tuned BERT on cleaned article text and evaluated against the classical embedding baselines.
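
The stratified split mentioned in the first step can be illustrated with a minimal standard-library sketch. This is not the project's code: the function name `stratified_split`, its parameters, and the toy data are hypothetical, and in practice scikit-learn's `train_test_split(..., stratify=labels)` covers the same need.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    train = [(texts[i], labels[i]) for i in train_idx]
    test = [(texts[i], labels[i]) for i in test_idx]
    return train, test


# Toy stand-in data (hypothetical, not WELFake): six label-1 and four label-0 items.
texts = [f"article {i}" for i in range(10)]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
train, test = stratified_split(texts, labels, test_fraction=0.5, seed=42)
```

Splitting per class keeps the label ratio of the test set close to that of the full dataset, so an accuracy figure is not skewed by an unlucky draw.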

Result

Fine-tuned BERT reached 99.5% test accuracy on WELFake, outperforming the classical embedding pipelines built earlier in the project.

Stack

Python · BERT · scikit-learn · Gensim · NLTK · pandas

Team repo (course)