Automation & NLP

Entry-Level Job Classifier using Natural Language Processing

Automatically discover entry-level opportunities by scraping LinkedIn listings, cleaning descriptions, and classifying roles with a lightweight machine learning model. Built to accelerate early-career job hunts with transparent, auditable logic.

View on GitHub | Python • BeautifulSoup • Scikit-learn

Project Overview

The Entry-Level Job Classifier scrapes LinkedIn postings, cleans away noisy markup, and highlights the most relevant early-career positions. Validated listings are persisted to CSV for downstream analysis and surfaced through a compact classification API. The pipeline combines Python's robust scraping ecosystem with a tailored Naive Bayes model tuned for imbalanced datasets.

Key Features

Targeted Scraping

Traverses LinkedIn result pages, capturing candidate job links while respecting pagination limits and throttling requests for reliability.

Context-Aware Cleaning

Normalizes job descriptions by stripping repeated HTML fragments and boilerplate to produce cleaner, model-ready text.

Entry-Level Detection

A Complement Naive Bayes classifier powered by TF-IDF features separates true entry-level roles from mid-senior postings, even on imbalanced datasets.

Audit-Friendly Storage

Persists each validated listing and description pair into CSV for analysis, dashboards, or future retraining iterations.

Highlighted Implementation

Cleaning Job Descriptions

Pre-processing
def remove_common_chars(string: str) -> str:
    """Remove unwanted HTML artefacts from a job description."""
    # Strip whole tag fragments before stray brackets; replacing "<" and ">"
    # first would break the longer patterns so they could never match.
    replacements = {
        "</strong>": "",
        "</u>": "",
        "</li>": "",
        "<": "",
        ">": "",
    }

    for needle, replacement in replacements.items():
        string = string.replace(needle, replacement)

    return string
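A quick check with an invented fragment shows the effect; only the closing tags and stray brackets listed above are touched:

Usage Example
raw = "Own small features</li> and pair with mentors daily</u>"
print(remove_common_chars(raw))
# -> Own small features and pair with mentors daily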

Fetching LinkedIn Postings

Scraping Loop
import csv
import time

import requests
from bs4 import BeautifulSoup

def get_links(url: str, num: int, limit: int) -> None:
    """Walk LinkedIn search results in 25-listing pages until `limit`."""
    if num >= limit:
        return
    delay = 1  # seconds between requests, to throttle politely

    print(f"Fetching job links from offset {num}")
    page_url = f"{url}&start={num}" if num else url

    page = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(page.content, "html.parser")

    hrefs = [tag.get("href") for tag in soup.find_all("a") if tag.get("href")]

    with open("jobPost.csv", mode="a", newline="") as file:
        writer = csv.writer(file)
        if num == 0:
            writer.writerow(["Link", "Description"])

        for link in hrefs:
            if link.startswith("https://www.linkedin.com/jobs/"):
                process_job_link(link)  # defined elsewhere: fetches, cleans, and stores the posting
                time.sleep(delay)

    # LinkedIn paginates search results 25 at a time.
    get_links(url, num + 25, limit)
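Kicking off a crawl is then a single call. The search URL below is a placeholder, not a query taken from the project:

Usage Example
SEARCH_URL = "https://www.linkedin.com/jobs/search?keywords=junior%20developer"  # placeholder query
get_links(SEARCH_URL, 0, 200)  # walk up to 200 results, 25 per page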

Classifying Entry-Level Roles

Model Pipeline
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

def job(description: str) -> bool:
    """Return True when a description is classified as entry-level."""
    # `data` is the labelled corpus gathered by the scraper
    # (columns: job_description, label with 1 = entry-level).
    df = pd.DataFrame(data)

    X_train, X_test, y_train, y_test = train_test_split(
        df["job_description"],
        df["label"],
        test_size=0.2,
        random_state=42,
    )

    vectorizer = TfidfVectorizer(stop_words="english")
    model = make_pipeline(vectorizer, ComplementNB())

    # The holdout (X_test, y_test) stays reserved for evaluation.
    model.fit(X_train, y_train)
    prediction = model.predict([description])
    return prediction[0] == 1
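Because the pipeline refits on every call, this compact form suits one-off checks; a batch workflow would fit once and reuse the model. Calling it looks like this (the sample description is invented):

Usage Example
if job("We welcome recent graduates; no prior experience required."):
    print("Entry-level match found")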

From Data Collection to Prediction

Scrape & Store

LinkedIn listings are gathered in 25-result batches, de-duplicated, and written to CSV with their raw descriptions.
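The scraping loop above does not show the de-duplication step; a minimal sketch, assuming the listing URL is the natural key, is a seen-set filter applied before each write:

De-duplication Sketch
seen_links: set[str] = set()

def is_new(link: str) -> bool:
    """Record a link the first time it appears; reject repeats."""
    if link in seen_links:
        return False
    seen_links.add(link)
    return True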

Clean & Normalize

HTML tags and noisy fragments are stripped to produce structured text that emphasizes responsibilities and requirements.
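The remove_common_chars helper targets the handful of fragments that recur in LinkedIn markup. For messier input, a heavier alternative (a sketch, not what the project ships) hands the whole document to BeautifulSoup:

Alternative Cleaner
from bs4 import BeautifulSoup

def strip_all_tags(html: str) -> str:
    """Drop every HTML tag and collapse whitespace between text nodes."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)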

Vectorize & Train

TF-IDF features paired with Complement Naive Bayes learn the nuanced vocabulary of early-career postings.
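The 20% holdout created by train_test_split is the natural place to measure that separation. A sketch, assuming the split variables and fitted pipeline from the Model Pipeline section:

Evaluation Sketch
from sklearn.metrics import classification_report

model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))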

Predict & Iterate

The trained model returns a boolean prediction, ready to trigger alerts or populate curated job boards for seekers.
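End to end, a short loop over the stored CSV can surface matches; this sketch assumes jobPost.csv carries the Link and Description columns the scraper writes:

Batch Prediction Sketch
import pandas as pd

listings = pd.read_csv("jobPost.csv")
entry_level = listings[listings["Description"].apply(job)]
print(entry_level["Link"].tolist())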