Entry-Level Job Classifier
Automatically discover entry-level opportunities by scraping LinkedIn listings, cleaning descriptions, and classifying roles with a lightweight machine learning model. Built to accelerate early-career job hunts with transparent, auditable logic.
The Entry-Level Job Classifier scrapes LinkedIn postings, cleans away noisy markup, and highlights the most relevant early-career positions. Validated listings are persisted to CSV for downstream analysis and surfaced through a compact classification API. The pipeline combines Python's robust scraping ecosystem with a tailored Naive Bayes model tuned for imbalanced datasets.
Targeted Scraping

Traverses LinkedIn result pages, capturing candidate job links while respecting pagination limits and throttling requests for reliability.

Description Cleaning

Normalizes job descriptions by stripping repeated HTML fragments and boilerplate to produce cleaner, model-ready text.

Imbalance-Aware Classification

A Complement Naive Bayes classifier powered by TF-IDF features separates true entry-level roles from mid-senior postings, even on imbalanced datasets.

CSV Persistence

Persists each validated listing and description pair to CSV for analysis, dashboards, or future retraining iterations.
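For downstream analysis, the persisted CSV loads directly into pandas. A minimal sketch, assuming the Link/Description header the scraper writes:

    import pandas as pd

    # Load scraped listings for dashboards or retraining passes.
    jobs = pd.read_csv("jobPost.csv")
    print(jobs["Link"].nunique(), "unique postings captured")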
def remove_common_chars(string: str) -> str:
    """Strip unwanted HTML artefacts from a job description."""
    # Multi-character tags must be replaced before the bare angle
    # brackets; otherwise "</strong>" has already lost its "<" and ">"
    # and the longer needles never match.
    replacements = {
        "</strong>": "",
        "</u>": "",
        "</li>": "",
        "<": "",
        ">": "",
    }
    for needle, replacement in replacements.items():
        string = string.replace(needle, replacement)
    return string
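The lookup table only covers tags seen in practice, and opening tags such as <li> still leave short residues once the brackets are stripped. For messier markup, a regex pass is a catch-all alternative. A minimal sketch, with strip_all_tags as a hypothetical helper name:

    import re

    def strip_all_tags(html: str) -> str:
        # Drop any remaining <...> tag wholesale, then collapse the
        # whitespace runs left behind.
        text = re.sub(r"<[^>]+>", " ", html)
        return re.sub(r"\s+", " ", text).strip()

    print(strip_all_tags("<li><strong>Requirements:</strong> 1+ years of Python</li>"))
    # Requirements: 1+ years of Python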
import csv
import time

import requests
from bs4 import BeautifulSoup


def get_links(url: str, num: int, limit: int) -> None:
    """Walk LinkedIn result pages 25 listings at a time until `limit`."""
    delay = 1  # seconds between job-page requests, to throttle politely
    if num >= limit:
        return
    print(f"Fetching {num} job links")
    # LinkedIn paginates search results with a `start` offset.
    page_url = f"{url}&start={num}" if num else url
    page = requests.get(page_url, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    hrefs = [tag.get("href") for tag in soup.find_all("a") if tag.get("href")]
    with open("jobPost.csv", mode="a", newline="") as file:
        writer = csv.writer(file)
        if num == 0:
            writer.writerow(["Link", "Description"])
        for link in hrefs:
            if link.startswith("https://www.linkedin.com/jobs/"):
                process_job_link(link)
                time.sleep(delay)
    # Recurse to the next 25-result page.
    get_links(url, num + 25, limit)
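get_links delegates per-posting work to process_job_link, which this excerpt omits. A minimal sketch consistent with the CSV schema above; exactly which fields the project extracts from each posting is an assumption:

    def process_job_link(link: str) -> None:
        # Hypothetical helper: fetch the posting, clean its description,
        # and append a Link/Description row to the same CSV.
        page = requests.get(link, timeout=10)
        soup = BeautifulSoup(page.content, "html.parser")
        description = remove_common_chars(soup.get_text(" ", strip=True))
        with open("jobPost.csv", mode="a", newline="") as file:
            csv.writer(file).writerow([link, description])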
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline


def job(description: str) -> bool:
    """Return True when the description is classified as entry-level."""
    # `data` is assumed to hold the labeled examples gathered by the scraper.
    df = pd.DataFrame(data)
    X_train, X_test, y_train, y_test = train_test_split(
        df["job_description"],
        df["label"],
        test_size=0.2,
        random_state=42,
    )
    vectorizer = TfidfVectorizer(stop_words="english")
    model = make_pipeline(vectorizer, ComplementNB())
    model.fit(X_train, y_train)
    prediction = model.predict([description])
    return prediction[0] == 1
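As written, job() refits the pipeline on every call and never touches the held-out split. Fitting once and scoring the 20% test set gives a sanity check before wiring up alerts. A sketch under the same data assumptions, with is_entry_level as a hypothetical wrapper:

    from sklearn.metrics import classification_report

    # Fit a single reusable pipeline instead of retraining per prediction.
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), ComplementNB())
    pipeline.fit(X_train, y_train)

    # Score the held-out 20% split that job() currently discards.
    print(classification_report(y_test, pipeline.predict(X_test)))

    def is_entry_level(description: str) -> bool:
        return pipeline.predict([description])[0] == 1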
LinkedIn listings are gathered in 25-result batches, de-duplicated, and written to CSV with their raw descriptions.
HTML tags and noisy fragments are stripped to produce structured text that emphasizes responsibilities and requirements.
TF-IDF features paired with Complement Naive Bayes learn the nuanced vocabulary of early career postings.
The trained model returns a boolean prediction, ready to trigger alerts or populate curated job boards for seekers.
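Taken together, a full run might look like the sketch below. The search URL and its f_E experience-level filter are illustrative assumptions, not the project's exact configuration:

    # Hypothetical end-to-end run: scrape four result pages, then classify.
    search_url = (
        "https://www.linkedin.com/jobs/search?"
        "keywords=software%20engineer&f_E=2"  # f_E=2: entry-level filter (assumption)
    )
    get_links(search_url, num=0, limit=100)

    sample = "Junior developer role; 0-2 years of experience, mentorship provided."
    print(job(sample))  # True when classified as entry-level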