Building ML in the Dark: A Survival Guide for the Solo Practitioner

Author(s): Yuval Mehta

Originally published on Towards AI.

Photo by Boitumelo on Unsplash

No GPU cluster. No data team. No ML platform. Here’s what actually ships.

Most ML content is written for teams that have things. A labelled dataset. An MLOps platform. A data engineer who answers Slack messages. A GPU budget that someone has already approved.

You probably don’t have those things. You’re embedded in a product or analytics team, you were handed a vague mandate to “do something with ML,” and you have a laptop, a free-tier cloud account, and colleagues who think pandas is the animal.

This post is for you. Not a roadmap, but a survival guide: what to hack around, what to refuse, and how to get something real into production before your stakeholders lose interest.

TL;DR

- Bad data is not your biggest problem. Unclear problem definition is. Fix that first, or nothing else matters.
- You don’t need a GPU for most things that actually ship at company scale. Learn what does and doesn’t need one.
- Build the evaluation harness before the model. Without it, you can’t tell if anything is working.
- Know which requests to push back on entirely. Some “ML problems” should stay heuristics.
- The smallest deployable model that solves the problem is almost always the right model.

Your Actual Constraints (And Which Ones Are Real)

Before anything else, audit your constraints honestly. Some are hard walls. Most aren’t.

Compute is usually softer than it feels. For tabular data problems at company scale (< 10M rows, < 1,000 features), gradient-boosted trees on a single CPU core outperform most deep learning approaches and train in minutes. For embedding-based tasks, the free tier of any major cloud provider gets you surprisingly far. For LLM-based features, API calls are compute, and the per-call economics of gpt-4o-mini or claude-haiku are approachable at MVP volumes.

The things that genuinely require a GPU: training or fine-tuning transformer-scale models from scratch.
If training at that scale really is the job, you need either a cloud budget (Google Colab Pro+, a spot instance, or a Modal.com run-function) or a scoped problem that doesn’t require it. For almost everything else, the “we don’t have a GPU” constraint is a proxy for a different constraint you haven’t named yet.

Data is almost always actually a problem, but not usually in the way people think. The issue is rarely “not enough rows.” It’s label quality, label consistency, and the gap between what was logged and what you need. More on this shortly.

Engineering support is the constraint that actually kills most solo ML projects. Not because you can’t build the model alone, but because getting it called in production, monitored, and redeployed when it breaks requires someone on the other side to care. Scope your project to what you can maintain alone, or make a specific ask of one engineer before you start, not after you have a working model.

Start With the Evaluation Harness, Not the Model

This is the discipline that separates practitioners who ship from practitioners who perpetually have “a model working in the notebook.” Before writing a single line of training code, build the thing that tells you whether a model is working:

```python
import pandas as pd
from sklearn.metrics import classification_report, roc_auc_score
from typing import Callable

def evaluate(
    predict_fn: Callable,
    test_df: pd.DataFrame,
    label_col: str = "label",
    threshold: float = 0.5,
) -> dict:
    """
    Minimal evaluation harness. Pass any callable as predict_fn.
    Works for heuristics, sklearn models, and API-based LLM classifiers alike.
    """
    y_true = test_df[label_col].values
    y_scores = predict_fn(test_df.drop(columns=[label_col]))
    y_pred = (y_scores >= threshold).astype(int)
    report = classification_report(y_true, y_pred, output_dict=True)
    auc = roc_auc_score(y_true, y_scores)
    return {
        "auc": round(auc, 4),
        "precision": round(report["1"]["precision"], 4),
        "recall": round(report["1"]["recall"], 4),
        "f1": round(report["1"]["f1-score"], 4),
        "n_test": len(y_true),
        "positive_rate": round(y_true.mean(), 4),
    }

# Your first "model" should be a heuristic baseline
def heuristic_predict(df: pd.DataFrame) -> pd.Series:
    """Example: flag anything above a threshold in an existing signal column."""
    return (df["some_existing_signal"] > 50).astype(float)

# Now you have a number to beat
baseline_results = evaluate(heuristic_predict, test_df)
print(baseline_results)
```

Write this harness first because it forces two critical conversations: what does “working” mean, and what does the baseline look like? If you can’t define a test set and a success metric before training, you don’t have a problem definition; you have a research project. Research projects don’t get deployed.

The harness also gives you something to hand to a sceptical stakeholder before you’ve trained anything: “Here’s what a simple rule achieves. Here’s what we’d need to see to justify the model complexity.”

The Data Problem You Actually Have

You’ve been handed a dataset. It has labels. Here’s what’s probably wrong with it.

Label leakage from time. The label was set after the event you’re trying to predict. The model learns to recognise the aftermath, not the signal. Check: can you reconstruct your feature set as it existed at prediction time? If event data is joined without strict temporal cutoffs, you have a problem.

Label disagreement between sources. Two systems that should agree on the label don’t, and someone just unioned them. Spot-check 50 positives and 50 negatives manually.
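Pulling that spot-check sample is a few lines of pandas. A sketch, assuming your data is a DataFrame with a binary `label` column (the DataFrame here is a synthetic stand-in and all column names are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset you were handed (names illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "text_id": range(1_000),
    "some_existing_signal": rng.normal(50, 10, size=1_000),
    "label": rng.integers(0, 2, size=1_000),
})

# 50 positives and 50 negatives, shuffled so reviewers can't anchor on order
sample = pd.concat([
    df[df["label"] == 1].sample(50, random_state=0),
    df[df["label"] == 0].sample(50, random_state=0),
]).sample(frac=1, random_state=0)

# Export for manual review: re-label by hand, then compare against `label`
sample.to_csv("label_spot_check.csv", index=False)
print(sample["label"].value_counts())
```

Re-label the exported rows by hand without looking at the existing labels, then measure how often you agree with them.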
If you disagree with 15% of the labels, your ceiling is around 85% accuracy regardless of model complexity.

Class imbalance that isn’t handled. A 99:1 imbalance doesn’t mean you need a complex technique. It means your evaluation metric needs to be AUC-ROC or F1, not accuracy, and your baseline “predict everything negative” is already at 99% accuracy and completely useless.

A quick label audit that catches most of these:

```python
def audit_labels(df: pd.DataFrame, label_col: str, date_col: str) -> None:
    """Quick data quality checks before touching a model."""
    print(f"Label distribution:\n{df[label_col].value_counts(normalize=True).round(3)}\n")

    # Check for temporal consistency
    if date_col in df.columns:
        df["month"] = pd.to_datetime(df[date_col]).dt.to_period("M")
        monthly_rate = df.groupby("month")[label_col].mean()
        print("Label rate over time (should be stable or trend smoothly):")
        print(monthly_rate.to_string())
        # A spike in label rate is often a labelling artefact, not a real signal
        rate_std = monthly_rate.std()
        if rate_std > 0.05:
            print(f"\n⚠️ High label rate variance ({rate_std:.3f}). Check for labelling changes.")

    # Duplicates
    dup_rate = df.duplicated(subset=[c for c in df.columns if c != label_col]).mean()
    print(f"\nDuplicate feature row rate: {dup_rate:.3f}")
```
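The imbalance point is easy to demonstrate. A small sketch on synthetic 99:1 labels, showing why the all-negative baseline looks great on accuracy and is useless on every metric that matters:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # roughly 99:1 imbalance
y_pred = np.zeros_like(y_true)                    # "predict everything negative"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")                  # ~0.99
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.3f}")       # 0.0
print(f"AUC:      {roc_auc_score(y_true, y_pred):.3f}")                   # 0.5
```

Accuracy rewards the useless baseline; F1 and AUC expose it, which is why they belong in the harness for imbalanced problems.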
