🧹 Data Cleaning & Preparation

Messy Data In. Clean, Trusted Data Out.

Duplicate customers, inconsistent dates, broken postcodes, mixed formats — I clean, deduplicate, standardise and validate spreadsheets, CRM exports and entire databases using Python (Pandas) and SQL. GDPR-compliant masking and anonymisation available, applied with the same approach used to secure 600+ enterprise tables.

Data Cleaning Packages

📄
Spreadsheet & CSV Clean-Up

The classic rescue job: one or more Excel/CSV files full of duplicates, inconsistent names, mixed date formats and stray characters — returned clean, consistent and import-ready.

Includes
  • Deduplication (exact + fuzzy matching)
  • Date, phone, address & currency standardisation
  • Validation report: what was fixed and why
  • Output in any format — CSV, Excel, JSON
Python / Pandas · Excel · Fuzzy Matching
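As a rough illustration of the standardise-then-deduplicate step, here is a minimal sketch using only the Python standard library. The column names (`name`, `email`, `joined`), the list of accepted date formats and the day-first reading of ambiguous dates are all assumptions made for the example; a real job would confirm them against the sample.

```python
from datetime import datetime

# Assumed input formats, tried in order. Ambiguous values such as 03/02/2021
# are read day-first here, which is a choice to confirm for each dataset.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%d %B %Y")

def standardise_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if it cannot be parsed."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_rows(rows):
    """Standardise fields, drop exact duplicates, and log what was fixed and why."""
    seen, cleaned, report = set(), [], []
    for index, row in enumerate(rows):
        # Collapse stray whitespace and normalise case. title() is a blunt tool
        # (it mangles names like "McDonald"), so treat it as a placeholder rule.
        name = " ".join(row["name"].split()).title()
        email = row["email"].strip().lower()
        joined = standardise_date(row["joined"])
        if joined is None:
            report.append((index, "unparseable date: " + repr(row["joined"])))
        key = (name, email, joined)
        if key in seen:
            report.append((index, "duplicate dropped"))
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": email, "joined": joined})
    return cleaned, report

rows = [
    {"name": "  jane   doe ", "email": "Jane.Doe@Example.com ", "joined": "03/02/2021"},
    {"name": "Jane Doe", "email": "jane.doe@example.com", "joined": "2021-02-03"},
    {"name": "Sam Lee", "email": "sam@example.com", "joined": "3 Feb 2021"},
    {"name": "Ali Khan", "email": "ali@example.com", "joined": "sometime"},
]
cleaned, report = clean_rows(rows)
```

The two Jane Doe rows only become identical after standardisation, which is why the dedup runs last. The `report` list is the raw material for a validation report.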
🗄️
Database Cleaning at Scale

Quality rules, constraint repair and cleansing run directly inside your database — SQL Server, PostgreSQL or MySQL — so bad data stops at the source instead of being patched downstream.

Includes
  • Profiling: nulls, orphans, duplicates, drift
  • T-SQL / SQL cleansing scripts you keep
  • Constraints & checks to prevent re-contamination
  • Before/after data-quality scorecard
T-SQL · SQL Server · PostgreSQL
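To give a feel for in-database profiling, the sketch below runs three checks (nulls, duplicate keys, orphaned rows) as plain SQL. It uses an in-memory SQLite database driven from Python purely as a stand-in for SQL Server, PostgreSQL or MySQL. The `customers` and `orders` tables and their columns are hypothetical, and drift detection is not covered.

```python
import sqlite3

# In-memory SQLite stands in for the client database in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, email TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'a@example.com'), (2, NULL),
                                 (2, 'b@example.com'), (3, 'c@example.com');
    INSERT INTO orders VALUES (10, 1), (11, 4), (12, 3);
""")

# Each check is one query returning a single count.
CHECKS = {
    "null_emails": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
    # Surplus rows per repeated key: a key seen twice counts as one duplicate.
    "duplicate_customer_ids": """
        SELECT COALESCE(SUM(n - 1), 0) FROM (
            SELECT COUNT(*) AS n FROM customers
            GROUP BY customer_id HAVING COUNT(*) > 1
        )""",
    # Orders whose customer_id has no matching customer row.
    "orphan_orders": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL""",
}

scorecard = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
```

Running the same dictionary of checks before and after cleansing gives the two columns of a before/after scorecard.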
🔒
GDPR Masking & Anonymisation

Share data with vendors, analysts or test environments without exposing personal information — dynamic masking, pseudonymisation and fully anonymised dataset generation, GDPR-aligned.

Includes
  • PII discovery & classification across tables
  • Dynamic data masking or static anonymised copies
  • Role-based access recommendations (RBAC)
  • Documentation for your compliance records
GDPR · Data Masking · Microsoft Purview
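A minimal sketch of what a static anonymised copy can involve, using the Python standard library: a keyed hash (HMAC-SHA256) for pseudonymisation and a simple character mask for emails. The function names, the 16-character token length and the placeholder key are assumptions for illustration, not a description of any particular product's masking. Keyed pseudonyms are reversible in effect by anyone holding the key, so the result still counts as personal data under GDPR and the key needs protecting accordingly.

```python
import hashlib
import hmac

def pseudonymise(value, secret_key):
    """Keyed hash: the same input and key always give the same token, so joins still work."""
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256)
    # Truncated to 16 hex characters for readability; keep the full digest
    # if collision risk matters for the dataset.
    return digest.hexdigest()[:16]

def mask_email(email):
    """Static mask that keeps the first character of the local part and the domain."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain

# Hypothetical placeholder: a real key would come from a secrets vault.
key = b"replace-with-a-secret-from-a-vault"

token_a = pseudonymise("Jane.Doe@example.com", key)
token_b = pseudonymise(" jane.doe@EXAMPLE.com ", key)
masked = mask_email("jane.doe@example.com")
```

Because `token_a` and `token_b` are equal, the same person can still be linked across tables in the shared copy without the email itself appearing anywhere.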
🤖
ML-Ready Data Preparation

Raw data turned into model-ready datasets: outlier handling, missing-value strategy, encoding and normalisation — delivered as a documented, repeatable Python pipeline, not a one-off file.

Includes
  • Exploratory profiling & quality assessment
  • Outliers, imputation, encoding, scaling
  • Reusable Pandas/NumPy preprocessing script
  • Train/test-safe, leakage-aware preparation
Pandas · NumPy · Feature Prep
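"Train/test-safe" preparation comes down to one rule: learn imputation and scaling parameters from the training split only, then reuse them unchanged on the test split. The Pandas sketch below shows that rule for a single numeric column. The `income` column and the `fit_prep` / `apply_prep` helpers are hypothetical names for this example, and median imputation with standard scaling is just one reasonable choice.

```python
import pandas as pd

def fit_prep(train, column):
    """Learn imputation and scaling parameters from the training split only."""
    median = train[column].median()
    filled = train[column].fillna(median)
    # A constant column would give std == 0; a real pipeline should guard for that.
    return {"median": median, "mean": filled.mean(), "std": filled.std(ddof=0)}

def apply_prep(frame, column, params):
    """Apply previously learned parameters; never refit on the data being transformed."""
    filled = frame[column].fillna(params["median"])
    return (filled - params["mean"]) / params["std"]

train = pd.DataFrame({"income": [20.0, 30.0, None, 40.0]})
test = pd.DataFrame({"income": [None, 50.0]})

params = fit_prep(train, "income")
train_scaled = apply_prep(train, "income", params)
test_scaled = apply_prep(test, "income", params)
```

The missing test value is filled with the training median rather than anything computed from the test rows, so no information from the test split leaks into the preparation.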

How It Works

1 📧
Send a Sample

Email a small sample of your data (or just describe it). I’ll review the issues and reply with a fixed quote and turnaround — usually same day.

2 🧹
I Clean & Validate

Cleaning runs through scripted, repeatable Python/SQL steps — never manual find-and-replace — so every change is logged and reversible.
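One way to picture "logged and reversible" is a cleaning step that records the old and new value of every cell it touches. This is a minimal sketch under assumed names (`apply_step`, `revert`, a `postcode` column); the postcode rule relies only on the fact that the inward part of a UK postcode is always three characters.

```python
def format_postcode(value):
    """Upper-case a UK postcode and put a single space before the last three characters."""
    compact = value.replace(" ", "").upper()
    return compact[:-3] + " " + compact[-3:]

def apply_step(rows, column, transform, step_name, log):
    """Apply transform to one column, logging every value that actually changes."""
    for index, row in enumerate(rows):
        old = row[column]
        new = transform(old)
        if new != old:
            log.append({"step": step_name, "row": index, "column": column,
                        "old": old, "new": new})
            row[column] = new

def revert(rows, log):
    """Undo logged changes in reverse order, restoring the original values."""
    for entry in reversed(log):
        rows[entry["row"]][entry["column"]] = entry["old"]

rows = [{"postcode": "sw1a1aa"}, {"postcode": "SW1A 1AA"}]
original = [dict(row) for row in rows]  # kept only to demonstrate the round trip

log = []
apply_step(rows, "postcode", format_postcode, "format_postcode", log)
```

Only the first row is logged, because the second was already in the target format. Calling `revert(rows, log)` afterwards returns the data to its original state.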

3
Receive Clean Data + Report

You get the cleaned dataset, a summary of what changed, and (on request) the script itself so you can re-run it whenever new data arrives.

Cleaning Done Like an Engineer, Not an Intern

🛡️

GDPR Track Record

Secured and masked 600+ tables of enterprise data with Microsoft Purview, dynamic masking and role-based access — your data is handled the same way.

🔎

Catches What Eyeballs Miss

Automated validation once flagged 20,000+ bogus records across 345 organisations’ submissions — scripted checks find what manual review can’t.

🔁

Repeatable, Not One-Off

Every job is a script, not a manual edit. Re-run it next month on fresh data, or have me schedule it as an automated pipeline.

Fast & Fixed-Price

Most spreadsheet jobs are returned within 24–48 hours at a price agreed upfront from your sample — no hourly surprises.

Got a messy dataset right now?

Send a few sample rows and a sentence about what’s wrong with it — you’ll get a fixed quote and turnaround time, usually the same day.

FAQ

What file formats and sources can you work with?
CSV, Excel (xls/xlsx), JSON, XML, Google Sheets, and direct database connections (SQL Server, PostgreSQL, MySQL, Azure SQL, Snowflake). CRM and e-commerce exports (Salesforce, HubSpot, Shopify, WooCommerce) are all common jobs.

How is my data kept secure?
Data is processed on encrypted storage, never shared, and deleted after delivery on request. I’m happy to sign an NDA, and for sensitive datasets I can work on a masked sample or directly inside your own environment so raw personal data never leaves your systems.

How large a dataset can you handle?
Anything from a 200-row mailing list to multi-million-row warehouse tables. The largest build I’ve delivered processed 5.6 million records. Pandas handles most jobs; for very large volumes I switch to SQL- or Spark-based processing.

What is the difference between exact and fuzzy deduplication?
Exact dedup only catches identical rows. Fuzzy matching also catches “Jon Smith, 12 High St” vs “John Smith, 12 High Street”, using similarity scoring on names, addresses and emails, with a review file for borderline matches so nothing is merged blindly.

Can the cleaning be automated for recurring data?
Yes. Any cleaning job can be converted into a scheduled pipeline that picks up new files, cleans them and delivers the output automatically. See Database Design & Warehousing for full pipeline builds, or ask about a monthly retainer.
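The fuzzy-matching answer above can be sketched with Python's standard-library `difflib`. This is an illustration under assumptions, not the exact method used on a job: the abbreviation table, the two thresholds (0.95 and 0.80) and the three labels are hypothetical, and production matching would usually score names, addresses and emails separately.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; "st" can also mean "saint", so extend with care.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def normalise(text):
    """Lower-case, strip punctuation and expand common address abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

def similarity(a, b):
    """Similarity score between 0 and 1 on the normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def classify(a, b, match_at=0.95, review_at=0.80):
    """Label a pair as a match, a borderline case for the review file, or distinct."""
    score = similarity(a, b)
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "distinct"

reference = "John Smith, 12 High Street"
candidates = [
    "Jon Smith, 12 High St",
    "Joan Smyth, 12 High St",
    "Mary Jones, 4 Mill Rd",
]
labels = {candidate: classify(candidate, reference) for candidate in candidates}
```

With these thresholds the "Jon Smith" row is treated as a match, "Joan Smyth" lands in the review band, and "Mary Jones" is left alone. Where the thresholds sit is a per-dataset decision, which is the reason for having a review file at all.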
