🧹 Data Cleaning & Preparation

Messy Data In. Clean, Trusted Data Out.

Duplicate customers, inconsistent dates, broken postcodes, mixed formats: I clean, deduplicate, standardise and validate spreadsheets, CRM exports and entire databases using Python (Pandas) and SQL. GDPR-compliant masking and anonymisation are also available, using the same approach I applied to secure 600+ enterprise tables.


What I Offer

Data Cleaning Packages

📄

Spreadsheet & CSV Clean-Up

The classic rescue job: one or more Excel/CSV files full of duplicates, inconsistent names, mixed date formats and stray characters — returned clean, consistent and import-ready.

Includes

Deduplication (exact + fuzzy matching)
Date, phone, address & currency standardisation
Validation report: what was fixed and why
Output in any format — CSV, Excel, JSON

Python / Pandas, Excel, Fuzzy Matching
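To give a feel for what date standardisation involves, here is a minimal sketch using only the Python standard library. It is an illustration, not the script used on any particular job: the list of candidate formats and the day-first reading of ambiguous values such as 07/03/2024 are assumptions, and the function name `standardise_date` is hypothetical. Values that cannot be parsed are reported with a reason rather than guessed, which is what feeds a validation report.

```python
from datetime import datetime

# Assumed formats, tried in priority order. Day-first is assumed for
# ambiguous values like 07/03/2024 (UK-style); adjust for your data.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%d %B %Y"]


def standardise_date(raw):
    """Return (iso_date, None) on success or (None, reason) on failure."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat(), None
        except ValueError:
            continue  # try the next candidate format
    return None, f"unrecognised date: {raw!r}"


rows = ["2024-03-07", "07/03/2024", " 7 Mar 2024 ", "7 March 2024", "31/02/2024", "n/a"]
cleaned = [standardise_date(r) for r in rows]

for raw, (iso, problem) in zip(rows, cleaned):
    print(f"{raw!r:>16} -> {iso or problem}")
```

The first four inputs all resolve to the same ISO date, while the impossible date and the placeholder text are flagged for review instead of being silently dropped.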

🗄️

Database Cleaning at Scale

Quality rules, constraint repair and cleansing run directly inside your database — SQL Server, PostgreSQL or MySQL — so bad data stops at the source instead of being patched downstream.

Includes

Profiling: nulls, orphans, duplicates, drift
T-SQL / SQL cleansing scripts you keep
Constraints & checks to prevent re-contamination
Before/after data-quality scorecard

T-SQL, SQL Server, PostgreSQL

🔒

GDPR Masking & Anonymisation

Share data with vendors, analysts or test environments without exposing personal information — dynamic masking, pseudonymisation and fully anonymised dataset generation, GDPR-aligned.

Includes

PII discovery & classification across tables
Dynamic data masking or static anonymised copies
Role-based access recommendations (RBAC)
Documentation for your compliance records

GDPR, Data Masking, Microsoft Purview
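As a rough illustration of static masking and pseudonymisation, the sketch below replaces an email address with a keyed token and a partially masked display value. It is a simplified example rather than a description of how Microsoft Purview or database-level dynamic masking works: the key, the field names and the helper functions `pseudonymise` and `mask_email` are all hypothetical. Keyed hashing is pseudonymisation, not anonymisation, because whoever holds the key can re-link tokens to people.

```python
import hashlib
import hmac

# Hypothetical placeholder: a real key would come from a secrets store.
SECRET_KEY = b"replace-with-a-key-from-your-secrets-store"


def pseudonymise(value, key=SECRET_KEY):
    """Deterministic keyed token: same input gives same token, so joins still work."""
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()[:16]


def mask_email(email):
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"  # not a recognisable email, mask it entirely
    return f"{local[:1]}***@{domain}"


records = [
    {"email": "jane.doe@example.com", "spend": 120.0},
    {"email": "Jane.Doe@example.com ", "spend": 35.5},
]

safe_copy = [
    {
        "customer_ref": pseudonymise(r["email"]),
        "email": mask_email(r["email"].strip()),
        "spend": r["spend"],
    }
    for r in records
]

for row in safe_copy:
    print(row)
```

Because the token is deterministic, both rows above receive the same `customer_ref`, so an analyst can still aggregate spend per customer without seeing the address.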

🤖

ML-Ready Data Preparation

Raw data turned into model-ready datasets: outlier handling, missing-value strategy, encoding and normalisation — delivered as a documented, repeatable Python pipeline, not a one-off file.

Includes

Exploratory profiling & quality assessment
Outliers, imputation, encoding, scaling
Reusable Pandas/NumPy preprocessing script
Train/test-safe, leakage-aware preparation

Pandas, NumPy, Feature Prep
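"Train/test-safe, leakage-aware" can be made concrete with a short sketch. The idea is that imputation and scaling parameters are learned from the training split only and then reused unchanged on the test split. This example uses the standard library instead of Pandas/NumPy so it runs anywhere; the function names `fit_preprocessor` and `transform` and the choice of median imputation with standard scaling are assumptions for illustration.

```python
from statistics import mean, median, pstdev


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from the training split only."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in train_values]
    sigma = pstdev(filled) or 1.0  # avoid dividing by zero on constant columns
    return {"fill": fill, "mean": mean(filled), "std": sigma}


def transform(values, params):
    """Apply previously fitted parameters; never re-fit on new data."""
    return [
        ((params["fill"] if v is None else v) - params["mean"]) / params["std"]
        for v in values
    ]


train = [10.0, 12.0, None, 14.0]
test = [None, 20.0]

params = fit_preprocessor(train)        # statistics come from train only
train_scaled = transform(train, params)
test_scaled = transform(test, params)   # test reuses the train statistics

print(params)
print(train_scaled)
print(test_scaled)
```

Fitting on the combined data instead would let information from the test rows influence the training features, which tends to make evaluation results look better than they really are.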

Process

How It Works

1 📧

Send a Sample

Email a small sample of your data (or just describe it). I’ll review the issues and reply with a fixed quote and turnaround — usually same day.

2 🧹

I Clean & Validate

Cleaning runs through scripted, repeatable Python/SQL steps — never manual find-and-replace — so every change is logged and reversible.
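One way to make "logged and reversible" concrete is to record the old and new value of every changed cell, then replay the log backwards to undo. The sketch below is a minimal illustration under that assumption; the helpers `apply_step` and `revert` and the log layout are hypothetical, not a fixed format.

```python
def apply_step(rows, column, func, step_name, log):
    """Apply func to one column, recording every changed cell in the log."""
    for index, row in enumerate(rows):
        old = row[column]
        new = func(old)
        if new != old:
            log.append(
                {"step": step_name, "row": index, "column": column, "old": old, "new": new}
            )
            row[column] = new


def revert(rows, log):
    """Undo all logged changes, most recent first."""
    for entry in reversed(log):
        rows[entry["row"]][entry["column"]] = entry["old"]
    log.clear()


rows = [{"name": "  alice SMITH "}, {"name": "Bob Jones"}]
log = []

apply_step(rows, "name", str.strip, "trim_whitespace", log)
apply_step(rows, "name", str.title, "title_case", log)

for entry in log:
    print(entry)
print(rows)
```

The log doubles as the raw material for the change summary: each entry says which step touched which cell and what it looked like before.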

3 ✅

Receive Clean Data + Report

You get the cleaned dataset, a summary of what changed, and (on request) the script itself so you can re-run it whenever new data arrives.

Why Me

Cleaning Done Like an Engineer, Not an Intern

🛡️

GDPR Track Record

Secured and masked 600+ tables of enterprise data with Microsoft Purview, dynamic masking and role-based access — your data is handled the same way.

🔎

Catches What Eyeballs Miss

Automated validation once flagged 20,000+ bogus records across 345 organisations’ submissions — scripted checks find what manual review can’t.

🔁

Repeatable, Not One-Off

Every job is a script, not a manual edit. Re-run it next month on fresh data, or have me schedule it as an automated pipeline.

⚡

Fast & Fixed-Price

Most spreadsheet jobs are returned within 24–48 hours at a price agreed upfront from your sample — no hourly surprises.

Questions

FAQ

What file formats and systems can you work with?

CSV, Excel (xls/xlsx), JSON, XML, Google Sheets, and direct database connections (SQL Server, PostgreSQL, MySQL, Azure SQL, Snowflake). CRM and e-commerce exports from Salesforce, HubSpot, Shopify and WooCommerce are all common jobs.

How is my data kept secure?

Data is processed on encrypted storage, never shared, and deleted after delivery on request. I’m happy to sign an NDA, and for sensitive datasets I can work on a masked sample or directly inside your own environment so raw personal data never leaves your systems.

How large a dataset can you handle?

Anything from a 200-row mailing list to multi-million-row warehouse tables; the largest build I’ve delivered processed 5.6 million records. Pandas handles most jobs, and for very large volumes I switch to SQL- or Spark-based processing.

What is the difference between exact and fuzzy deduplication?

Exact dedup only catches identical rows. Fuzzy matching also catches “Jon Smith, 12 High St” vs “John Smith, 12 High Street” by scoring the similarity of names, addresses and emails, with a review file for borderline matches so nothing is merged blindly.
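A minimal sketch of similarity scoring with a review band, using the example names above. It relies on `difflib.SequenceMatcher` from the standard library as a stand-in for whichever matching library a real job would use; the abbreviation table and the two thresholds are hypothetical values chosen for illustration. Pairs that are identical after normalisation are merged automatically, while close-but-different pairs go to a review list.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical, illustrative abbreviation table.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

AUTO_MERGE = 0.99  # assumed threshold: effectively identical after normalisation
REVIEW = 0.80      # assumed threshold: similar enough for a human to check


def normalise(text):
    """Lowercase, strip punctuation and expand common abbreviations."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def similarity(a, b):
    """Similarity score between 0 and 1 on the normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def classify_pairs(records):
    """Split candidate pairs into automatic merges and manual-review cases."""
    merge, review = [], []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= AUTO_MERGE:
            merge.append((i, j, round(score, 3)))
        elif score >= REVIEW:
            review.append((i, j, round(score, 3)))
    return merge, review


records = [
    "Jon Smith, 12 High St",
    "John Smith, 12 High Street",
    "Jon Smith, 12 High St.",
    "Mary Jones, 4 Mill Road",
]

merge, review = classify_pairs(records)
print("merge:", merge)
print("review:", review)
```

Comparing every pair is quadratic in the number of records, so larger datasets usually group candidates first (for example by postcode) and only score pairs within each group.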

Can the cleaning be automated for recurring data?

Yes. Any cleaning job can be converted into a scheduled pipeline that picks up new files, cleans them and delivers the output automatically. See Database Design & Warehousing for full pipeline builds, or ask about a monthly retainer.

Got a messy dataset right now?
