Data Engineering · January 10, 2025

Why Your Healthcare Data Platform Is Still Fragmented (And How to Fix It)

Most healthcare organizations are drowning in data but starving for insights. Here's how to build a modern platform that actually works.

I've modernized data platforms for telehealth organizations, payers, and health systems. The starting point is always the same: fragmented systems, legacy ETL pipelines, and teams frustrated by slow analytics and inconsistent data.

The path forward isn't another data warehouse or hiring more data engineers. It's about rethinking your architecture from first principles.

The Fragmentation Problem

Most healthcare organizations have:

  • Multiple EHRs across different facilities or acquisitions
  • Business systems (Salesforce, ADP, billing platforms) that don't talk to each other
  • Claims data in one place, clinical data in another, quality measures somewhere else
  • Legacy ETL jobs that break when source systems change
  • No single source of truth for longitudinal patient records

Sound familiar? This isn't a technology problem. It's an architecture problem.

The Lakehouse + Medallion Approach

Here's what actually works: combine the flexibility of a data lake with the structure of a data warehouse using a Lakehouse architecture with Medallion layers.

Bronze Layer (Raw Data)

  • Ingest everything as-is from source systems
  • No transformations, just schema validation
  • High-velocity ingestion with Kafka, Kinesis, or Pub/Sub
  • Store in Parquet or Delta format for performance
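The Bronze step is deliberately dumb: validate the shape, land the record untouched. A minimal sketch of that validate-then-land logic in Python — the field names in `EXPECTED_FIELDS` and the hourly partition layout are illustrative, and a real pipeline would write Parquet or Delta via pyarrow or Spark rather than JSON lines:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative schema for a hypothetical encounter feed: field -> required type
EXPECTED_FIELDS = {"patient_id": str, "event_type": str, "event_ts": str}

def validate(record: dict) -> bool:
    """Schema validation only -- Bronze applies no transformations."""
    return all(isinstance(record.get(f), t) for f, t in EXPECTED_FIELDS.items())

def land_bronze(records: list[dict], root: Path) -> Path:
    """Append valid records as-is to an hourly-partitioned Bronze path."""
    hour = datetime.now(timezone.utc).strftime("%Y-%m-%d/%H")
    part = root / "bronze" / "encounters" / hour
    part.mkdir(parents=True, exist_ok=True)
    out = part / "part-0000.jsonl"
    with out.open("a") as f:
        for r in records:
            if validate(r):  # reject malformed rows, keep the rest raw
                f.write(json.dumps(r) + "\n")
    return out
```

The point is the contract, not the format: nothing downstream of Bronze ever has to guess what a record looks like.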

Silver Layer (Cleaned & Standardized)

  • Transform to FHIR-based patient models
  • Apply data quality rules and deduplication
  • Create longitudinal patient records across systems
  • Use dbt for declarative transformations
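In practice the Silver transformations live in dbt SQL, but the mapping logic is easy to sketch in Python. The raw field names (`mrn`, `fname`, ...) are illustrative, and the dedupe here is a stand-in for real patient match/merge logic:

```python
def to_fhir_patient(row: dict) -> dict:
    """Map one raw EHR row to a minimal FHIR R4 Patient resource."""
    return {
        "resourceType": "Patient",
        "identifier": [{"system": "urn:example:mrn", "value": row["mrn"]}],
        "name": [{"family": row["lname"], "given": [row["fname"]]}],
        "birthDate": row["dob"],
    }

def dedupe(patients: list[dict]) -> list[dict]:
    """Keep one record per MRN -- real systems use probabilistic matching."""
    seen, out = set(), []
    for p in patients:
        key = p["identifier"][0]["value"]
        if key not in seen:
            seen.add(key)
            out.append(p)
    return out
```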

Gold Layer (Business Logic)

  • Population health cohorts
  • HEDIS measure calculations
  • Quality metrics (PHQ-9, GAD-7, A1c, BP)
  • Risk stratification models
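Gold logic is where clinical definitions live. As one sketch: building a cohort of patients whose most recent PHQ-9 score suggests at least moderate depression. The input shape is illustrative, not a fixed schema; 10 is the commonly used PHQ-9 cutoff for moderate severity:

```python
def phq9_cohort(observations: list[dict], threshold: int = 10) -> set:
    """Patients whose most recent PHQ-9 score meets the threshold."""
    latest = {}
    for o in observations:
        if o["code"] != "phq9":
            continue
        pid = o["patient_id"]
        # keep only the most recent observation per patient
        if pid not in latest or o["ts"] > latest[pid]["ts"]:
            latest[pid] = o
    return {pid for pid, o in latest.items() if o["value"] >= threshold}
```

The same most-recent-value-per-patient pattern covers A1c, blood pressure, and GAD-7 cohorts with only the code and threshold swapped.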

This isn't theoretical. We built this for the Defense Health Agency's telehealth operations on AWS GovCloud — processing patient encounters, claims, and clinical program data hourly.

Why FHIR Matters (Even If You're Not Exchanging Data)

FHIR isn't just for interoperability. It's the best framework for building a unified patient data model across fragmented systems.

Here's why:

  1. Semantic interoperability — FHIR resources (Patient, Encounter, Observation, Condition) provide a common vocabulary across EHRs
  2. Longitudinal patient records — Bundle all patient data (demographics, encounters, labs, meds, diagnoses) in a standardized format
  3. Extension support — Add custom fields for payer-specific or clinical program data without breaking the standard
  4. Future-proofing — When CMS mandates FHIR API access, you're already compliant

We mapped EHR data, claims, and clinical programs to FHIR Patient, Encounter, Claim, and Observation resources. This gave us a single, queryable patient view that worked across all source systems.
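That "single, queryable patient view" can be expressed as a FHIR Bundle: gather everything the platform holds for one patient, regardless of source system. A minimal sketch, assuming each resource carries a `subject` reference of the form `Patient/<id>` (as Encounter and Observation do in FHIR R4):

```python
def patient_bundle(patient_id: str, resources: list[dict]) -> dict:
    """Assemble a FHIR collection Bundle of all resources for one patient."""
    entries = [
        r for r in resources
        if r.get("subject", {}).get("reference") == f"Patient/{patient_id}"
    ]
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": r} for r in entries],
    }
```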

Real-Time vs. Batch: When Each Makes Sense

Not everything needs to be real-time. Here's when to use each:

Real-time streaming (Kafka, Pub/Sub, Kinesis)

  • Patient encounter events (admissions, discharges, transfers)
  • Clinical alerts and notifications
  • Real-time dashboards for operational monitoring

Batch processing (Airflow, dbt, Glue)

  • HEDIS measure calculations (quarterly or annual)
  • Historical claims data loads
  • Population health cohort analysis
  • Quality metric aggregations

We run hourly batch pipelines for most workloads. Real-time adds complexity and cost — only use it when latency truly matters.
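The hourly batch pattern reduces to incremental loads driven by a watermark: each run picks up only rows newer than the last processed timestamp, then persists the new high-water mark. A sketch (the `updated_at` field is illustrative; in production this lives in an Airflow task or dbt incremental model):

```python
def incremental_batch(rows: list[dict], watermark: str) -> tuple[list[dict], str]:
    """Select rows newer than the watermark; return them plus the next watermark."""
    new = [r for r in rows if r["updated_at"] > watermark]
    # if nothing is new, carry the old watermark forward unchanged
    next_wm = max((r["updated_at"] for r in new), default=watermark)
    return new, next_wm
```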

The Right Tech Stack

Here's what we've used successfully across multiple healthcare platforms:

Cloud platforms:

  • GCP (BigQuery, Pub/Sub, Dataflow) for most use cases
  • AWS GovCloud (Redshift, Kinesis, Glue, Step Functions) for FedRAMP compliance
  • Databricks for lakehouse architecture + ML workflows

Orchestration & transformation:

  • Airflow for workflow orchestration
  • dbt for SQL-based transformations
  • Great Expectations or dbt tests for data quality

BI & analytics:

  • Looker for self-service analytics
  • Semantic layer (dbt metrics or LookML) for consistent business logic

Governance: Not Optional

Data governance isn't a compliance checkbox. It's how you ensure your platform is actually trusted and used.

Key components:

  • Data lineage — Track data flow from source systems to analytics (we use dbt docs + metadata APIs)
  • Quality monitoring — Automated checks on completeness, accuracy, and freshness
  • Access controls — Row-level security for multi-tenant environments
  • Audit logging — Who accessed what data, when, and why (HIPAA requirement)
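The audit-logging requirement is simpler than it sounds: every access to patient data records who, what, when, and why. A miniature version as a Python decorator — the log shape and the in-memory list are illustrative, since in practice the trail goes to an append-only store:

```python
import functools
from datetime import datetime, timezone

AUDIT_LOG: list[dict] = []  # stand-in for an append-only audit store

def audited(purpose: str):
    """Record who accessed which patient's data, when, and why."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(user: str, patient_id: str, *args, **kwargs):
            AUDIT_LOG.append({
                "user": user,
                "patient_id": patient_id,
                "action": fn.__name__,
                "purpose": purpose,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return fn(user, patient_id, *args, **kwargs)
        return inner
    return wrap

@audited(purpose="care-coordination")
def read_chart(user: str, patient_id: str) -> dict:
    return {"patient_id": patient_id}  # stand-in for a real query
```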

The 90-Day Roadmap

Here's how to get from fragmented chaos to a working modern platform in 90 days:

Weeks 1-2: Assessment

  • Map all source systems and data flows
  • Identify critical use cases (population health, quality measures, risk adjustment)
  • Choose cloud platform and tech stack

Weeks 3-6: Foundation

  • Set up Bronze layer ingestion for 2-3 critical sources
  • Build Silver layer FHIR transformations for Patient + Encounter
  • Implement basic data quality checks
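"Basic data quality checks" at this stage can be as simple as a completeness score per table — the kind of check dbt tests or Great Expectations formalize later. A sketch (field names illustrative):

```python
def completeness(rows: list[dict], required: list[str]) -> float:
    """Share of rows with every required field populated."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    return ok / len(rows)
```

Alert when the score drops below a threshold you pick per source; even this crude signal catches most upstream schema breaks.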

Weeks 7-10: Gold Layer & Analytics

  • Build 1-2 critical Gold layer use cases (e.g., active patient cohorts)
  • Set up Looker dashboards
  • Implement governance and access controls

Weeks 11-12: Production Hardening

  • Add monitoring and alerting
  • Document data models and lineage
  • Train users and hand off to internal teams

The Bottom Line

Fragmented data platforms aren't a technology problem. They're an architecture problem.

Stop adding more ETL jobs. Stop building more one-off integrations. Build a proper Lakehouse with Medallion layers, use FHIR for patient data models, and invest in governance.

The organizations that succeed are the ones that treat data platforms as a strategic enabler, not an IT project.

Need help modernizing your healthcare data platform? Let's discuss your roadmap.