We are seeking a Healthcare Data Engineer to architect, develop, and scale pipelines that harmonize and integrate EHR data across disparate datasets. In this hands-on role, you will design and maintain high-throughput ETL workflows, apply standards such as HL7 FHIR, OMOP, and SNOMED CT to ensure interoperability, and collaborate with bioinformatics, clinical, product, and engineering teams to deliver secure, research-ready data for our expanding disease-prediction pipeline.
Data Standardization & Interoperability
Map heterogeneous data to HL7 FHIR, OMOP, SNOMED CT, ICD-10/11, LOINC, RxNorm, and related vocabularies.
Maintain high fidelity and minimal data loss through ontology-driven mapping and validation.
Design & Implement ETL Pipelines
Work with the engineering team to improve the workflows that ingest, de-identify, and harmonize clinical data from various EHR systems.
Integrate structured and unstructured data (clinical notes, imaging, lab results) into a unified schema.
Cloud Architecture & Scalability
Work with the engineering team to maintain a secure, cloud-based infrastructure capable of supporting petabyte-scale datasets.
Leverage distributed computing frameworks (e.g., Apache Spark, Databricks) for high-throughput data processing.
Privacy & Security
Ensure compliance with HIPAA, GDPR, and other applicable regulations.
Implement federated data-sharing patterns and robust encryption for data in transit and at rest.
Data Quality & Validation
Work with the engineering team to build automated anomaly-detection pipelines for real-time data quality checks.
Collaboration & Communication
Work with cross-functional teams (engineering, product, clinical, lab) to set timelines and roadmaps.
Share daily progress and surface blockers early while following established best practices in healthcare data engineering.
PhD in CS, Bioinformatics, or a related field; OR 5+ years of experience in data engineering, with at least 2 years specific to healthcare or clinical informatics.
Hands-on knowledge of HL7 FHIR, OMOP, SNOMED CT, and other healthcare data standards.
Experience working with large biobanks (e.g., UK Biobank, All of Us) is required.
Proficiency in SQL and one or more programming languages (e.g., Python, C++).
Experience with cloud platforms (AWS, Azure, or GCP) and distributed frameworks (Spark, Databricks).
Familiarity with privacy-preserving architectures, data encryption, and federated data models.
Demonstrated success in building ETL pipelines.
Strong communication skills to translate complex data requirements into actionable plans for cross-functional teams.
Nice to have: Familiarity with genomic data and/or NLP for clinical text.