Data Engineer

DeepLife is hiring!

About

DeepLife is a pre-Series A startup focused on addressing the urgent need to increase drug discovery reliability by acting on the earliest step: drug target identification. This consists of identifying a molecular target, such as a protein, that will trigger the transition from diseased to healthy cells. With current methods, only 1 target in 10,000 reaches the market, leading to a significant loss of time and effort in the community.

Our approach is to leverage the recent revolution in omics data, which measures cell activity precisely and at large scale, and to build foundation models that mimic cell behavior in various contexts and identify the optimal trigger to reverse the disease state.

Half of the team today is dedicated to building the largest omics database, also known as an omics atlas, to map all human body tissues and diseases and reduce experimental biases.

We offer a research-friendly environment, with 90% of the company holding a PhD, and with academic collaborations and publications. The team is international, with more than 10 nationalities. The company is remote-first: most work is done remotely, with regular events organized in our Paris offices.

Job Description

As a Data Engineer, you will lead the design, implementation, and maintenance of a cloud-based data storage system that hosts large-scale single-cell multi-omics atlases. You will be responsible for building infrastructure to support internal projects—such as training large-scale foundation models on omics data and developing drug discovery use cases—as well as delivering client solutions through an on-demand single-cell atlas generation platform.

You’ll collaborate closely with bioinformatics experts, data scientists, software engineers, and business teams to ensure smooth data integration and optimal performance. This role requires expertise in managing large datasets, creating scalable solutions, and optimizing for performance and accessibility in a highly dynamic and impactful environment.

Preferred Experience

Key Responsibilities:

• Design, build, and maintain a cloud-based data storage system optimized for large-scale single-cell multi-omics atlases.

• Develop and optimize ETL pipelines for data ingestion, transformation, and retrieval, ensuring high performance and scalability.

• Collaborate with the AI and bioinformatics teams to ensure that the infrastructure supports internal use cases, including training foundation models on omics data and drug discovery research.

• Create a scalable, client-facing platform for on-demand single-cell atlas generation, ensuring ease of access, robust performance, and secure data handling.

• Implement data governance, compliance, and security protocols (e.g., GDPR, HIPAA) to ensure the integrity and privacy of sensitive data.

• Develop solutions to manage data versioning, lineage, and metadata tracking, providing seamless access and traceability for internal and external users.

• Continuously monitor, optimize, and scale the cloud infrastructure for growing data needs and user demands.

• Troubleshoot and resolve issues in real time, ensuring high system availability and reliability.

• Stay updated with advancements in cloud technologies, omics data handling, and data engineering best practices to continually improve the platform.

Qualifications (From Most Important to Least):

1. Proven experience designing and implementing cloud-based data storage solutions (AWS or GCP).

2. Experience building and optimizing ETL pipelines for large-scale biological datasets, particularly single-cell multi-omics.

3. Strong programming skills in Python.

4. Experience in bioinformatics or working with large-scale biological datasets is a strong plus.

5. Expertise in cloud services such as AWS S3, Lambda, and EC2, and GCP BigQuery and Cloud Storage.

6. Strong understanding of database management systems (SQL, NoSQL) and distributed computing frameworks (e.g., Apache Spark, Dask).

7. Knowledge of data governance, compliance, and security practices (e.g., GDPR, HIPAA).

8. Experience with data formats commonly used in omics research (HDF5, Parquet, CSV, etc.).

9. Familiarity with containerization (Docker, Kubernetes) and microservices architecture.

10. Ability to work collaboratively in multi-disciplinary teams and communicate effectively with technical and non-technical stakeholders.

Preferred Skills:

• Experience with AI/ML infrastructure for training large models on biological data.

• Familiarity with drug discovery processes and applications of omics data in pharmaceutical research.

• Experience developing user-friendly platforms for external clients, especially in the life sciences domain.

• Knowledge of multi-cloud strategies and hybrid cloud architecture.

Additional Information

Contract Type: Full-Time
Location: Paris
Remote: Fully remote possible
