Key Responsibilities:
• Design, build, and maintain a cloud-based data storage system optimized for large-scale single-cell multi-omics atlases.
• Develop and optimize ETL pipelines for data ingestion, transformation, and retrieval, ensuring high performance and scalability.
• Collaborate with the AI and bioinformatics teams to ensure that the infrastructure supports internal use cases, including training foundation models on omics data and drug discovery research.
• Create a scalable, client-facing platform for on-demand single-cell atlas generation, ensuring ease of access, robust performance, and secure data handling.
• Implement data governance, compliance, and security protocols that meet regulations such as GDPR and HIPAA, ensuring the integrity and privacy of sensitive data.
• Develop solutions to manage data versioning, lineage, and metadata tracking, providing seamless access and traceability for internal and external users.
• Continuously monitor, optimize, and scale the cloud infrastructure for growing data needs and user demands.
• Troubleshoot and resolve issues in real time, ensuring high system availability and reliability.
• Stay current with advancements in cloud technologies, omics data handling, and data engineering best practices to continually improve the platform.
Qualifications (from most to least important):
1. Proven experience designing and implementing cloud-based data storage solutions (AWS or GCP).
2. Experience building and optimizing ETL pipelines for large-scale biological datasets, particularly single-cell multi-omics.
3. Strong programming skills in Python.
4. Experience in bioinformatics or working with large-scale biological datasets is a strong plus.
5. Expertise in cloud services such as AWS (S3, Lambda, EC2) and GCP (BigQuery, Cloud Storage).
6. Strong understanding of database management systems (SQL, NoSQL) and distributed computing frameworks (e.g., Apache Spark, Dask).
7. Knowledge of data governance, compliance, and security practices (e.g., GDPR, HIPAA).
8. Experience with data formats commonly used in omics research (HDF5, Parquet, CSV, etc.).
9. Familiarity with containerization (Docker, Kubernetes) and microservices architecture.
10. Ability to work collaboratively in multidisciplinary teams and communicate effectively with technical and non-technical stakeholders.
Preferred Skills:
• Experience with AI/ML infrastructure for training large models on biological data.
• Familiarity with drug discovery processes and applications of omics data in pharmaceutical research.
• Experience developing user-friendly platforms for external clients, especially in the life sciences domain.
• Knowledge of multi-cloud strategies and hybrid cloud architecture.