Gen3 - Technical Intro (2024)

This documentation is intended for developers who want to understand the design and architecture of Gen3. If you want to contribute to the Gen3 source code, please visit our Gen3 Contributor Guidelines.

The Gen3 platform is a set of services that enables users to use data and compute resources easily from various cloud providers. It also provides a user-friendly environment to organize and query data, and run computational analysis.

Terminology

Data Object

Files on disk that are typically analyzed as a whole or in chunks. They are typically petabyte scale in data commons and sit in object storage.

Rich Data

Data that are harmonized, indexed in various databases, discoverable, and queryable.

Data management

The following diagram describes the user interactions involved in data management.

(Diagram: Gen3 - Technical Intro (1))

  1. The Fence microservice provides the authentication and authorization framework for all Gen3 services and resources.
  2. Download/upload data. Gen3 places no services between the user and cloud storage, so the user can fully leverage the cloud provider’s power. It does, however, provide tools and services that let users access protected data with temporary credentials.
  3. Windmill serves static HTML, JavaScript and image files to create a view for users to interact with Gen3 microservices.
  4. The Sheepdog microservice lets users submit rich data.
  5. The Peregrine microservice lets users run GraphQL queries on live rich data.
  6. The Indexd microservice lets users find the physical location of a data object.
  7. Gen3-arranger exposes a GraphQL query interface for a flattened/materialized view of rich data that is ETLed from the graph into Elasticsearch.

Data Submission System

A lot of data are generated during experiments and studies, and ideally they are organized and annotated in a way that describes their context. All of this ‘context’ is preserved in our ‘rich data’ database. The rich data store is presented in a graph-like relational model to depict the normalized relationships among all the concepts. Take the BloodPAC data model as an example. It describes a study conducted on many cases (aka patients), how doctors gathered clinical information about each patient, stored in nodes such as diagnosis and family history, and how the hospital gathered samples from the patient and sent them to sequencing centers, which produced sequencing files stored in submitted unaligned reads.

For a Gen3 Commons to preserve this rich data, a consistent data model with standard terminologies needs to be constructed. Our data model uses jsonschema, and the models are stored as yaml files in GitHub to make it easier for domain experts to make changes and track activity. The schema is then translated to a database ORM (psqlgraph) and used by Gen3 microservices for data validation and database interactions.
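As a rough illustration of dictionary-driven validation, the sketch below checks a submitted record against a small node schema. The `CASE_SCHEMA` dictionary and its property names are hypothetical, and the hand-rolled `validate_record` function mimics only a small part of what a jsonschema validator does. It is not Sheepdog's code.

```python
# Hypothetical node schema, written as the Python equivalent of a small
# jsonschema document. Real Gen3 dictionaries are far richer.
CASE_SCHEMA = {
    "type": "object",
    "required": ["submitter_id", "type"],
    "properties": {
        "submitter_id": {"type": "string"},
        "type": {"enum": ["case"]},
        "age_at_enrollment": {"type": "integer"},
    },
}

# Map jsonschema type names to Python types (only the ones used here).
_PY_TYPES = {"string": str, "integer": int}


def validate_record(record, schema):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required property: {name}")
    for name, value in record.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"unknown property: {name}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} not in {rule['enum']}")
        expected = _PY_TYPES.get(rule.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{name}: expected {rule['type']}")
    return errors
```

Because the rules live in data rather than in code, a domain expert can change the schema without touching the validation logic.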

Our backend currently uses Postgres. This is not necessarily the optimal choice for complicated graph traversals, but we chose this database due to its robustness as a traditional relational database. The data model that is described in jsonschema is translated to a relational data model in Postgres, where every node and edge is a table. All properties are stored as jsonb in Postgres as opposed to separate columns. While this sacrifices some query performance, it supports frequent data modeling changes that are required by domain experts.
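The table-per-node, table-per-edge layout can be sketched in a few lines. The example below uses an in-memory SQLite database with a JSON text column as a stand-in for Postgres and jsonb, purely so that it runs without a server. The table names, column names, and properties are hypothetical and do not reflect the actual psqlgraph schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per node type and one per edge type. Properties live in a single
# JSON column (TEXT here, jsonb in Postgres) instead of one column per field.
conn.executescript(
    """
    CREATE TABLE node_case   (node_id TEXT PRIMARY KEY, props TEXT NOT NULL);
    CREATE TABLE node_sample (node_id TEXT PRIMARY KEY, props TEXT NOT NULL);
    CREATE TABLE edge_sample_from_case (src_id TEXT, dst_id TEXT);
    """
)

conn.execute(
    "INSERT INTO node_case VALUES (?, ?)",
    ("c1", json.dumps({"submitter_id": "case-1"})),
)
# The second sample carries an extra property; no schema migration is needed.
conn.executemany(
    "INSERT INTO node_sample VALUES (?, ?)",
    [
        ("s1", json.dumps({"submitter_id": "sample-1", "tissue": "blood"})),
        ("s2", json.dumps({"submitter_id": "sample-2", "tissue": "tumor", "grade": 2})),
    ],
)
conn.executemany(
    "INSERT INTO edge_sample_from_case VALUES (?, ?)",
    [("s1", "c1"), ("s2", "c1")],
)


def samples_for_case(conn, case_id):
    """Traverse one edge: return the properties of every sample of a case."""
    rows = conn.execute(
        """
        SELECT s.props
        FROM node_sample AS s
        JOIN edge_sample_from_case AS e ON e.src_id = s.node_id
        WHERE e.dst_id = ?
        ORDER BY s.node_id
        """,
        (case_id,),
    ).fetchall()
    return [json.loads(props) for (props,) in rows]
```

Each hop in the graph costs one join, which is the query-performance trade-off described above, while adding a property such as `grade` requires no table change.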

Sheepdog uses the dictionary-driven ORM to validate and submit metadata, as described in the following diagram:

(Diagram: Gen3 - Technical Intro (2))

Peregrine exposes a GraphQL query interface for the normalized rich data:

(Diagram: Gen3 - Technical Intro (3))
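To make the query interface more concrete, here is a minimal sketch that builds (but does not send) a GraphQL request. The endpoint path, the host `commons.example.org`, the bearer-token header, and the node and field names are all assumptions for illustration. Check your commons' API documentation for the actual values.

```python
import json
import urllib.request

# Assumed path of the GraphQL endpoint; verify against your deployment.
GRAPHQL_PATH = "/api/v0/submission/graphql"


def build_graphql_request(base_url, token, node, fields, first=10):
    """Build a POST request asking for `fields` of the first N `node` records."""
    query = "{ %s(first: %d) { %s } }" % (node, first, " ".join(fields))
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + GRAPHQL_PATH,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


# Hypothetical usage; passing the request to urllib.request.urlopen would send it.
request = build_graphql_request(
    "https://commons.example.org", "ACCESS_TOKEN", "case", ["submitter_id"]
)
```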

Separately, users use gen3-client to request temporary URLs for raw data download/upload:

(Diagram: Gen3 - Technical Intro (4))
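The idea behind a temporary URL is that the link itself carries an expiry time and a signature, so the storage service can verify it without contacting the issuer. The sketch below shows that idea with a generic HMAC scheme. It is not the signing algorithm of any cloud provider or of Fence, and the secret, parameter names, and path are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key shared with the storage side


def sign_url(path, expires_at, secret=SECRET):
    """Append an expiry timestamp and an HMAC signature to a path."""
    message = f"{path}:{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires_at, 'sig': signature})}"


def is_valid(url, now, secret=SECRET):
    """Accept the URL only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    message = f"{parsed.path}:{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(signature, expected) and now < expires_at
```

Because the path and expiry are part of the signed message, changing either one invalidates the link.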

Data Denormalization System

This is an alpha feature

After we collect valuable data from various submitters, we would like to expose it in a user-friendly web interface. Understanding the data model and knowing how to traverse the graph is intimidating for a general Gen3 user, so we created an ETL application, Tube, that denormalizes the graph into several types of flat documents to cater to the major use cases.

Tube is driven by configuration files that describe the flat document structure and the mapping logic from the graph model, so it is generic and can support the different data models used in different commons. For most biomedical commons, two types of flat documents satisfy the majority of users:

  • A file-centric document that denormalizes biospecimen and clinical attributes for each file. This targets bioinformaticians who want to filter by specific clinical/biospecimen attributes and select files on which to run analysis.
  • A case-centric document that denormalizes biospecimen and clinical attributes for each case. This targets clinicians who want to see distributions based on clinical attributes among cases. Most of the time, these cases represent patients.
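The two document types can be sketched over a toy graph. The example below is an illustration of denormalization in plain Python, not Tube's implementation. The node contents and field names are hypothetical.

```python
# Toy graph: each file points to a sample, each sample points to a case.
cases = {"c1": {"gender": "female"}, "c2": {"gender": "male"}}
samples = {
    "s1": {"case": "c1", "tissue": "blood"},
    "s2": {"case": "c2", "tissue": "tumor"},
}
files = {
    "f1": {"sample": "s1", "file_name": "a.fastq"},
    "f2": {"sample": "s1", "file_name": "b.fastq"},
    "f3": {"sample": "s2", "file_name": "c.fastq"},
}


def file_centric():
    """One flat document per file, with sample and case attributes copied in."""
    docs = []
    for file_id, file_node in sorted(files.items()):
        sample = samples[file_node["sample"]]
        case = cases[sample["case"]]
        docs.append({
            "file_id": file_id,
            "file_name": file_node["file_name"],
            "tissue": sample["tissue"],
            "case_id": sample["case"],
            "gender": case["gender"],
        })
    return docs


def case_centric():
    """One flat document per case, with its samples and files summarized."""
    docs = []
    for case_id, case in sorted(cases.items()):
        own_samples = {sid: s for sid, s in samples.items() if s["case"] == case_id}
        file_count = sum(1 for f in files.values() if f["sample"] in own_samples)
        docs.append({
            "case_id": case_id,
            "gender": case["gender"],
            "tissues": sorted({s["tissue"] for s in own_samples.values()}),
            "file_count": file_count,
        })
    return docs
```

A file-centric document answers "which files come from blood samples of female cases?" with a single filter, while the case-centric document supports counting cases by clinical attribute without any joins.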

Living document for data exploration architecture

Workspace Systems

Workspaces are the compute component of a data commons. They allow users to execute analysis tasks on data without moving the data to another remote environment. Workspaces come in many forms; currently, Gen3 integrates what it calls lightweight workspaces. Lightweight workspaces are designed for quick analysis of data and for creating workflow jobs in the workflow system.

Lightweight Workspaces

JupyterHub is a service that allows multiple users to run multiple Jupyter notebooks on a central server. The isolation of the user notebooks depends on the spawner used, and in this case relies on the isolation provided between Kubernetes pods. The Gen3 JupyterHub service is based on Zero to JupyterHub and Kubeflow.

The following diagram shows the authorization flow for the JupyterHub instances. We use Revproxy and Fence as an API gateway for these workspaces. JupyterHub is configured with the remote user auth plugin so that users are authenticated based on the REMOTE_USER header.

(Diagram: Gen3 - Technical Intro (5))
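Header-based authentication boils down to trusting a username that the gateway has already verified. The function below is a minimal sketch of that step. The function name is hypothetical and this is not the remote user auth plugin's code.

```python
def authenticate(headers):
    """Return the username from the REMOTE_USER header, or None if absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user = normalized.get("remote_user", "").strip()
    return user or None
```

This pattern is only safe when the gateway removes any REMOTE_USER header supplied by the client and sets it itself after authentication, since the backend accepts the value without further checks.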

JupyterHub runs in a container with an HTTP proxy. The proxy has dynamic routing that routes either to the hub or to the user’s spawned jupyter notebook container.
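Dynamic routing of this kind can be sketched as a prefix table that the hub updates when notebook servers start and stop. The class below is a simplified illustration, not the proxy's actual code. The target addresses are hypothetical, and the `/user/<name>/` prefix follows JupyterHub's URL convention.

```python
class RoutingTable:
    """Longest-prefix routing: user prefixes win, everything else goes to the hub."""

    def __init__(self, hub_target):
        # The default route "/" matches every request path.
        self.routes = {"/": hub_target}

    def add_user(self, username, target):
        """Register a route when a user's notebook server is spawned."""
        self.routes[f"/user/{username}/"] = target

    def remove_user(self, username):
        """Drop the route when the notebook server stops."""
        self.routes.pop(f"/user/{username}/", None)

    def resolve(self, path):
        """Return the target for a request path (assumed to start with "/")."""
        matches = [prefix for prefix in self.routes if path.startswith(prefix)]
        return self.routes[max(matches, key=len)]
```

For example, after `add_user("alice", "http://10.0.0.5:8888")`, a request for `/user/alice/tree` goes to Alice's notebook container, while `/hub/login` and requests for users without a running server fall through to the hub.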

JupyterHub is deployed into the default namespace for the commons, but user pods are deployed into a dedicated jupyter-pods namespace to provide an added layer of isolation. This is accomplished using the Kubespawner plugin for JupyterHub. Eventually, users will be deployed into their own Kubernetes namespaces so that they can use the K8s API to spin up clusters for Spark or Dask. We are tracking issues related to the creation and monitoring of multiple namespaces in Kubespawner. We use a customized JupyterHub that contains additional code to cull idle notebooks after several hours of inactivity. This automatically scales the cluster back down when notebooks are no longer in use.
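The culling decision itself is a simple comparison of each server's last activity against an idle threshold. The sketch below shows that selection step only. The function name, the user names, and the three-hour threshold are hypothetical, and stopping the selected servers is left out.

```python
from datetime import datetime, timedelta, timezone


def idle_servers(last_activity, now, max_idle):
    """Return the users whose notebook has been idle longer than max_idle."""
    return sorted(
        user for user, seen in last_activity.items() if now - seen > max_idle
    )


# Hypothetical snapshot of last-activity timestamps per user.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
activity = {
    "alice": now - timedelta(minutes=10),
    "bob": now - timedelta(hours=5),
}
to_cull = idle_servers(activity, now, max_idle=timedelta(hours=3))
```

Once idle pods are removed, the nodes they occupied become candidates for scale-down.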

Notebook servers are configured with persistent storage mounted into /home/jovyan/pd for users to store scripts and configurations which they wish to persist past notebook shutdown. In the future we would like to change this to have the storage for the user in the notebook backed by the cloud object storage (S3 or GCS) to improve scalability and the ability to load data into the containers for users.

Currently, we support user-selectable notebook containers and resource allocations from a prepopulated list. Earth science and bioinformatics notebooks are available with popular libraries preconfigured.

We also configure a prepuller daemonset on K8s to pull the docker images for common user notebooks to each node in the cluster. This significantly speeds up launch time as these images can be many GB in size.

Full Workspaces

Full workspaces, i.e. workflow systems that run analysis pipelines at scale over data, are yet to be implemented in Gen3.
