We request that any use of data obtained from the Global Biobank Engine be cited in publications using the following format:
We also ask that the developers of the engine be acknowledged as follows:
Data presented in GBE is from the UK Biobank dataset release version 2. To minimize the impact of confounders and unreliable observations, we used a subset of individuals that satisfied all of the following criteria: (1) self-reported white British ancestry, (2) used to compute principal components, (3) not marked as outliers for heterozygosity and missing rates, (4) do not show putative sex chromosome aneuploidy, and (5) have at most 10 putative third-degree relatives. These criteria are reported by the UK Biobank in the file “ukb_sqc_v2.txt” in the following columns, respectively: (1) “in_white_British_ancestry_subset,” (2) “used_in_pca_calculation,” (3) “het_missing_outliers,” (4) “putative_sex_chromosome_aneuploidy,” and (5) “excess_relatives.” We removed 151,169 individuals who did not meet these criteria. Similar criteria were applied to the exome sequencing data from UK Biobank.
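As a rough illustration of the five criteria above, the following sketch keeps only rows that pass every flag. It is not the pipeline used for GBE. It assumes that the file is read with a header row carrying the column names quoted above, that each flag is coded as 1 (set) or 0 (not set), and that a hypothetical "sample_id" column identifies individuals.

```python
import csv
import io

# Assumed encoding (not confirmed by the text above): "1" = flag set, "0" = not set.
REQUIRED_VALUES = {
    "in_white_British_ancestry_subset": "1",    # (1) keep
    "used_in_pca_calculation": "1",             # (2) keep
    "het_missing_outliers": "0",                # (3) drop outliers
    "putative_sex_chromosome_aneuploidy": "0",  # (4) drop putative aneuploidy
    "excess_relatives": "0",                    # (5) drop excess relatives
}


def passes_qc(row):
    """Return True when a row satisfies all five criteria."""
    return all(row[column] == value for column, value in REQUIRED_VALUES.items())


def filter_samples(handle, delimiter=" "):
    """Read a delimited table with a header row and keep rows that pass QC."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if passes_qc(row)]


# Toy input; "sample_id" is a hypothetical column added for illustration.
header = " ".join(["sample_id", *REQUIRED_VALUES])
toy_table = "\n".join([header, "A 1 1 0 0 0", "B 1 1 1 0 0", "C 0 1 0 0 0"])
kept_ids = [row["sample_id"] for row in filter_samples(io.StringIO(toy_table))]
print(kept_ids)
```

In the toy table, individual B fails criterion (3) and individual C fails criterion (1), so only A is retained.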
We processed summary statistics from the United States' Million Veteran Program.
Genome-wide association analysis was performed with Firth-fallback using PLINK v2.00a (17 July 2017). We used the following covariates in our analysis: age, sex, array type, and the first four principal components, where array type is a binary variable that represents whether an individual was genotyped with the UK Biobank Axiom Array or the UK BiLEVE Axiom Array. For variants that were specific to one array, we did not use array type as a covariate.
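The covariate rule above can be sketched as a small helper that drops the array covariate for array-specific variants. The covariate labels ("age", "sex", "array", "PC1" to "PC4") and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical covariate labels; the real column names are not given in the text.
BASE_COVARIATES = ["age", "sex", "PC1", "PC2", "PC3", "PC4"]


def covariates_for_variant(on_both_arrays: bool) -> list[str]:
    """Return the covariate list for one variant.

    The array-type covariate is included only when the variant is present on
    both genotyping arrays; for array-specific variants it would be constant
    among genotyped individuals, so it is left out.
    """
    covariates = list(BASE_COVARIATES)
    if on_both_arrays:
        covariates.insert(2, "array")
    return covariates


print(covariates_for_variant(on_both_arrays=True))
print(covariates_for_variant(on_both_arrays=False))
```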
Current best practices for determining significance of associations with p-values in genetic association studies require that the significance threshold be adjusted to reflect the number of associations tested, a method known as the Bonferroni correction. For GWAS, 820,897 tests are performed, one for each variant on the array. For PheWAS, 1,766 tests are performed for each variant, one for each phenotype tested. Dividing a nominal threshold of 0.05 by these counts gives p-value cutoffs of approximately 6.0×10⁻⁸ for GWAS and 2.8×10⁻⁵ for PheWAS.
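The arithmetic behind these cutoffs can be checked with a short sketch. The nominal level of 0.05 is an assumption; it is the value that reproduces the reported thresholds to within rounding.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction.

    alpha=0.05 is an assumed nominal level, not stated explicitly in the text.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


gwas_cutoff = bonferroni_threshold(820_897)  # one test per variant on the array
phewas_cutoff = bonferroni_threshold(1_766)  # one test per phenotype for a variant

print(f"GWAS cutoff:   {gwas_cutoff:.2e}")    # 6.09e-08
print(f"PheWAS cutoff: {phewas_cutoff:.2e}")  # 2.83e-05
```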
The method used for aggregate analysis shown on GBE is described in detail in our manuscript, “Bayesian Model Comparison for rare variant association studies of multiple phenotypes”. Briefly, we run a model called MRP, which considers correlation, scale, and location of genetic effects across a group of genetic variants, phenotypes, and studies. By sharing information across rare variants and phenotypes, we improve our ability to identify rare variants associated with disease compared to considering a single rare variant and a single phenotype.
We used the variant filter table.tsv file available on GitHub (commit 6f9f726) to filter variants on the UK Biobank array for use with MRP. We first chose variants with minor allele frequency less than 1%. We then filtered out all variants with all filters less than one. This removes variants with missingness greater than 1% (calculated on an array-specific basis for array-specific variants) or Hardy-Weinberg equilibrium p < 10⁻⁷. This also removes some PTVs for which manual inspection revealed irregular cluster plots. We LD-pruned the variants by keeping only those with ld equal to one. We included missense variants and PTVs indicated by the following annotations: missense variant, stop gained, frameshift variant, splice acceptor variant, splice donor variant, splice region variant, start lost, stop lost.
The Bayes Factor (BF) is a scoring method used to convey confidence of one hypothesis over another, i.e. the alternative hypothesis over the null hypothesis. We present a log BF as a measure of support for results of the rare variant aggregate analyses. In practice, there is no threshold that indicates significance for Bayes Factors, unlike p-values. However, a log BF greater than 3 indicates moderate evidence for the alternative hypothesis. See Kass & Raftery (1995) for a thorough discussion on Bayes Factors.
The purpose of the Genetic Correlation App is to display genetic correlation estimates from the multivariate polygenic mixture model (MVPMM). Users can select phenotypes that are available in GBE from the search box at the bottom of the page.
The following is a description of each of the relevant variables within the application.
Users can filter by z-score, pi2, genetic correlation, and phenotype category.
For a video walkthrough of the application, please see this YouTube video.
We combined cancer diagnoses from the UK Cancer Register with self-reported diagnoses from the UK Biobank questionnaire to define cases and controls for cancer GWAS. Individual level ICD-10 codes from the UK Cancer Register, Data-Field 40006, and the National Health Service, Data-Field 41202, in the UK Biobank were mapped to the self-reported cancer codes, Data-Field 20001. The mapping was performed via manual curation of ICD-10 codes for each of the self-reported cancer codes. UKB field codes for self-reported cancer were created with a tree structure such that more specific cancer subtypes (e.g., “malignant melanoma”) are nested under more general categories (“skin cancer”). This tree structure was preserved in the field code to ICD-10 mapping. For example, the self-reported phenotype of “lip cancer” was mapped to its field code, 1010, and the ICD-10 codes for “malignant neoplasm of lip”, C00 and C000-C009. After this mapping, individuals with an affirmative entry in one or more of the phenotype collections (self-reported cancer, cancer registry, and the NHS) were included in the case cohort for the GWAS. No secondary neoplasms were included in the cancer phenotype mappings.
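The case definition above amounts to a union across the three sources. The sketch below illustrates it for the “lip cancer” example only. The function and argument names are hypothetical, and the nested tree structure of the cancer codes is not modeled here.

```python
# Mapping for the "lip cancer" example from the text:
# self-reported field code 1010 <-> ICD-10 codes C00 and C000-C009.
LIP_CANCER_FIELD_CODE = 1010
LIP_CANCER_ICD10 = {"C00"} | {f"C00{digit}" for digit in range(10)}


def is_case(self_reported_codes, registry_icd10, hospital_icd10,
            field_code=LIP_CANCER_FIELD_CODE, icd10_codes=LIP_CANCER_ICD10):
    """Return True if any of the three sources has an affirmative entry.

    self_reported_codes: codes from Data-Field 20001
    registry_icd10:      ICD-10 codes from Data-Field 40006 (cancer register)
    hospital_icd10:      ICD-10 codes from Data-Field 41202 (NHS)
    """
    return (
        field_code in set(self_reported_codes)
        or bool(icd10_codes & set(registry_icd10))
        or bool(icd10_codes & set(hospital_icd10))
    )


print(is_case([1010], [], []))    # self-report only
print(is_case([], ["C005"], []))  # cancer register only
print(is_case([], [], ["C44"]))   # unrelated ICD-10 code
```

The first two calls return True and the third returns False, since C44 is not among the ICD-10 codes mapped to lip cancer.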
We combined disease diagnoses from the UK National Health Service Hospital Episode Statistics with self-reported diagnoses from the UK Biobank questionnaire to define cases and controls for non-cancer phenotypes (referred to as “high confidence” phenotypes), using the following procedure. ICD-10 codes (Data-Field 41202) were grouped with self-reported non-cancer illness codes (Data-Field 20002) that were closely related. This was done by first creating a computationally generated candidate list of closely related ICD-10 codes and self-reported non-cancer illness codes, then manually curating the matches. The computational mapping was performed by calculating the token set ratio between the ICD-10 code description and the self-reported illness code description using the FuzzyWuzzy Python package. The high-scoring ICD-10 matches for each self-reported illness were then manually curated to ensure high confidence mappings. Manual curation was required to validate the matches because fuzzy string matching may return words that are similar in spelling but not in meaning. For example, to create a hypertension cohort, the code description from Data-Field 20002 (“Hypertension”) was compared with all ICD-10 code descriptions, and all closely related codes were returned (“I10: Essential (primary) hypertension” and “I95: Hypotension”). After manual curation, code I10 would be kept and code I95 would be discarded. The following paper describes more about the disease outcome phenotypes.
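To illustrate why manual curation is needed, the sketch below scores the hypertension example with a simplified token set ratio. It is a standard-library approximation built on difflib, not the FuzzyWuzzy implementation, so its scores may differ from those of the package. The candidate cutoff of 80 and the extra “Asthma” entry are assumptions added for illustration.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio(s1, s2):
    """Simplified token set ratio (approximation of the FuzzyWuzzy scorer)."""
    t1, t2 = _tokens(s1), _tokens(s2)
    shared = " ".join(sorted(t1 & t2))
    combined1 = f"{shared} {' '.join(sorted(t1 - t2))}".strip()
    combined2 = f"{shared} {' '.join(sorted(t2 - t1))}".strip()
    return max(
        _ratio(shared, combined1),
        _ratio(shared, combined2),
        _ratio(combined1, combined2),
    )


self_reported = "Hypertension"
icd10_descriptions = {
    "I10": "Essential (primary) hypertension",
    "I95": "Hypotension",
    "J45": "Asthma",  # unrelated entry added for contrast
}
scores = {code: token_set_ratio(self_reported, text)
          for code, text in icd10_descriptions.items()}

CUTOFF = 80  # hypothetical threshold for the candidate list
candidates = sorted(code for code, score in scores.items() if score >= CUTOFF)
print(scores)
print(candidates)
```

Both I10 and I95 pass the cutoff, although hypotension is clinically the opposite of hypertension, which is the kind of match that manual curation removes.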
We used data from Category 100034 (Family history–Touchscreen–UK Biobank Assessment Centre) to define “cases” and controls for family history phenotypes. This category contains data from the touchscreen questionnaire on questions related to family size, sibling order, family medical history (of parents and siblings), and age of parents (age of death if died). We focused on Data Coding 20107: Illness of father and 20110: Illness of mother.
We applied technical covariate correction for 35 blood and urine biomarker phenotypes. Those phenotypes are listed under the "Biomarkers" group, and the phenotyping procedure is described in detail in the following manuscript:
We defined quantitative and binary phenotypes using data fields in UK Biobank. Typically, we used one data field to derive a phenotype in Global Biobank Engine. Sometimes, we defined multiple phenotypes in Global Biobank Engine from a single data field in UK Biobank. This is the case, for example, when the source UK Biobank field contains an answer for categorical traits.
For example, field 50 in UK Biobank represents standing height. As shown on the field description page, each individual may have up to four repeated measurements. Using the information in this field, we defined our phenotype Standing height (GBE phenotype code: INI50) by taking the median of non-NA values, as described in the following publication:
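The median rule for repeated measurements can be sketched as follows. This is a minimal illustration that assumes missing values are represented as None or NaN; the function name is hypothetical.

```python
import math
from statistics import median


def median_of_non_missing(measurements):
    """Median of the non-missing values, or None if all values are missing.

    Missing values are assumed to be encoded as None or NaN.
    """
    values = [v for v in measurements if v is not None and not math.isnan(v)]
    return median(values) if values else None


# Up to four repeated measurements of standing height (cm) for one individual.
print(median_of_non_missing([170.0, None, 171.0, float("nan")]))  # 170.5
print(median_of_non_missing([None, None, None, None]))            # None
```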
For non-quantitative fields in UK Biobank, we performed manual curation to define cases and controls. Sometimes, we split categorical results in a single field in UK Biobank into a series of binary traits. For cancer, family history, and disease outcome phenotypes, please see the descriptions above for how we defined cases and controls. For other phenotypes, we will update the descriptions in the future.
We provide phenotype groupings in Global Biobank Engine. Those groupings are shown, for example, in the PheWAS plot on the variant page. The disease outcomes, cancers, family history information, and biomarker phenotypes are each grouped as one category. For other phenotypes, we used a modified version of the "Primary Category of Origin" information in the UK Biobank field browser. Specifically, we removed several bottom-level specific groupings so that we have a moderate number of groups.
For standing height, for example, the source UK Biobank Field has the following "Primary Category of Origin" information:
UK Biobank Assessment Centre / Physical measures / Anthropometry / Body size measures.
Using this information, we classified our Standing height (GBE phenotype code: INI50) phenotype in the