Input File Formats
mrpeg requires several input files depending on the subcommand. This page
describes the expected format for each file type, including required columns,
data types, and concrete examples drawn from the bundled example data in
data/.
GWAS Summary Statistics
Used by: mrpeg peg, mrpeg signal
A tab-delimited text file (plain or gzip-compressed) containing per-SNP
effect-size estimates from a genome-wide association study. The column names
do not need to match the defaults shown below — use --gwas-cols to
specify the actual header names in your file.
mrpeg peg expects six columns (specified via --gwas-cols):
| Column | Type | Description |
|---|---|---|
| CHR | int | Autosomal chromosome number (1–22). Non-autosomal entries are dropped. |
| SNP | str | SNP identifier (e.g. |
| A1 | str | Effect allele (the allele whose effect |
| A0 | str | Non-effect (reference) allele. Together with |
| BETA | float | Estimated effect size for allele A1. |
| SE | float | Standard error of BETA. |
mrpeg signal expects five columns (the two allele columns are not
needed because no allele harmonization is performed):
| Column | Type | Description |
|---|---|---|
| CHR | int | Autosomal chromosome number (1–22). |
| SNP | str | SNP identifier. |
| BP | int | Genomic base-pair position on chromosome CHR. |
| BETA | float | Estimated effect size. |
| SE | float | Standard error of BETA. |
Example (data/example_gwas.tsv.gz, first three rows):
chrom snp pos a0 a1 beta se
1 rs6678318 1030633 A G 0.005895 0.003162
1 rs2785581 6177898 A G -0.004418 0.003162
1 rs10746477 8195724 T C 0.011452 0.003162
To use this file with mrpeg peg:
mrpeg peg --gwas example_gwas.tsv.gz \
--gwas-cols chrom snp a1 a0 beta se ...
To use with mrpeg signal (which needs a BP column instead of alleles):
mrpeg signal --gwas example_gwas.tsv.gz \
--gwas-cols chrom snp pos beta se ...
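Before running either subcommand, it can help to confirm that the header names passed to --gwas-cols actually exist in the file and that the numeric columns parse. The following is a minimal sketch using only the Python standard library; check_gwas is a hypothetical helper and not part of mrpeg, and the positive-SE check is an assumption about what a sensible file looks like rather than a documented mrpeg rule.

```python
import csv
import io

# First rows of the bundled example, inlined so the sketch is self-contained.
EXAMPLE_GWAS = (
    "chrom\tsnp\tpos\ta0\ta1\tbeta\tse\n"
    "1\trs6678318\t1030633\tA\tG\t0.005895\t0.003162\n"
    "1\trs2785581\t6177898\tA\tG\t-0.004418\t0.003162\n"
    "1\trs10746477\t8195724\tT\tC\t0.011452\t0.003162\n"
)


def check_gwas(handle, cols):
    """Hypothetical helper: check that `cols` exist and parse as expected.

    `cols` lists header names in the order CHR, SNP, A1, A0, BETA, SE.
    Returns the number of data rows that passed the checks.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in cols if c not in reader.fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    chrom, _snp, _a1, _a0, beta, se = cols
    n_rows = 0
    for row in reader:
        int(row[chrom])    # CHR must parse as an integer
        float(row[beta])   # BETA must parse as a float
        if float(row[se]) <= 0:  # assumption: SE should be strictly positive
            raise ValueError(f"non-positive SE in data row {n_rows + 1}")
        n_rows += 1
    return n_rows


n_checked = check_gwas(
    io.StringIO(EXAMPLE_GWAS),
    ["chrom", "snp", "a1", "a0", "beta", "se"],
)
print(n_checked)
```

For a gzip-compressed file on disk, the same function can be given a handle from gzip.open(path, "rt") instead of the in-memory string.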
eQTL Summary Statistics
Used by: mrpeg peg
A tab-delimited text file (plain or gzip-compressed) containing per-SNP, per-gene eQTL association results. Each row records the association between one SNP and one gene. Multiple rows can share the same SNP (one per gene) or the same gene (one per SNP).
mrpeg peg expects six columns (specified via --eqtl-cols):
| Column | Type | Description |
|---|---|---|
| CHR | int | Autosomal chromosome number (1–22). |
| SNP | str | SNP identifier. Must match the identifiers in the GWAS file so that the two datasets can be merged on SNP. |
| A1 | str | Effect allele in the eQTL study. Used together with |
| A0 | str | Non-effect allele in the eQTL study. |
| Z | float | Z-score (or t-statistic) of the eQTL association. If your eQTL results provide a beta and SE instead, compute Z = BETA / SE. |
| GENE | str | Gene identifier (symbol or ENSEMBL ID). Must match the row labels in the perturbation matrix. |
Example (data/example_eqtl.tsv.gz, first three rows):
chrom snp pos a0 a1 beta z gene
1 rs6678318 1030633 A G -0.161025 -2.295787 FASLG
1 rs2785581 6177898 A G -0.207129 -2.979178 TAGLN2
1 rs10746477 8195724 T C 0.173091 2.472928 SOX13
To use this file:
mrpeg peg --eqtl example_eqtl.tsv.gz \
--eqtl-cols chrom snp a1 a0 z gene ...
Note
The pos and beta columns in the example eQTL file are informational
and are not consumed by mrpeg — only the six columns specified via
--eqtl-cols are read.
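When an eQTL file reports an effect size and its standard error rather than a Z-score, the Z column can be derived as the ratio of the two. The sketch below illustrates this on in-memory records; add_z, the record keys, and the SNP and gene names are all hypothetical and chosen only for easy arithmetic.

```python
def add_z(rows, beta_key="beta", se_key="se"):
    """Return copies of `rows` with a 'z' field computed as beta / se."""
    out = []
    for row in rows:
        se = float(row[se_key])
        if se <= 0:
            raise ValueError("standard error must be positive")
        out.append({**row, "z": float(row[beta_key]) / se})
    return out


# Hypothetical records, not taken from the bundled example data.
records = [
    {"snp": "rs_example_1", "gene": "GENE_A", "beta": 0.5, "se": 0.25},
    {"snp": "rs_example_2", "gene": "GENE_B", "beta": -0.75, "se": 0.25},
]
with_z = add_z(records)
print([r["z"] for r in with_z])  # [2.0, -3.0]
```

The resulting column can then be written out and named in --eqtl-cols in the Z position.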
Perturbation Effect Matrix
Used by: mrpeg peg
A tab-delimited text file (plain or gzip-compressed) encoding the gene-to-gene effect matrix from a Perturb-seq (or similar CRISPR-based perturbational screen).
Rows: perturbed (upstream) genes. The first column is a header (any name, e.g. perturbations) followed by one gene identifier per row. These identifiers must match the GENE column in the eQTL file.
Columns: downstream (target / focal) genes. Each column header is a gene identifier. mrpeg tests each of these genes as a potential mediator of the GWAS trait.
Values: continuous effect-size estimates (floats). Missing values should be encoded as NaN or left empty; they are handled during preprocessing.
Example (data/example_perturbation.tsv.gz, first three rows):
perturbations GENE_ABC
FASLG -0.051150
TAGLN2 -1.046032
SOX13 -1.182132
In this example there are 400 perturbed genes (rows) and 1 downstream gene
(GENE_ABC, column). A real-world dataset will typically have hundreds to
thousands of both.
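To make the layout concrete, the following sketch parses a matrix in this shape with the Python standard library, treating empty cells and the literal NaN as missing. read_matrix is a hypothetical helper written for illustration; it is not how mrpeg itself loads the file.

```python
import csv
import io
import math

# First rows of the bundled example, inlined so the sketch is self-contained.
EXAMPLE_MATRIX = (
    "perturbations\tGENE_ABC\n"
    "FASLG\t-0.051150\n"
    "TAGLN2\t-1.046032\n"
    "SOX13\t-1.182132\n"
)


def read_matrix(handle):
    """Return (upstream genes, downstream genes, values as a list of rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    downstream = header[1:]  # every column after the first is a downstream gene
    upstream, values = [], []
    for row in reader:
        upstream.append(row[0])  # first field is the perturbed gene
        # Empty cells become NaN; float("NaN") also yields NaN.
        values.append([float(x) if x.strip() else math.nan for x in row[1:]])
    return upstream, downstream, values


upstream, downstream, values = read_matrix(io.StringIO(EXAMPLE_MATRIX))
print(upstream, downstream, values[0])
```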
Note
The --top-signal parameter (default 0.01) keeps only the top fraction
of perturbation effects by absolute value. For example, with a 500 × 200
matrix and --top-signal 0.01, only the strongest 1 % of the 100 000
entries (1 000 values) are retained; the rest are set to zero before
inference. Downstream genes that end up with fewer than --min-snps
(default 10) non-zero instrument SNPs are skipped entirely.
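The thresholding described in this note can be illustrated on a small matrix. This is a sketch only: how mrpeg rounds the number of retained entries or breaks ties is not stated here, so rounding to the nearest integer and keeping all values tied with the cutoff are assumptions, and keep_top_fraction is a hypothetical name. Missing values are assumed to have been handled already.

```python
def keep_top_fraction(matrix, fraction):
    """Zero every entry except the top `fraction` by absolute value."""
    magnitudes = sorted((abs(v) for row in matrix for v in row), reverse=True)
    # Assumption: round to the nearest whole entry and keep at least one.
    n_keep = max(1, round(len(magnitudes) * fraction))
    cutoff = magnitudes[n_keep - 1]
    # Assumption: entries tied with the cutoff are all kept.
    return [[v if abs(v) >= cutoff else 0.0 for v in row] for row in matrix]


# Entry count for the 500 x 200 example with a fraction of 0.01.
n_keep_doc = round(500 * 200 * 0.01)
print(n_keep_doc)  # 1000

# Hypothetical 2 x 5 matrix; a fraction of 0.2 keeps 2 of the 10 entries.
effects = [
    [0.1, -2.0, 0.3, 0.0, 0.5],
    [1.5, -0.2, 0.05, -0.4, 0.6],
]
sparse = keep_top_fraction(effects, 0.2)
print(sparse)
```

In the small example only -2.0 and 1.5 survive, which shows that the selection is made across the whole matrix rather than per row or per column in this sketch.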
Gene Annotation / Reference File
Used by: mrpeg signal
A tab-delimited file (plain or gzip-compressed) that maps genes or genomic annotations to chromosomal coordinates.
mrpeg signal expects four columns (specified via --ref-cols)
that define annotation regions:
| Column | Type | Description |
|---|---|---|
| CHR | int | Chromosome number (1–22). |
| START | int | Start position of the annotation region. If |
| END | int | End position of the annotation region. |
| ANNO | str | Annotation name. Used as the row identifier in the output |
Example (data/ref_gene_info.tsv.gz, first three rows):
ID2 NAME CHR TSS TES P_MID_FLANK0 P_MID_FLANK1
ENSG00000187634 SAMD11 1 859307 879961 369634 1369634
ENSG00000188976 NOC2L 1 879583 894688 387135 1387135
ENSG00000187961 KLHL17 1 895963 901099 398531 1398531
This file contains more columns than mrpeg needs. Use --ref-cols to
select the relevant four:
mrpeg signal --ref ref_gene_info.tsv.gz \
--ref-cols CHR P_MID_FLANK0 P_MID_FLANK1 ID2 ...
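The column selection above amounts to picking four named fields per row, in the order CHR, START, END, ANNO. A minimal sketch of that mapping, using the first rows of the bundled example, is shown here; select_regions is a hypothetical helper, and the check that START does not exceed END is an assumption rather than documented mrpeg behavior.

```python
import csv
import io

# First rows of the bundled example, inlined so the sketch is self-contained.
EXAMPLE_REF = (
    "ID2\tNAME\tCHR\tTSS\tTES\tP_MID_FLANK0\tP_MID_FLANK1\n"
    "ENSG00000187634\tSAMD11\t1\t859307\t879961\t369634\t1369634\n"
    "ENSG00000188976\tNOC2L\t1\t879583\t894688\t387135\t1387135\n"
    "ENSG00000187961\tKLHL17\t1\t895963\t901099\t398531\t1398531\n"
)


def select_regions(handle, ref_cols):
    """Return (chr, start, end, anno) tuples for the four named columns."""
    chrom, start, end, anno = ref_cols
    regions = []
    for row in csv.DictReader(handle, delimiter="\t"):
        region = (int(row[chrom]), int(row[start]), int(row[end]), row[anno])
        if region[1] > region[2]:  # assumption: START should not exceed END
            raise ValueError(f"START > END for {region[3]}")
        regions.append(region)
    return regions


regions = select_regions(
    io.StringIO(EXAMPLE_REF),
    ["CHR", "P_MID_FLANK0", "P_MID_FLANK1", "ID2"],
)
print(regions[0])  # (1, 369634, 1369634, 'ENSG00000187634')
```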
Reference Genotype Files (PLINK format)
Used by: mrpeg peg (via --ref-geno)
mrpeg computes the LD (linkage disequilibrium) matrix from a set of PLINK binary files. You need one triplet of files per autosomal chromosome:
| Extension | Description |
|---|---|
| .bed | Binary genotype matrix (individuals × SNPs). |
| .bim | SNP information: chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2. The |
| .fam | Individual (sample) information. The number of rows determines the LD sample size. |
Provide the file path with a * (or \*) as a placeholder for the
chromosome number:
mrpeg peg --ref-geno plink/geno_chr\* ...
mrpeg will expand the wildcard for chromosomes 1–22 and read each triplet
independently. The example data ships with 22 triplets in data/plink/,
generated from 1000 Genomes EUR samples.
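To see which files a given pattern implies, the wildcard expansion can be reproduced in a few lines. This is an illustrative sketch, not mrpeg's own implementation: expand_ref_geno is a hypothetical name, and the .bed, .bim, and .fam suffixes are the standard PLINK binary fileset extensions.

```python
def expand_ref_geno(pattern, chromosomes=range(1, 23)):
    """Map each chromosome to the (.bed, .bim, .fam) paths implied by `pattern`."""
    if pattern.count("*") != 1:
        raise ValueError("pattern must contain exactly one '*' placeholder")
    triplets = {}
    for chrom in chromosomes:
        prefix = pattern.replace("*", str(chrom))
        triplets[chrom] = tuple(prefix + ext for ext in (".bed", ".bim", ".fam"))
    return triplets


triplets = expand_ref_geno("plink/geno_chr*")
print(len(triplets))  # 22
print(triplets[1])    # ('plink/geno_chr1.bed', 'plink/geno_chr1.bim', 'plink/geno_chr1.fam')
```

Passing each path to os.path.exists is a quick way to spot a missing chromosome before starting a run.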
Note
Only the SNPs that appear in both the GWAS and eQTL files are extracted from the PLINK files. The remaining SNPs are ignored.
Keep File (optional)
Used by: mrpeg signal (--keep)
A single-column, tab-delimited text file with no header. Each row contains one gene or annotation name that you want to restrict the analysis to. Any genes or annotations in the reference file that are not present in the keep file are dropped before processing.
Example:
SAMD11
NOC2L
KLHL17
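The effect of a keep file can be sketched as a simple membership filter. In the example below, apply_keep is a hypothetical helper and GENE_X is a made-up name; the sketch assumes that names are compared against whichever column is passed in the ANNO position of --ref-cols, so the keep file should use the same kind of identifier as that column.

```python
def apply_keep(annotations, keep_lines):
    """Return the annotations whose names appear in the keep file lines."""
    # Blank lines and surrounding whitespace are ignored in this sketch.
    keep = {line.strip() for line in keep_lines if line.strip()}
    return [name for name in annotations if name in keep]


keep_text = "SAMD11\nNOC2L\nKLHL17\n"
# GENE_X is a hypothetical annotation that is absent from the keep file.
annotations = ["SAMD11", "GENE_X", "NOC2L", "KLHL17"]
kept = apply_keep(annotations, keep_text.splitlines())
print(kept)  # ['SAMD11', 'NOC2L', 'KLHL17']
```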