Input File Formats

mrpeg requires several input files depending on the subcommand. This page describes the expected format for each file type, including required columns, data types, and concrete examples drawn from the bundled example data in data/.

GWAS Summary Statistics

Used by: mrpeg peg, mrpeg signal

A tab-delimited text file (plain or gzip-compressed) containing per-SNP effect-size estimates from a genome-wide association study. The column names do not need to match the defaults shown below — use --gwas-cols to specify the actual header names in your file.

mrpeg peg expects six columns (specified via --gwas-cols):

Column	Type	Description
CHR	int	Autosomal chromosome number (1–22). Non-autosomal entries are dropped.
SNP	str	SNP identifier (e.g. `rs6678318`). Used as the merge key with the eQTL file.
A1	str	Effect allele (the allele whose effect `BETA` measures).
A0	str	Non-effect (reference) allele. Together with `A1`, used for allele harmonization with the eQTL file.
BETA	float	Estimated effect size for allele `A1`.
SE	float	Standard error of `BETA`. Must be strictly positive; rows with SE ≤ 0 are removed automatically.
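
To preview which rows the two documented filters would remove (non-autosomal chromosomes and SE ≤ 0), a small standalone check can help. This is a minimal sketch, not mrpeg's own code: the function name `keep_gwas_row` and the SNP identifiers are hypothetical, and how mrpeg treats unparseable values is an assumption.

```python
def keep_gwas_row(chrom, se):
    """Return True if a row passes both documented filters.

    Assumption: values that cannot be parsed are treated as failing.
    """
    try:
        chrom_ok = 1 <= int(chrom) <= 22  # autosomes only
    except ValueError:
        chrom_ok = False
    try:
        se_ok = float(se) > 0  # SE must be strictly positive
    except ValueError:
        se_ok = False
    return chrom_ok and se_ok


# Hypothetical rows as (CHR, SNP, SE) strings, as read from a text file.
rows = [
    ("1", "rs_hypothetical_1", "0.003162"),
    ("X", "rs_hypothetical_2", "0.010000"),  # non-autosomal
    ("2", "rs_hypothetical_3", "0"),         # SE not strictly positive
]
kept = [row for row in rows if keep_gwas_row(row[0], row[2])]
print([row[1] for row in kept])
```

Running a check like this before calling mrpeg makes it easier to explain a drop in SNP count after loading.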

mrpeg signal expects five columns (the two allele columns are not needed because no allele harmonization is performed):

Column	Type	Description
CHR	int	Autosomal chromosome number (1–22).
SNP	str	SNP identifier.
BP	int	Genomic base-pair position on chromosome `CHR`.
BETA	float	Estimated effect size.
SE	float	Standard error of `BETA` (> 0).

Example (data/example_gwas.tsv.gz, first three rows):

chrom  snp          pos      a0  a1  beta      se
    rs6678318    1030633  A   G   0.005895  0.003162
    rs2785581    6177898  A   G  -0.004418  0.003162
    rs10746477   8195724  T   C   0.011452  0.003162

To use this file with mrpeg peg:

mrpeg peg --gwas example_gwas.tsv.gz \
  --gwas-cols chrom snp a1 a0 beta se ...

To use with mrpeg signal (which needs a BP column instead of alleles):

mrpeg signal --gwas example_gwas.tsv.gz \
  --gwas-cols chrom snp pos beta se ...
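
Because the header names are passed positionally, it can be worth checking them against the file header before running either subcommand. The sketch below assumes that the order of names follows the column tables above (CHR, SNP, A1, A0, BETA, SE for peg and CHR, SNP, BP, BETA, SE for signal), which matches the two commands shown. The helper `map_columns` is hypothetical and not part of mrpeg.

```python
# Assumed role order, taken from the column tables and example commands.
PEG_ROLES = ("CHR", "SNP", "A1", "A0", "BETA", "SE")
SIGNAL_ROLES = ("CHR", "SNP", "BP", "BETA", "SE")


def map_columns(header, user_cols, roles):
    """Pair each role with the header name supplied for it.

    Raises ValueError if the count is wrong or a name is not in the header.
    """
    if len(user_cols) != len(roles):
        raise ValueError(f"expected {len(roles)} column names, got {len(user_cols)}")
    missing = [name for name in user_cols if name not in header]
    if missing:
        raise ValueError(f"not found in file header: {missing}")
    return dict(zip(roles, user_cols))


# Header of the bundled example GWAS file, as shown above.
header = ["chrom", "snp", "pos", "a0", "a1", "beta", "se"]
peg_map = map_columns(header, ["chrom", "snp", "a1", "a0", "beta", "se"], PEG_ROLES)
signal_map = map_columns(header, ["chrom", "snp", "pos", "beta", "se"], SIGNAL_ROLES)
print(peg_map["A1"], signal_map["BP"])  # a1 pos
```

Note that the example file lists a0 before a1, so passing the names in file order would silently swap the effect and non-effect alleles.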

eQTL Summary Statistics

Used by: mrpeg peg

A tab-delimited text file (plain or gzip-compressed) containing per-SNP, per-gene eQTL association results. Each row records the association between one SNP and one gene. Multiple rows can share the same SNP (one per gene) or the same gene (one per SNP).

mrpeg peg expects six columns (specified via --eqtl-cols):

Column	Type	Description
CHR	int	Autosomal chromosome number (1–22).
SNP	str	SNP identifier. Must match the identifiers in the GWAS file so that the two datasets can be merged on SNP.
A1	str	Effect allele in the eQTL study. Used together with `A0` to detect and correct allele flips relative to the GWAS file.
A0	str	Non-effect allele in the eQTL study.
Z	float	Z-score (or t-statistic) of the eQTL association. If your eQTL results provide a beta and SE instead, compute `Z = BETA / SE` before running mrpeg.
GENE	str	Gene identifier (symbol or ENSEMBL ID). Must match the row labels in the perturbation matrix.
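
If the eQTL results carry a beta and a standard error but no Z column, the conversion `Z = BETA / SE` mentioned above is a one-liner per row. The following is a minimal sketch; the function names, the dictionary keys and the SNP identifier are illustrative assumptions, and the rows are treated as string dictionaries of the kind `csv.DictReader` produces.

```python
def z_from_beta_se(beta, se):
    """Compute Z = BETA / SE, rejecting non-positive standard errors."""
    if se <= 0:
        raise ValueError("SE must be strictly positive")
    return beta / se


def add_z(rows, beta_key="beta", se_key="se", z_key="z"):
    """Return copies of the rows with an added Z-score field."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row[z_key] = z_from_beta_se(float(row[beta_key]), float(row[se_key]))
        out.append(new_row)
    return out


# Hypothetical row; values are strings, as they would be when read from text.
rows = [{"snp": "rs_hypothetical_1", "gene": "GENE_ABC", "beta": "0.5", "se": "0.25"}]
print(add_z(rows)[0]["z"])  # 2.0
```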

Example (data/example_eqtl.tsv.gz, first three rows):

chrom  snp          pos      a0  a1  beta      z         gene
    rs6678318    1030633  A   G  -0.161025 -2.295787  FASLG
    rs2785581    6177898  A   G  -0.207129 -2.979178  TAGLN2
    rs10746477   8195724  T   C   0.173091  2.472928  SOX13

To use this file:

mrpeg peg --eqtl example_eqtl.tsv.gz \
  --eqtl-cols chrom snp a1 a0 z gene ...

Note

The pos and beta columns in the example eQTL file are informational and are not consumed by mrpeg — only the six columns specified via --eqtl-cols are read.
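
The allele comparison described in the column table above can be pictured with a short sketch. This illustrates a common harmonization convention and is not mrpeg's actual implementation: the function `harmonize_z` is hypothetical, and strand flips and palindromic (A/T, C/G) SNPs are deliberately not handled here.

```python
def harmonize_z(gwas_a1, gwas_a0, eqtl_a1, eqtl_a0, z):
    """Align an eQTL Z-score to the GWAS effect allele.

    Returns z unchanged if the allele pairs agree, -z if they are swapped,
    and None if the alleles cannot be reconciled (assumed to mean "drop").
    """
    gwas = (gwas_a1.upper(), gwas_a0.upper())
    eqtl = (eqtl_a1.upper(), eqtl_a0.upper())
    if eqtl == gwas:
        return z
    if eqtl == (gwas[1], gwas[0]):
        return -z  # effect allele is flipped relative to the GWAS
    return None


print(harmonize_z("G", "A", "G", "A", -2.3))  # -2.3
print(harmonize_z("G", "A", "A", "G", -2.3))  # 2.3
print(harmonize_z("G", "A", "C", "T", -2.3))  # None
```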

Perturbation Effect Matrix

Used by: mrpeg peg

A tab-delimited text file (plain or gzip-compressed) encoding the gene-to-gene effect matrix from a Perturb-seq (or similar CRISPR-based perturbational screen).

Rows: perturbed (upstream) genes. The first column has a header of any name (e.g. perturbations) and holds one gene identifier per row. These identifiers must match the GENE column in the eQTL file.
Columns: downstream (target / focal) genes. Each column header is a gene identifier. mrpeg tests each of these genes as a potential mediator of the GWAS trait.
Values: continuous effect-size estimates (floats). Missing values should be encoded as NaN or left empty; they are handled during preprocessing.

Example (data/example_perturbation.tsv.gz, first three rows):

perturbations  GENE_ABC
FASLG         -0.051150
TAGLN2        -1.046032
SOX13         -1.182132

In this example there are 400 perturbed genes (rows) and 1 downstream gene (GENE_ABC, column). A real-world dataset will typically have hundreds to thousands of both.

Note

The --top-signal parameter (default 0.01) keeps only the top fraction of perturbation effects by absolute value. For example, with a 500 × 200 matrix and --top-signal 0.01, only the strongest 1 % of the 100 000 entries (1 000 values) are retained; the rest are set to zero before inference. Downstream genes that end up with fewer than --min-snps (default 10) non-zero instrument SNPs are skipped entirely.
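
The thresholding described in this note can be sketched in a few lines. This is an illustration under stated assumptions rather than mrpeg's implementation: the fraction is applied across the whole matrix (as the 500 × 200 example implies), the number of retained entries is rounded to the nearest integer, ties at the cutoff are all kept, and NaN entries are treated as zero. The function name `keep_top_fraction` is hypothetical.

```python
import math


def keep_top_fraction(matrix, top_signal=0.01):
    """Zero out all but the largest-magnitude fraction of entries."""
    magnitudes = sorted(
        (abs(v) for row in matrix for v in row if not math.isnan(v)),
        reverse=True,
    )
    n_keep = round(len(magnitudes) * top_signal)
    if n_keep == 0:
        return [[0.0 for _ in row] for row in matrix]
    cutoff = magnitudes[n_keep - 1]  # smallest magnitude that is still kept
    return [
        [v if (not math.isnan(v) and abs(v) >= cutoff) else 0.0 for v in row]
        for row in matrix
    ]


# Toy 10 x 10 matrix with distinct magnitudes 1..100 and alternating signs.
matrix = [[(-1) ** (i + j) * float(i * 10 + j + 1) for j in range(10)] for i in range(10)]
filtered = keep_top_fraction(matrix, top_signal=0.1)
n_nonzero = sum(1 for row in filtered for v in row if v != 0.0)
print(n_nonzero)  # 10
```

With the default of 0.01, most entries of a typical matrix become zero, which is why a downstream gene can fall below the --min-snps threshold.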

Gene Annotation / Reference File

Used by: mrpeg signal

A tab-delimited file (plain or gzip-compressed) that maps genes or genomic annotations to chromosomal coordinates.

mrpeg signal expects four columns (specified via --ref-cols) that define annotation regions:

Column	Type	Description
CHR	int	Chromosome number (1–22).
START	int	Start position of the annotation region. If `START` and `END` refer to the same column (single base-pair annotations), set `--window` > 0 to expand around each position.
END	int	End position of the annotation region.
ANNO	str	Annotation name. Used as the row identifier in the output `.signal.tsv` file.

Example (data/ref_gene_info.tsv.gz, first three rows):

ID2              NAME    CHR  TSS     TES     P_MID_FLANK0  P_MID_FLANK1
ENSG00000187634  SAMD11  1    859307  879961  369634        1369634
ENSG00000188976  NOC2L   1    879583  894688  387135        1387135
ENSG00000187961  KLHL17  1    895963  901099  398531        1398531

This file contains more columns than mrpeg needs. Use --ref-cols to select the relevant four:

mrpeg signal --ref ref_gene_info.tsv.gz \
  --ref-cols CHR P_MID_FLANK0 P_MID_FLANK1 ID2 ...
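
To make the role of START, END and `--window` concrete, the sketch below selects the SNPs that fall inside one annotation region. It rests on assumptions that this page does not confirm: region bounds are treated as inclusive, and the window is assumed to extend the region by the same number of base pairs on both sides. The function name and the SNP records are hypothetical; the region coordinates are those of the first example row.

```python
def snps_in_region(snps, chrom, start, end, window=0):
    """Return IDs of SNPs on `chrom` with start - window <= BP <= end + window."""
    lower, upper = start - window, end + window
    return [snp_id for snp_id, snp_chrom, bp in snps if snp_chrom == chrom and lower <= bp <= upper]


# Hypothetical SNP records as (SNP, CHR, BP).
snps = [
    ("rs_hypothetical_a", 1, 400000),
    ("rs_hypothetical_b", 1, 1500000),
    ("rs_hypothetical_c", 2, 400000),
]

# Region given by two columns (P_MID_FLANK0, P_MID_FLANK1 of the first example row).
print(snps_in_region(snps, chrom=1, start=369634, end=1369634))

# Single-position annotation (START and END both set to TSS) expanded by a window.
print(snps_in_region(snps, chrom=1, start=859307, end=859307, window=500000))
```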

Reference Genotype Files (PLINK format)

Used by: mrpeg peg (via --ref-geno)

mrpeg computes the LD (linkage disequilibrium) matrix from a set of PLINK binary files. You need one triplet of files per autosomal chromosome:

Extension	Description
`.bed`	Binary genotype matrix (individuals × SNPs).
`.bim`	SNP information: chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2. The `SNP ID` column is used to match SNPs to the GWAS and eQTL files.
`.fam`	Individual (sample) information. The number of rows determines the LD sample size.

Provide the file path with a * as a placeholder for the chromosome number. Escape it as \* (or quote the argument) so that the shell passes it to mrpeg unexpanded:

mrpeg peg --ref-geno plink/geno_chr\* ...

mrpeg will expand the wildcard for chromosomes 1–22 and read each triplet independently. The example data ships with 22 triplets in data/plink/, generated from 1000 Genomes EUR samples.
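
Before a long run it can be useful to confirm that all 66 files are present. The sketch below mimics the expansion described above by substituting each chromosome number for the placeholder. The helper names are hypothetical and the substitution rule is an assumption about how mrpeg interprets the pattern.

```python
import os

PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def plink_triplets(pattern, chromosomes=range(1, 23)):
    """Map each chromosome to its (.bed, .bim, .fam) paths."""
    if "*" not in pattern:
        raise ValueError("pattern must contain '*' as the chromosome placeholder")
    return {
        chrom: tuple(pattern.replace("*", str(chrom)) + ext for ext in PLINK_EXTENSIONS)
        for chrom in chromosomes
    }


def missing_files(triplets):
    """List every expected path that does not exist on disk."""
    return [path for paths in triplets.values() for path in paths if not os.path.exists(path)]


triplets = plink_triplets("plink/geno_chr*")
print(triplets[22])
print(len(missing_files(triplets)), "of", 3 * len(triplets), "files missing")
```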

Note

Only the SNPs that appear in both the GWAS and eQTL files are extracted from the PLINK files. The remaining SNPs are ignored.

Keep File (optional)

Used by: mrpeg signal (--keep)

A single-column, tab-delimited text file with no header. Each row contains one gene or annotation name that you want to restrict the analysis to. Any genes or annotations in the reference file that are not present in the keep file are dropped before processing.

Example:

SAMD11
NOC2L
KLHL17
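
The effect of a keep file can be sketched as a simple set filter. This assumes exact, case-sensitive matching between the keep-file names and the annotation names, which this page does not specify. The function name `apply_keep` and the entry `HYPOTHETICAL_GENE` are illustrative only.

```python
def apply_keep(annotations, keep_lines):
    """Keep only annotations whose name appears in the keep-file lines."""
    keep = {line.strip() for line in keep_lines if line.strip()}
    return [name for name in annotations if name in keep]


# Contents of a keep file, one name per line, no header.
keep_lines = "SAMD11\nNOC2L\n".splitlines()

annotations = ["SAMD11", "NOC2L", "KLHL17", "HYPOTHETICAL_GENE"]
print(apply_keep(annotations, keep_lines))  # ['SAMD11', 'NOC2L']
```

If the annotation column chosen via --ref-cols holds ENSEMBL IDs rather than gene symbols, the keep file would presumably need to use the same identifiers.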