Full AF3-Style Data Pipeline (MSA + Templates) on macOS¶
This repository’s default MLX runner can operate on sequence-only inputs using fill_missing_fields() placeholders. For accuracy comparisons against AlphaFold Server, you typically want to run the real AlphaFold 3 data pipeline first: MSA search and template search, then featurisation.
This doc explains how to configure and run that path locally.
1) Install HMMER (Required)¶
AlphaFold 3’s data pipeline requires HMMER tools: - jackhmmer (protein MSA search) - hmmsearch and hmmbuild (template search)
On macOS, Homebrew is the simplest:
brew install hmmer
Verify:
command -v jackhmmer hmmsearch hmmbuild
jackhmmer -h | head -n 2
Optional (performance): build a jackhmmer with the --seq_limit patch:
./scripts/build_hmmer_macos.sh
If you use the build script, binaries install to ~/hmmer/bin by default.
2) Configure Database Paths (Required)¶
The full AF3 data pipeline needs large genetic databases and PDB template resources (hundreds of GB). See upstream AF3 docs/installation.md.
Recommended: set AF3_DB_DIR¶
If you downloaded databases using ./fetch_databases.sh <DB_DIR>, set:
export AF3_DB_DIR="<DB_DIR>"
The runner will look for these standard names inside AF3_DB_DIR: - uniref90_2022_05.fa - mgy_clusters_2022_05.fa - bfd-first_non_consensus_sequences.fasta - uniprot_all_2021_04.fa - pdb_seqres_2022_09_28.fasta - mmcif_files/ (directory containing PDB mmCIF files) - Some installs extract into pdb_2022_09_28_mmcif_files/mmcif_files/ instead.
Overrides (if your filenames differ)¶
Set explicit env vars (paths can be absolute or ~-expanded): - AF3_UNIREF90_DB - AF3_MGNIFY_DB - AF3_SMALL_BFD_DB - AF3_UNIPROT_DB - AF3_PDB_SEQRES_DB - AF3_PDB_MMCIF_DIR
Optional: - AF3_MAX_TEMPLATE_DATE (default: 2021-09-30)
3) Validate Configuration¶
Check binaries:
source .venv/bin/activate && PYTHONPATH=src python3 scripts/check_deps.py
Check full data pipeline configuration (binaries + databases):
source .venv/bin/activate && PYTHONPATH=src python3 scripts/validate_data_pipeline_paths.py
If anything is missing, the script prints exactly what was missing and where it looked (env vars, PATH, and db_dir-derived locations).
4) Run DeSI1 with the Full Pipeline¶
Monomer:
source .venv/bin/activate && PYTHONPATH=src python3 run_alphafold_mlx.py \
--input examples/desi1_monomer.json \
--output_dir /private/tmp/test_out/desi1_monomer_full_pipeline/ \
--num_samples 1 --diffusion_steps 50 --seed 42 \
--precision float16 \
--run_data_pipeline
Dimer (compare to AlphaFold Server reference):
source .venv/bin/activate && PYTHONPATH=src python3 run_alphafold_mlx.py \
--input examples/desi1_dimer.json \
--output_dir /private/tmp/test_out/desi1_dimer_full_pipeline/ \
--num_samples 1 --diffusion_steps 50 --seed 42 \
--precision float16 \
--run_data_pipeline
If you don’t want to use AF3_DB_DIR, you can pass:
--db_dir "<DB_DIR>"
5) Evaluate Baseline vs Full Pipeline¶
Run both conditions and print a summary table:
source .venv/bin/activate && PYTHONPATH=src python3 scripts/eval_desi1_pipeline_vs_placeholders.py --diffusion_steps 50 --seed 42 --precision float16
To skip the full pipeline (baseline only):
source .venv/bin/activate && PYTHONPATH=src python3 scripts/eval_desi1_pipeline_vs_placeholders.py --skip_full