Pan African Bioinformatics Network for H3Africa

H3ABioNet Tools

H3ABioNet Nodes have developed different computational tools ranging from visualization of population structure to specialized analysis of genome-wide nucleosome positions involved in genome function.

Genesis: PCA and Admixture plot viewer and publication quality image generator

Genesis can be used to display screen and publication quality pictures of population PCA and admixture charts. It has been developed by Wits Bioinformatics, Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg.

Why use Genesis?

Genesis takes the output of popular programs such as Admixture and EIGENSTRAT and produces good quality pictures, which the user can change interactively. There are first class tools that can create good quality pictures, but they require expertise to use and are best used when one already knows exactly what the output should look like. In practice there is a huge need for an interactive tool. Which PCs display interesting data can be explored interactively. Which colours are best to use is not just an aesthetic problem: a set of colours may work well with one data set but not with another, because the colours no longer contrast clearly once the objects being drawn are in new positions. There may be a need to rearrange the labelling or the data. We want to make the fonts as big as possible, but what is "as big as possible" depends on the quantity and arrangement of data. Often when displaying admixture charts, multiple charts are shown in one diagram; we need to keep the colours consistent and may want to play with the ordering of the data.

We see the need for an interactive tool that can be used to explore possibilities and produce good quality pictures. Although tools like Distruct and R are more flexible and produce very high quality pictures, Genesis is interactive and requires much less expertise to use.


Genesis requires Java 1.7 with SWT libraries installed. Genesis runs on Windows, Linux and Mac OS X. For Mac OS X, X11 must be installed. (Download XQuartz here)

On Windows and Linux, the program should be run as:

java -jar Genesis.jar

On Mac OS X, X11 must be installed and the program should be run as:

java -XstartOnFirstThread -jar Genesis.jar
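The two launch commands above differ only in one JVM flag, so a small wrapper can pick the right one. The following is a minimal sketch in Python; the function name genesis_command is an illustrative assumption and is not part of Genesis.

```python
import sys


def genesis_command(platform=None, jar="Genesis.jar"):
    """Build the argument list for launching Genesis on a given platform."""
    if platform is None:
        platform = sys.platform
    cmd = ["java"]
    if platform == "darwin":
        # The text above gives this extra JVM flag for Mac OS X.
        cmd.append("-XstartOnFirstThread")
    cmd += ["-jar", jar]
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(genesis_command()))
```

The returned list can be passed to subprocess.run if the wrapper should start Genesis itself.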

Some sample data files can be found here

Download: The executable can be obtained from here (Latest version: 0.2.5, 27 January 2015)

GitHub: https://github.com/shaze/genesis
From the command line: git clone https://github.com/shaze/genesis

Documentation: The manual is available as a pdf file

Admixture Mapping Tools

The admixture mapping tools include a suite of tools for use on multi-way admixed populations to overcome the limitation of existing tools, which tend to work best with 2- or 3-way admixed populations only. The admixture tools included in this project are:
1.    Tool for selecting the best proxy ancestral populations for an admixed population
2.    Tool for inferring local ancestry in admixed populations

The first tool is an important precursor for the second, as identifying the correct ancestral populations is crucial for accurately inferring local ancestry. A prototype for this first tool, PROXYANC, has already been developed in the group. PROXYANC has two novel algorithms: one based on the correlation between observed linkage disequilibrium in an admixed population and population genetic differentiation in ancestral populations, and one based on optimal quadratic programming over a linear combination of population genetic distances (FST). PROXYANC was evaluated against other methods, such as the f3 statistic, using a simulated 5-way admixed population as well as real data for the local South African Coloured (SAC) population, which is also 5-way admixed. The simulation results showed that PROXYANC was a significant improvement on existing methods for multi-way admixed populations.

For the second tool, we have evaluated some of the existing methods for inferring local ancestry (or locus-specific ancestry) and determining the date of admixture on multi-way admixed populations, including the SAC and simulated data. These methods include HapMix, ROLLOFF and the PCA-based method StepPCO for dating admixture, and WinPOP and LampLD for local ancestry. The three dating tools gave quite different predictions of the dates of admixture events, showing the lack of accuracy of existing methods and the need for a better one.


PROXYANC implements an approach to select the best proxy ancestral populations for admixed populations. It searches for the best combination of reference populations that can minimize the genetic distance between the admixed population and all possible synthetic populations, consisting of a linear combination from reference populations. PROXYANC also computes a proxy-ancestry score by regressing a statistic for LD (at short distance < 0.25 Morgan) between a pair of SNPs in the admixed population against a weighted ancestral allele frequency differentiation. Download PROXYANC.
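The idea of fitting a synthetic population to the admixed population can be illustrated with a deliberately simplified sketch. It is not the PROXYANC algorithm: it assumes only two reference populations per candidate combination, uses the squared difference in allele frequencies as a stand-in for the genetic distance, and solves the one-dimensional constrained least-squares problem in closed form instead of using quadratic programming. All function names are hypothetical.

```python
from itertools import combinations


def best_mixture(admixed, ref_a, ref_b):
    """Weight w in [0, 1] so that w*ref_a + (1-w)*ref_b is closest to admixed.

    Inputs are per-SNP allele frequencies; distance is the summed squared difference.
    """
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        w = 0.5  # identical references: every weight fits equally well
    else:
        w = sum((x - b) * d for x, b, d in zip(admixed, ref_b, diff)) / denom
        w = min(1.0, max(0.0, w))  # keep the weight a valid mixing proportion
    dist = sum((x - (w * a + (1 - w) * b)) ** 2
               for x, a, b in zip(admixed, ref_a, ref_b))
    return w, dist


def best_proxy_pair(admixed, references):
    """Try every pair of reference populations and keep the closest synthetic mix."""
    best = None
    for (name_a, fa), (name_b, fb) in combinations(references.items(), 2):
        w, dist = best_mixture(admixed, fa, fb)
        if best is None or dist < best[3]:
            best = (name_a, name_b, w, dist)
    return best
```

For example, with references A = [0.1, 0.2, 0.9], B = [0.9, 0.8, 0.1] and C = [0.5, 0.1, 0.4], an admixed population with frequencies [0.7, 0.65, 0.3] is recovered as a mix of A and B with a weight of about 0.25 on A.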

PROXYANC can select ancestry informative markers (AIMs) in two ways: based on the relationship between the observed local multi-locus linkage disequilibrium in a recently admixed population and the ancestral population difference in allele frequency, and based on kernel principal component analysis (kernel PCA), an extension of linear PCA.

PROXYANC can identify unusual differences in allele frequency between pairs of populations, as a possible signal of natural selection.

PROXYANC computes the expected maximum admixture LD from the proxy ancestral populations of the admixed population.

PROXYANC computes pairwise population FST (genetic distance).
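As an illustration of what a pairwise FST calculation involves, here is a minimal sketch using the classical heterozygosity-based definition, FST = (H_T - H_S) / H_T, combined across SNPs as a ratio of sums. The estimator that PROXYANC actually uses is not stated here, so this formula, the assumption of equally weighted populations, and the function name are all illustrative.

```python
def pairwise_fst(freqs_1, freqs_2):
    """Heterozygosity-based FST between two populations.

    freqs_1 and freqs_2 hold the frequency of the same allele at each SNP.
    The two populations are weighted equally (an assumption of this sketch).
    """
    num = 0.0
    den = 0.0
    for p1, p2 in zip(freqs_1, freqs_2):
        p_bar = (p1 + p2) / 2
        h_total = 2 * p_bar * (1 - p_bar)                        # pooled population
        h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population
        num += h_total - h_within
        den += h_total
    # If every SNP is monomorphic in both populations there is no differentiation.
    return num / den if den > 0 else 0.0
```

Two populations fixed for different alleles give a value of 1, and two populations with identical frequencies give 0.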



ancGWAS is an algebraic graph-based method to identify the most significant sub-network underlying ethnic differences in complex disease risk in a recently admixed population. This approach integrates the association signal from a GWAS data set, the local ancestry, and SNP pair-wise linkage disequilibrium from the admixed population into the protein-protein interaction (PPI) network.



ancMETA is an application for leveraging cross-population Gene/Sub-network Meta-analysis to recover Disease Association Signal (DAS) risk in a homogeneous or recently admixed population. This approach integrates the association signal from a GWAS data set, the local ancestry, and SNP pair-wise linkage disequilibrium into both the PPI and Protein-functional network.


Human Mutation Analysis (HUMA) web server

The Human Mutation Analysis (HUMA) web server has been developed as a freely available platform for the analysis of genetic variation in humans. HUMA provides an extensive database, populated from a myriad of different sources, and incorporates a number of tools to analyse and visualize the data. The HUMA database is populated with genes, transcripts, exons, proteins and protein structures, diseases and variations. All data has been linked to allow advanced search functionality. For example, searching for a protein will provide all related data, including the genes that code for it, the known SNPs and other variations within it, the diseases associated with it, and all the experimentally determined PDB structures.

The HUMA database has been created with the aim of analysing the effects of non-synonymous SNPs on protein stability and function. In order to do this, analysis tools needed to be incorporated into the web server. Firstly, a BLAST tool has been incorporated to allow users to search the HUMA database for homologous proteins. Secondly, a homology modelling pipeline has been included to allow users to model proteins with variations from the database included. In addition, tools that try to predict the effects of variations, such as Polyphen 2.0 and nsSNPAnalyzer, will also be made available via the web interface.

Collaboration features have been built into HUMA to facilitate the sharing of job results and analysis. HUMA also allows users to upload their own variation data to the server. These datasets are stored privately and users can choose to share them with other users and groups.
Once launched, the HUMA web server will be freely available at https://huma.rubi.ru.ac.za

Job Management System (JMS) for High Performance Computing and Cluster Management

Modern computing has enabled research that was previously considered unfeasible. Parallel algorithms have been developed to run over powerful multicore machines. For even more computing power, these machines can be aggregated together into large high performance computing (HPC) clusters. On these clusters, jobs can be spread out across a large number of nodes instead of being executed on a single machine. This can substantially decrease the time required to execute resource intensive modeling and simulation jobs – a common requirement in the field of biophysics. It is also useful when a large number of much smaller jobs need to be executed. Unfortunately, running jobs on a cluster involves a steep learning curve. Jobs must be submitted via software systems known as resource managers. These systems can usually only be run via the command line and require expertise that most researchers don't have.

To solve this problem, we have developed JMS, a web-based front-end to an HPC cluster. JMS allows users to run, manage and monitor jobs via a user-friendly web interface. It also lets users create new tools that can be pipelined together with existing tools to create complex computational workflows. These workflows can be saved, versioned and reused as needed. A detailed history of all jobs is stored and can be accessed and downloaded at any time. All tools, workflows and jobs can be shared with other users to create a highly collaborative work environment. In addition, tools and workflows can be made public via external interfaces. Although applicable to any field, JMS is currently being tailored toward structural bioinformatics with the introduction of tools and workflows for homology modelling, docking studies, and molecular dynamics.

JMS has been open-sourced and is freely available at https://github.com/RUBi-ZA/JMS.

JMS has been published in PLoS ONE:
David K Brown, David L Penkler, Thommas M Musyoka, and Özlem Tastan Bishop "JMS: An open source workflow management system and web-based cluster front-end for high performance computing"  PLoS ONE 10(8): e0134273, 2015. doi: 10.1371/journal.pone.0134273

Web-based Protein Interaction Network Visualizer (PINV)

Protein Interaction Network Visualizer is an open source, native web application that uses the latest generation of web technologies to offer an interactive view of protein-protein interactions which is easily accessible from any modern browser by researchers. PINV enables researchers to explore preloaded or their own data using different methods. The visualization can be manipulated to highlight the proteins or interactions of interest. The resulting graphic can be exported into common graphic formats so it is suitable for publication and sharing purposes. Researchers using PINV can access their visualizations from any computer with an internet connection and a modern browser without the need for installation of any third party software or plugins; moreover the nature of the Web simplifies tasks such as sharing and publishing visualizations.

The visualization runs completely on the client and was developed using the latest web technology (HTML5), so the whole network viewer is fully embedded in the browser. A popular visualization library, D3, has been used; it makes extensive use of HTML5 technologies to deliver simple and rich visualization. Two input files are processed and stored in a Solr server for ease of querying and quick responses. The members involved in the project fall into two groups: developers and biologists. The biologists select and curate the dataset and also constantly test the prototype versions of the application, reporting bugs and suggesting new features.

Reference: A web-based protein interaction network visualizer
Salazar G. A., Meintjes A., Mazandu G. K., Rapanoel H. A., Akinola R. O. and Mulder N. J.
BMC Bioinformatics 2014, 15:129 doi:10.1186/1471-2105-15-129



Recombination Detection Program (RDP)

RDP will, without any prior information on the names and numbers of ancestral populations, deconstruct chromosome-scale SNP/sequence datasets into any number of sub-datasets containing groups of aligned SNPs/nucleotides that share common ancestries. It will do this by applying a range of established heuristic recombination event detection and analysis tools that have previously been usable only in the study of virus genome-scale datasets.
RDP is a PC application that takes aligned nucleotide sequence data in any of the standard alignment formats (e.g. nexus, fasta, clustalw, paup, phylip). For the analysis of SNP data from multiple different individuals, the SNPs will need to be arranged in the order in which they occur on their respective chromosomes, and SNPs from different chromosomes should be analysed separately. SNPs must also be phased (i.e. all the SNPs in each input sequence must be derived from the same chromosome). Finally, the phased SNPs must be aligned so that missing data/indels in particular individuals are represented by a gap character ("-"). It will then be possible to load these SNP alignments directly into the program for analysis. Alternatively, if full sequences for either individual chromosomes or concatenated exome sequences on individual chromosomes are to be analysed, the sequences will need to be aligned (with a program such as Mauve) and saved in a format such as xmfa before the program will be able to load them. It will be possible to analyse up to 1000 chromosomes at a time with the tool.
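Before loading a SNP alignment, it can be worth checking the input requirements described above. The following minimal sketch reads a FASTA-format alignment and reports two problems: sequences of unequal length (missing data should have been padded with the gap character "-") and more than 1000 sequences. It is an illustrative helper, not part of RDP, and the function names are assumptions.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def check_alignment(records, max_sequences=1000):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(records) > max_sequences:
        problems.append("more than %d sequences" % max_sequences)
    lengths = set(len(seq) for seq in records.values())
    if len(lengths) > 1:
        problems.append("sequences differ in length; pad missing data with '-'")
    return problems
```

The sketch does not check phasing or SNP order, which cannot be verified from the alignment alone.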
RDP will output result files in three different formats:

  1. The output in “.rdp” file format will allow a user to examine the output in great depth using the program RDP4. RDP4 implements a wide range of heuristic and parametric recombination analysis, tree drawing and matrix tools and also provides a variety of data representation features.
  2. The output in “.csv” file format will allow a user to browse the results in any standard spreadsheet application (such as Microsoft Excel). It will detail the positions of recombination breakpoints, the identity of recombinant sequences, the identity of sequences resembling the parental sequences, and the degree of statistical support, determined with five different recombination detection methods, that the tracts of sequence between the identified breakpoints were derived through recombination.
  3. RDP will provide “distributed alignments” of SNPs/genome fragments derived from any number of user-specified ancestral populations, i.e. it will split the component nucleotides of input sequences up into different alignments based on the ancestral populations from which they were derived (note that although the recombination analysis carried out by the tool will not require specification of ancestral population numbers, it will be up to the user to specify how many alignments the program should split the data into).

The key methods implemented by RDP are five of the heuristic recombination detection methods implemented in the computer program RDP4: RDP, GENECONV, MAXCHI, CHIMAERA and 3SEQ.  The combined results of these methods will then be processed by the tool using the same algorithm implemented in RDP4 for the identification of recombinant sequences, the identification of sequences resembling parental sequences and the identification of recombination breakpoint locations. Crucially the RDP4 algorithm does not require any prior information on either the number of underlying populations or the identification of likely admixed individuals.  Also, by identifying sequences with shared recombinant histories the algorithm counts recombination events relatively accurately and can therefore detect evidence of recombination hot and cold-spots.

RDP applies a number of recombination detection and analysis methods. It runs well under Windows 95/98/NT/XP/VISTA/7 and may or may not run properly under Windows 8. RDP also runs well on most Windows emulators. For Mac users PlayOnMac is recommended and for Linux users Wine is recommended.

More information on RDP can be found at: http://web.cbio.uct.ac.za/~darren/rdp.html

Please cite: Martin DP, Murrell B, Golden M, Khoosal A, & Muhire B (2015) RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution 1: vev003. doi: 10.1093/ve/vev003

NUCPOS: Bioinformatics tool suite to analyse nucleosome positions in genomes

An insight into genome-wide nucleosome positions is required to understand the local regulation of genome function.  Bioinformatics tools to analyse nucleosome positions in genomes are limited. This paucity is addressed with NUCPOS, a suite that provides several utilities to analyse important aspects of nucleosomal organisation, including nucleosome density, the positioning strength of individual nucleosomes, the contribution of sequence to observed positions, and the average nucleosomal organization of specified genomic positions, such as pol II transcription start sites.

NUCPOS is available under the GNU General Public License (GPL-3). The C++11 source code may be downloaded from https://sourceforge.net/projects/nucpos/ and compiled with GCC (gcc.gnu.org) g++ version 4.7.3 or later.

Nucfrag: takes the SAM format (samtools.github.io/hts-specs/SAMv1.pdf) output file without headers generated by Bowtie2 (bowtie-bio.sourceforge.net) as input, and generates a series of data files of the number of nucleosomes centred at each base pair position of each chromosome. Nucfrag allows the selection of lower and upper fragment sizes to select subpopulations of nucleosomes, possibly with or without associated linker histone H1, from the bowtie2 output file.  Additional outputs include a file of the distribution of the aligned fragment sizes in the selected size range, and data files of virtual footprints, simulating genomic areas that would be protected from nuclease cleavage by nucleosomes.  The nucleosome position data files can be uploaded and viewed in a genome browser after addition of the appropriate BED or bigBed format headers (genome.ucsc.edu).
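The kind of processing described for Nucfrag can be sketched in a few lines of Python. This is not the NUCPOS source code: it assumes paired-end alignments, takes the fragment size from the SAM TLEN column of the leftmost mate, and places the nucleosome centre at the fragment midpoint. The function name and the default size limits are illustrative assumptions.

```python
from collections import defaultdict


def count_dyads(sam_lines, min_len=140, max_len=160):
    """Count fragment midpoints per chromosome from headerless SAM lines.

    Returns {chromosome: {position: number of fragments centred there}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and any header lines that remain
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        chrom = fields[2]      # RNAME
        pos = int(fields[3])   # POS, 1-based leftmost position
        tlen = int(fields[8])  # TLEN, positive for the leftmost mate
        if chrom == "*" or tlen <= 0:
            continue  # unaligned, or the rightmost mate of a pair
        if min_len <= tlen <= max_len:
            counts[chrom][pos + tlen // 2] += 1
    return counts
```

Changing min_len and max_len selects a different subpopulation of fragments, in the spirit of the lower and upper fragment sizes described above.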

Dyad_bins: is a program that takes the nucleosome position data files generated by Nucfrag as input, and performs a binning analysis, i.e., it counts the number of times that a specific number of nucleosomes are co-aligned in the genome.  The output can simply be visualized with a graphing program such as Gnuplot (www.gnuplot.info).  The co-alignment of many nucleosomes at a specific genomic position generally indicates a strongly positioned nucleosome, which may have functional relevance.  This analysis provides insight into the general nucleosome density and number of well-positioned nucleosomes in specified genomic regions (Fig. 1 A).
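The binning step can be illustrated with a short sketch. It assumes the nucleosome positions for one chromosome are already available as a mapping from base pair position to the number of nucleosomes centred there; the Nucfrag file format is not described here, so that input form and the function name are assumptions.

```python
from collections import Counter


def bin_coaligned(dyad_counts, start=None, end=None):
    """Count how many positions carry 1, 2, 3, ... co-aligned nucleosomes.

    dyad_counts maps position -> number of nucleosomes centred there.
    start and end optionally restrict the analysis to a genomic region.
    """
    selected = (n for pos, n in dyad_counts.items()
                if (start is None or pos >= start) and (end is None or pos <= end))
    return Counter(selected)
```

For the input {100: 1, 173: 3, 180: 1, 320: 3, 400: 7}, two positions carry one nucleosome, two carry three, and one carries seven. Writing the bins out as two columns gives a file that a graphing program can plot.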

Align_dyads: is a program that takes a text file with a list of genomic positions and the nucleosome position data files generated by Nucfrag as input, and generates a data file of superimposed nucleosome positions aligned at the specified genomic positions.  The user can select the number of nucleotides to include upstream and downstream of the listed positions.  The program is especially useful to gain insight into the average nucleosomal organization of transcription start sites, defined replication origins, and similar functional elements in a genome.  An output file is written that can be visualized in Gnuplot (Fig. 1 B).
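Superimposing nucleosome positions at a list of genomic positions amounts to summing the counts at each offset from every listed position. The sketch below makes simplifying assumptions: all positions are on one chromosome, strand orientation is ignored, and the input is a mapping from position to nucleosome count. The function name is hypothetical.

```python
def align_dyads(dyad_counts, anchors, upstream=500, downstream=500):
    """Sum nucleosome counts at each offset relative to the anchor positions.

    Returns {offset: summed count}, with offsets from -upstream to +downstream.
    """
    profile = {offset: 0 for offset in range(-upstream, downstream + 1)}
    for anchor in anchors:
        for offset in profile:
            # Positions without any nucleosome centre contribute zero.
            profile[offset] += dyad_counts.get(anchor + offset, 0)
    return profile
```

Dividing each value by the number of listed positions turns the summed profile into an average.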

The precise rotational and translational position of a nucleosome is determined by the DNA sequence accommodated by the nucleosome, as well as steric influences due to other proteins bound to the DNA. The contribution of intrinsic sequence effects is often useful to understand whether specific, functionally significant, well-positioned nucleosomes are precisely placed due to the inherent DNA sequence, or due to other features of the genome.  The utility hp_fft allows the user to quantitatively assess the contribution of dinucleotide periodicities to the anisotropic flexibility of nucleosomes positioned at identified genomic positions (Fig. 1 C).

hp_fft: performs the fast Fourier transform of the distribution of each of the 16 possible dinucleotides in a sliding 128 nt window, and provides the Fourier magnitude of the distribution of each dinucleotide at a periodicity of approximately 10 nt.  hp_fft takes as input the fractional occurrence of each dinucleotide at each sequence position in the sequence of interest.  The fractional distribution is generated with the utility dinucleotide_frequencies, which takes as input the FASTA format sequence file, which may contain multiple sequences representing nucleosomes positioned at specific genomic features. hp_fft requires the open source FFTW library (www.fftw.org).
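The calculation can be illustrated with a small pure-Python sketch. It is not the hp_fft implementation: it evaluates a single discrete Fourier transform bin directly instead of calling FFTW, and it assumes the input sequences all have the same length. In a 128 nt window the bin closest to a 10 nt period is bin 13, which corresponds to a period of about 9.85 nt. The function names are illustrative.

```python
import cmath
import math


def fractional_occurrence(sequences, dinucleotide):
    """Fraction of sequences that carry the dinucleotide at each position."""
    length = len(sequences[0]) - 1
    return [sum(1 for s in sequences if s[i:i + 2].upper() == dinucleotide)
            / len(sequences) for i in range(length)]


def magnitude_at_period(signal, period=10.0):
    """Magnitude of the DFT bin whose period is closest to the requested one."""
    n = len(signal)
    if n == 0:
        return 0.0
    k = round(n / period)  # for n = 128 and period = 10 this is bin 13
    total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
    return abs(total)


def sliding_magnitudes(signal, window=128, step=1, period=10.0):
    """Magnitude near the requested period for each window along the signal."""
    return [magnitude_at_period(signal[s:s + window], period)
            for s in range(0, len(signal) - window + 1, step)]
```

Running sliding_magnitudes once per dinucleotide on the output of fractional_occurrence gives one profile for each of the 16 dinucleotides.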

Fig. 1. (A) Distribution of co-aligned nucleosomes into bins. Nucleosomes present on coding regions (filled circles) and on non-coding regions (white circles) are shown. (B) The average nucleosomal occupancy of a polyadenylation site.  Note the nucleosome depleted region at position 0, and the strongly positioned nucleosome at position 150. (C) Fourier amplitude of a genomic region.  Dinucleotide distributions that could support well-positioned nucleosomes are present at positions 50 and the region 200-400. All images were generated from NUCPOS output files with Gnuplot.