  1. Overview
  2. Instructions for use
  3. Requirement
  4. Usage
  5. Options
  6. Input files
  7. Default output files
  8. Optional output files
  9. Colour code
  10. Explanation of output files
  11. Contact 
  12. Test dataset

1. Overview


ABACAS is intended to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun assembled contigs based on a reference sequence. It uses MUMmer to find alignment positions and identify syntenies of assembly contigs against the reference. The output is then processed to generate a pseudomolecule, taking overlapping contigs and gaps into account. MUMmer's alignment generating programs, Nucmer and Promer, are used, followed by the 'delta-filter' utility. Users can also run tblastx on contigs that are not used to generate the pseudomolecule.

If the blast search or consecutive MUMmer alignments result in mapping of extra contigs, finishers can use the Artemis Comparison Tool (ACT) to easily modify the ordering of contigs on the pseudomolecule by dragging and dropping contigs to the desired location. Overlapping contigs and gaps in the pseudomolecule are represented by "N"s. Overlapping contigs are often due to low quality contig ends and low complexity regions. ABACAS can automatically extract gaps on the pseudomolecule and generate primer oligos for gap closure using Primer3. Uniqueness of primer sets is checked by running a sensitive NUCmer alignment. If a quality file (contig_name.qual) exists in the working directory, it will be used during the primer design step.

2. Instructions for use

  1. Download ABACAS from the download page
  2. Run: perl abacas.pl -r <reference> -q <contigs> -p <nucmer|promer>
    NOTE: If ABACAS cannot find MUMmer on the default path, it will prompt the user to enter the location of MUMmer
  3. ABACAS may take several minutes to run for large genomes/chromosomes and will produce a number of different output files in the working directory
  4. Start ACT and load the sequence and comparison files as printed out by ABACAS.
  5. In ACT, load the contig names by going to 'File', <query>, 'Read an Entry', and select the file <query>_<reference>.tab
  6. You can also load the repeat plot for the reference which tells whether or not gaps are due to repetitive sequence. You can load this by going to 'Graph', '<reference>', 'Add User Plot', and select the file '<reference>.Repeats.plot'
  7. The file '<query>.bin' contains the names of the contigs that were not mapped or that mapped multiple times to the reference.

3. Requirement

ABACAS requires MUMmer to be installed in the working path for ordering and orienting of contigs. If MUMmer is not found in the working path, users will be asked to provide a valid path for MUMmer. The Artemis Comparison Tool (ACT) should be downloaded for visualizing the scaffolding of contigs. Primer design requires Primer3. Optionally, BLASTALL is required in order to run tblastx on the contigs that are not mapped using Nucmer or Promer.
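
Before a run it can be convenient to check that the external programs are reachable. The following is a minimal sketch, not part of ABACAS: the executable names (nucmer, promer, delta-filter, primer3_core, blastall) are assumptions about a typical installation and may differ on your system.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["nucmer", "promer", "delta-filter"]   # MUMmer programs
OPTIONAL_TOOLS = ["primer3_core", "blastall"]           # primer design, tblastx


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED_TOOLS), ("optional", OPTIONAL_TOOLS)):
        missing = missing_tools(tools)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

The `which` argument exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` searches the PATH.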

4. Usage

 

abacas.pl -r <reference file: single fasta> -q <query sequence file: fasta> -p <nucmer/promer> [Options]

    for contig ordering and primer design

OR
abacas.pl -r <reference file: single fasta> -q <pseudomolecule/ordered file: fasta> -e

    to escape contig ordering and go directly to primer design

OR

abacas.pl -h for help
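
For scripted runs, the two usage forms above can be assembled programmatically. This is a hedged sketch: the function name `abacas_command` and the file names in the usage note are hypothetical, and only flags described in this document are used.

```python
def abacas_command(reference, query, aligner=None, primer_only=False, extra=()):
    """Build the argument list for one of the two usage forms shown above."""
    cmd = ["perl", "abacas.pl", "-r", reference, "-q", query]
    if primer_only:
        # -e escapes contig ordering and goes directly to primer design.
        cmd.append("-e")
    else:
        if aligner not in ("nucmer", "promer"):
            raise ValueError("aligner must be 'nucmer' or 'promer'")
        cmd += ["-p", aligner]
    # Any further options (for example "-m" or "-b") are appended unchanged.
    return cmd + list(extra)
```

The resulting list can be passed to `subprocess.run(..., check=True)`, for example `abacas_command("reference.fasta", "contigs.fasta", "nucmer", extra=["-m", "-b"])`.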

5. Options

-d         use default NUCmer or PROmer parameters

            *ABACAS uses --maxmatch to increase sensitivity during mapping. The option -d could be useful when dealing with larger genomes or when a higher sensitivity is not required.

-m         print ordered and orientated contigs to file

            *This option is helpful if users want to further investigate the ordering using other alignment algorithms such as blast.

-b         print contigs in the bin file to multi-fasta file

            *contigs that are not used in generating the pseudomolecule will be placed in a '.bin' file. Since this file only contains contig names, the -b option could be used to print these contigs to a file for further analysis. Note that this option is required if users are interested in running a blast search.

-N         generate a pseudomolecule without 'N's

            *ABACAS produces a pseudomolecule ('.fasta' file) and fills gaps with 'N's. It also puts 100 'N's between overlapping contigs. This option will produce another pseudomolecule without padding (.NoNs.fasta).

-i         minimum percent identity (default 40)

            *minimum percent identity can vary from 0 to 100 depending on the closeness of the two genomes. Choosing a smaller value will pull in more contigs and vice versa

-v         minimum contig coverage (default 40)

            *set a value between 0 and 100

-V         minimum contig coverage difference (default 1)

            *Use -V 0 to place contigs randomly to one of the positions (in cases where a contig maps to multiple places)

-l         minimum contig length (default 100)

            *contigs below this cutoff will not be used

-t         run tblastx on contigs that are not used to generate the pseudomolecule

            *-t will run blastall on a fasta file of contigs in the .bin file. The option -b should be used to generate this file

-g         file_name

            * will print sequences of the reference that correspond with gaps on the pseudomolecule in a multi-fasta format

-a         append contigs in the .bin file to the end of the pseudomolecule

            *Contigs can then be easily manipulated and re-ordered using ACT's graphical interface

-o         prefix (string)

            *output files will have this prefix

-P         pick primer oligos to close gaps

-f         number of flanking bases (default 1000)

            *number of flanking bases on either side of a gap used for primer design

-R         avoid running mummer

-e         Escape contig ordering i.e. go to primer design

-c         Reference sequence is circular
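
Some of the constraints described in the option list can be checked before launching a run. The sketch below is illustrative only and is not how ABACAS validates its arguments: the function `check_options` and the dictionary layout are hypothetical, and the rules encoded are just those stated above (-i and -v between 0 and 100, -t relying on the file written by -b).

```python
def check_options(opts):
    """Return a list of problems found in a dict of option values.

    Keys are the option letters without the dash, e.g. {"i": 40, "t": True}.
    """
    problems = []
    for key, label in (("i", "percent identity"), ("v", "contig coverage")):
        value = opts.get(key, 40)  # the documented default for both is 40
        if not 0 <= value <= 100:
            problems.append(f"-{key} ({label}) must be between 0 and 100")
    # -t runs tblastx on the fasta file of bin contigs, which -b generates.
    if opts.get("t") and not opts.get("b"):
        problems.append("-t needs -b to write the bin contigs to a fasta file")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the run itself will succeed.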


6. Input files

Two fasta files containing the reference and query (contigs) sequences are required. The reference file should contain a single fasta sequence for speedy contig ordering and orientation.


7. Default output files

Running ABACAS with default options will generate the following files:

  1. Ordered and orientated sequence file (reference_query.fasta or prefix.fasta)

  2. Feature file (reference_query.tab or prefix.tab)

  3. Bin file that contains contigs that are not used in ordering (reference_query.bin or prefix.bin)

  4. Comparison file (reference_query.crunch or prefix.crunch)

  5. Gap information (reference_query.gaps or prefix.gaps)

  6. Information on contigs that have mapping information but could not be used in the ordering (unused_contigs.out)

  7. Feature file to view contigs with ambiguous mapping (reference.notMapped.contigs.tab).

  8. A file that shows how repetitive the reference genome is (reference.Repeats.plot).

Files 7 & 8 should be uploaded on the reference side of ACT view.
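
When the -o prefix option is used, the first five default output files share that prefix, so a finished run can be checked with a few lines of code. This is a minimal sketch under that assumption; the helper names are hypothetical and files 6 to 8 (which are not named after the prefix in the list above) are not covered.

```python
from pathlib import Path

# Suffixes of default output files 1-5 as listed above.
DEFAULT_SUFFIXES = [".fasta", ".tab", ".bin", ".crunch", ".gaps"]


def expected_outputs(prefix):
    """Return the prefixed default output file names."""
    return [f"{prefix}{suffix}" for suffix in DEFAULT_SUFFIXES]


def missing_outputs(prefix, directory="."):
    """Return the expected file names that do not exist in the directory."""
    base = Path(directory)
    return [name for name in expected_outputs(prefix) if not (base / name).exists()]
```

For example, `missing_outputs("run1")` lists which of run1.fasta, run1.tab, run1.bin, run1.crunch and run1.gaps are absent from the working directory.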

Please note that contigs in the '.fasta' file will be reverse complemented if they are found to map on the reverse strand. However, the ACT view shows the initial orientation of these contigs, i.e. they will be shown on the reverse strand. If you write a fasta file of the pseudomolecule from ACT, the resulting sequence will be a set of ordered contigs in their original orientation (the orientation will not change). It is therefore recommended to use the automatically generated '.fasta' pseudomolecule file for further investigation.
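
To make the orientation step concrete, reverse complementing a contig means reversing it and swapping each base for its complement. The snippet below is a generic sketch of that operation, not the code ABACAS uses; it handles A, C, G, T and N only and leaves any other character unchanged.

```python
# Translation table for the complement of each base (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

A contig exported from ACT in its original orientation could be reoriented this way, although using the generated '.fasta' file avoids the need.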


8. Optional output files

It is also possible to generate additional files including:

  1. A list of ordered and orientated contigs in a multi-fasta format (-m).

  2. A pseudomolecule with all unmapped contigs appended to the end for reordering (-a).

  3. A pseudomolecule where the gaps are not padded with N (-N).

  4. A multi-fasta file of all unmapped contigs (-b).

  5. A multi-fasta file of regions on the reference that correspond to gaps on the pseudomolecule (-g file_name).

  6. A list of sense and antisense primer sets in separate files.

  7. A list of locations where sense and antisense primers are found, in two separate files.

  8. A file of non-unique primers and primers near contig ends (primers to exclude).

  9. A standard primer3 output summary file with detailed information on oligos.


9. Colour code

The feature file (file 2 in the default output section) uses the following colour codes:

Dark blue (4): contigs with forward orientation

Dark green (3): contigs with reverse orientation

Sky blue (5): contigs that overlap with the next contig

Yellow (7): contigs that have no hit (only added to the pseudomolecule if '-a ' is used)
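
The colour codes can be used to summarise a feature file outside ACT. The sketch below assumes that each feature carries a qualifier of the form /colour=<code>, as is conventional for Artemis and ACT feature tables; that layout is an assumption and is not confirmed by this document.

```python
import re
from collections import Counter

# Meaning of each colour code, taken from the list above.
COLOUR_MEANING = {
    4: "forward orientation",
    3: "reverse orientation",
    5: "overlaps with the next contig",
    7: "no hit",
}

# Assumed qualifier layout: /colour=<integer code>
_COLOUR_RE = re.compile(r"/colour=(\d+)")


def count_colours(tab_text):
    """Count features per colour meaning in the text of a feature file."""
    counts = Counter(int(m.group(1)) for m in _COLOUR_RE.finditer(tab_text))
    return {COLOUR_MEANING.get(code, f"unknown code {code}"): n
            for code, n in counts.items()}
```

Passing the contents of the '.tab' file gives a quick tally of how many contigs were placed in each orientation.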


10. Explanation of output files

Comparison file (.crunch file)

72 100 1 23328 contig00198 1 16690 unknown NONE

100 100 29503 29782 contig00002 22865 23144 unknown NONE

100 100 29952 52948 contig00087 23314 46310 unknown NONE

100 100 52986 63111 contig00243 46348 56473 unknown NONE

100 94 63118 63576 contig00217 56480 56938 unknown NONE

100 100 63775 63932 contig00224 57137 57294 unknown NONE

100 100 63933 64216 contig00250 57295 57578 unknown NONE

The first seven columns of the comparison file represent coverage, percent identity, start on pseudomolecule, end on pseudomolecule, contig ID, start on reference and end on reference.
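
The column description above can be turned into a small parser. This is a sketch based only on the example lines shown here: it assumes whitespace-separated fields, reads the first seven columns, and ignores the remaining ones. The names `CrunchHit` and `parse_crunch` are hypothetical.

```python
from typing import NamedTuple


class CrunchHit(NamedTuple):
    coverage: float
    identity: float
    pseudo_start: int
    pseudo_end: int
    contig: str
    ref_start: int
    ref_end: int


def parse_crunch_line(line):
    """Parse the first seven whitespace-separated columns of one line."""
    f = line.split()
    return CrunchHit(float(f[0]), float(f[1]), int(f[2]), int(f[3]),
                     f[4], int(f[5]), int(f[6]))


def parse_crunch(text):
    """Parse all non-empty lines of a comparison file."""
    return [parse_crunch_line(line) for line in text.splitlines() if line.strip()]
```

With the hits parsed, contigs below a chosen identity can be listed with an expression such as `[h.contig for h in hits if h.identity < 95]`.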

Gap file (.gaps)

Gap 6174 23329 29502 16691 22864

Gap 6 63112 63117 56474 56479

Gap 198 63577 63774 56939 57136

Columns 2-6 represent gap size, start on pseudomolecule, end on pseudomolecule, start on reference and end on reference.
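
In the three example lines, the gap size equals end minus start plus one on the pseudomolecule (for instance 29502 - 23329 + 1 = 6174). The sketch below parses a gap line and checks that relationship; it assumes the six-column layout shown above, and the function names are hypothetical.

```python
def parse_gap_line(line):
    """Parse one line of a .gaps file into a dict of integers."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "Gap":
        raise ValueError(f"unexpected gap line: {line!r}")
    size, pseudo_start, pseudo_end, ref_start, ref_end = map(int, fields[1:])
    return {
        "size": size,
        "pseudo_start": pseudo_start,
        "pseudo_end": pseudo_end,
        "ref_start": ref_start,
        "ref_end": ref_end,
    }


def gap_is_consistent(gap):
    """Check that the size matches the pseudomolecule coordinates."""
    return gap["size"] == gap["pseudo_end"] - gap["pseudo_start"] + 1
```

A line that fails this check is worth inspecting by hand before designing primers around it.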

Bin file (.bin)

A list of contig names that could not be used in generating a pseudomolecule.

Fasta file (.fasta)

This is the pseudomolecule generated from ordered and orientated contigs. Overlapping contigs are separated by 100 'N's. Gaps are also represented by 'N's.
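
Because both gaps and overlaps appear as runs of 'N', their positions can be recovered directly from the pseudomolecule sequence. The following is a minimal sketch (the function name `n_runs` is hypothetical) that reports each run with 1-based inclusive coordinates. Note that a run of exactly 100 'N's may mark either an overlap or a genuine 100 bp gap; the sequence alone does not distinguish them.

```python
import re


def n_runs(sequence, min_length=1):
    """Return (start, end, length) for each run of N, 1-based inclusive."""
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            # match.start() is 0-based; match.end() is already the 1-based end.
            runs.append((match.start() + 1, match.end(), length))
    return runs
```

The sequence passed in should be the pseudomolecule with the fasta header and line breaks removed.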


11. Contact

Please email sa4 {at} sanger.ac.uk if you have any problems or comments.

12. Test dataset

We have provided a test dataset from Streptococcus suis which consists of a set of 454 contigs and the reference genome.
454 Contigs download
Reference download
