Algorithmic Arts Bio2MIDI Advanced DNA Sequence Marking

Marking DNA files for MIDI Conversion

1. Simple file conversion

Marking a coding sequence (cds) DNA file

Coding sequence (cds) DNA files contain the sequences corresponding to a translatable messenger RNA (mRNA). Cds files are usually identified as such in the DNA database records.

To ensure an accurate translation, you must mark the beginning and end of the protein-coding sequence, which do not necessarily coincide with the beginning and end of your DNA file. A cds DNA file usually contains a line that identifies where the coding sequence begins and ends. For example, in the Huntington's Disease DNA file included with Bio2MIDI (hunt_dna.txt), the line

CDS 316..9750

contains this information. The tilde (~) marker goes before base # 316 and after base # 9750. Each line of the DNA sequence carries reference numbers that let you locate these specific bases. A protein-coding sequence normally begins with the bases ATG and ends with one of the following triplets: TGA, TAG or TAA.
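The marking step and the ATG/stop-codon check can be sketched in a few lines of Python. This is only an illustration on a bare sequence string: the function names are hypothetical, and it does not deal with the reference numbers or header lines found in a real database file.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}

def mark_cds(seq, start, end, marker="~"):
    """Return seq with a marker before base `start` and after base `end`.

    Coordinates are 1-based and inclusive, as in a 'CDS 316..9750' line.
    """
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates fall outside the sequence")
    return seq[:start - 1] + marker + seq[start - 1:end] + marker + seq[end:]

def looks_like_cds(seq, start, end):
    """Check that the region is a whole number of codons, starts with ATG
    and ends with one of the stop triplets."""
    cds = seq[start - 1:end].upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS

# Toy sequence: two flanking bases, a 12-base coding region, two flanking bases.
toy = "ccATGGTGCACTAAgg"
marked = mark_cds(toy, 3, 14)
```

For the Huntington's Disease example, the span 316..9750 covers 9435 bases, which is exactly 3145 triplets, as expected for a coding sequence.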

Decide whether you want to play the DNA sequence itself or a translation of the sequence.

If you want to play only the DNA, then be sure that the DNA option is selected. This will result in a MIDI file composition that uses only 4 different notes for the A, C, T, G bases of the DNA sequence.
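As a rough illustration of what a four-note mapping means, the sketch below turns bases into MIDI note numbers. The pitch assignment is an arbitrary assumption made for this example; the text does not say which notes Bio2MIDI actually uses, and the names BASE_TO_NOTE and dna_to_notes are hypothetical.

```python
# Hypothetical pitch assignment (MIDI note numbers). Bio2MIDI's real
# mapping is not given here, so these values are placeholders.
BASE_TO_NOTE = {"A": 57, "C": 60, "G": 67, "T": 65}

def dna_to_notes(seq):
    """Map each A/C/G/T to a note number, skipping digits, spaces and
    any other characters that appear in the sequence lines."""
    return [BASE_TO_NOTE[b] for b in seq.upper() if b in BASE_TO_NOTE]

notes = dna_to_notes("atg c1")
```

However long the sequence is, the resulting list can contain at most four distinct values.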

If you want to play the protein encoded by a DNA sequence, then select the Protein option. This will result in a MIDI file composition that uses 20 different notes, for the 20 amino acids that make up the protein "alphabet" of the translated (bases into amino acids) sequence.
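The translation of bases into amino acids can be sketched with the standard genetic code. This is a minimal illustration, not Bio2MIDI's own implementation; the table is built from the usual one-letter amino acid string in T, C, A, G codon order, and the function name translate is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a coding sequence triplet by triplet, stopping at the
    first stop codon ('*'). Codons containing other letters raise KeyError."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

protein = translate("ATGGTGCACTAA")
```

The table contains 20 distinct amino acid letters plus the stop symbol, which is where the 20 notes of Protein mode come from.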

2. Advanced file conversion

Marking a DNA file that includes introns and exons

1. Exons and Introns.

In most organisms other than bacteria and some viruses, genes do not consist of continuous coding sequences. Instead, a gene is a very long sequence composed of exons, which contain the coding information, and introns, which are noncoding sequences interspersed among the exons. For example, the structure of the beta-globin gene is:

Exon 1----Intron 1----Exon 2----Intron 2----Exon 3

In a cell, the genetic information is processed so that protein is synthesized from a message that contains only the information of the exons. Bio2MIDI includes a feature that plays back a continuous translation of DNA files in which the boundaries of the exons have been marked.

2. Beta-globin: a sample gene with exons and introns.

For demonstration purposes, the file beta_dna.txt has been included. It does not contain the whole sequence: it holds some general information about the full gene sequence, together with only the last 13308 bases of the 73308-base sequence of the human beta-globin region. The information necessary to create a translatable sequence is in the line:

CDS join(62187..62278,62409..62631,63482..63610)
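The exon boundaries in such a line can be pulled out with a short regular expression. The sketch below is an assumption-laden helper (the name parse_cds_location is hypothetical) that handles only simple start..end ranges, not reverse-strand or single-base locations.

```python
import re

def parse_cds_location(line):
    """Return a list of (start, end) pairs found in a 'CDS ...' line.

    Only plain 'start..end' ranges are recognized.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", line)]

exons = parse_cds_location("CDS join(62187..62278,62409..62631,63482..63610)")
# Coordinates are inclusive, so each exon has end - start + 1 bases.
lengths = [end - start + 1 for start, end in exons]
total = sum(lengths)
```

Here the three exons are 92, 223 and 129 bases long. Their total of 444 bases is a whole number of triplets (148), which is a useful sanity check before marking the file.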

3. Marking the exon boundaries.

You would mark the sequence for translation as follows.

Mark the beginning of the protein coding sequence (beginning of the first exon):

: before base # 62187

...and mark the end of the first exon (beginning of the first intron):

; after base # 62278

Then mark the beginning of the second exon (end of the first intron):

: before base # 62409

...and the end of the second exon (beginning of the second intron):

; after base # 62631

Then mark the third exon:

: before base # 63482

; after base # 63610
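The same marking can be sketched programmatically. The code below is an illustration on a bare sequence string, with hypothetical function names; the first_base parameter is an assumption that covers files such as beta_dna.txt, where the sequence does not start at base # 1 (check the reference numbers in the file for the actual first base).

```python
import re

def mark_exons(seq, exons, first_base=1):
    """Insert ':' before each exon start and ';' after each exon end.

    exons: (start, end) pairs, 1-based and inclusive, in database numbering.
    first_base: database number of the first base present in `seq`.
    """
    marked = seq
    # Work from the last exon backwards so earlier string offsets stay valid.
    for start, end in sorted(exons, reverse=True):
        lo = start - first_base        # 0-based index of the exon's first base
        hi = end - first_base + 1      # 0-based index one past its last base
        if not (0 <= lo < hi <= len(seq)):
            raise ValueError(f"exon {start}..{end} falls outside the sequence")
        marked = marked[:lo] + ":" + marked[lo:hi] + ";" + marked[hi:]
    return marked

def joined_exons(marked):
    """Concatenate everything found between ':' and ';' markers."""
    return "".join(re.findall(r":([^:;]*);", marked))

# Toy gene numbered from base # 101: three exons (upper case), two introns.
toy = "ggATGGTGaaaaCACTTTccTAAgg"
marked = mark_exons(toy, [(103, 108), (113, 118), (121, 123)], first_base=101)
coding = joined_exons(marked)
```

Reading the marked text back with joined_exons gives the continuous coding sequence, with the introns and flanking bases left out.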

If you have selected the DNA option, Bio2MIDI will produce a MIDI file corresponding to the DNA coding sequence.

If you have selected the Protein option, it will produce a MIDI file corresponding to the translated protein.

Full DNA sequences should be played only in DNA mode, i.e. with the DNA option selected. With the Protein option selected, Bio2MIDI will dutifully translate the bases, but the translation of such a sequence is not biologically meaningful.
