@title("Silk Genome Read Format")
= Silk Genome Read Format
Silk Genome Read Format is a data format for describing genome read data in [silk.html Silk]. This format consists of three types of data objects: coordinate, reference and read:
* coordinate(group:utgb, system:scaffold, species:medaka, revision:version1.0, type:colorspace)
** specifies a coordinate system for the following data.
** The group value must be 'utgb' (predefined for the future extension)
** The system value is a coordinate system name: e.g., chromosome, scaffold, superconfig, contig, clone,
** The species value is e.g., human, mouse, medaka, etc.
** The revision value is for specifying sequence version (e.g, hg18, mm9, etc.).
** The type value must be {b|colorspace} or {b|letterspace}. If the type value is not present, letterspace is used as the default value.
* reference(name, start, end, strand)
** specifies a reference sequence to which the following read objects are aligned.
** The name value is, for example, scaffold1, chr1, etc.
** The start and end values are the position of the reference sequence on the coordinate system.
** reference sequence can be described in a multi-line format. '{b|">"}' symbol is an indicator for using the multi-line format. For example,
-sequence: A010132310011011111212333333333333333333333331212121
is the same with:
-sequence>
A01013231001101111121
233333333333333333333
3331212121
* read(name, start, end, strand, sequence, ...)|
** You can add arbitrary parameters to read objects
** '{b|"|"}' symbol indicates tab-separated data will follow for describing read objects.
** The order of the attributes is not significant. You can change the attribute order: for example, read(name, start, strand, QV, sequence).
== Indentation
* Tab characters cannot be used for the indentation in Silk format. Use space characters.
* Indentation of coordinate and reference objects must be level 0 (no space).
* Read objects belong to a reference sequence, so its indentation level is 1.
== Example
%silk(version:1.0)
# coordinate specifies the target sequence
-coordinate(group:utgb, system:scaffold, species:medaka, revision:version1.0, type:colorspace)
# specifies a reference sequence
-reference(name:scaffold1, start:1043, end:1200 ,strand:+)
-sequence>
A010132310011011111212333333333333333333333331212121
3213123012312031230123102320032123032102312312302130
21312031230132003110001320202020123012310
# read data aligned to the reference sequence
-read(name, view_start, view_end strand, sequence, QV*, _[json])|
seq1 1043 1054 + A010012113 [20.1, 20.5] {"memo":"seq1 data"}
seq2 1047 1059 + C01232103011 [24.5, 12.5, 34]
# start data for another reference
-reference(name:scaffold1034, start:0, strand:+)
-sequence>
C4323423434101323100110111112123333312111
21312031230132003110001320202020123012
-read(name, start, strand, sequence, QV*)|
seq3 0 + A010012113 [20.1, 21, 25]
seq4 0 + C01232103011 [24.5, 23, 35, 15]
== InDel
To specify indel positions to display, use:
* {b|() (open parentheses)} for minor insertions (which will be hidden in the track view)
* {b|[] (brackets)} for major insertions (which will be displayed in the track view)
* {b|- (hyphen)} for specifying gaps in the sequence. The gap symbol can be used in both the reference and read sequences.
%silk(version:1.0)
-coordinate(group:utgb, species:medaka, revision:version1.0, type:colorspace)
-reference(name:scaffold1, start:1043, strand:+)
-sequence>
A01013(31)011011111212333333333333333333333331212121
32131230123120312[301231]320032123032102312312302130
21312031230132003110001320202020123012310
-read(name, start, strand, sequence)|
seq1 1043 + A0-100--12113
seq2 1047 + C01232-103-011
== SOLiD Color Space Table
letter color
A 00 (0)
C 01 (1)
G 10 (2)
T 11 (3)
Take XOR of two bases:
seq code
AA 0
AC 1
AG 2
AT 3
CA 1
CC 0
CG 3
CT 2
GA 2
GC 3
GG 0
GT 1
TA 3
TC 2
TG 1
TT 0