@title("Silk Genome Read Format")

= Silk Genome Read Format

Silk Genome Read Format is a data format for describing genome read data in [silk.html Silk]. It consists of three types of data objects: coordinate, reference, and read.

* coordinate(group:utgb, system:scaffold, species:medaka, revision:version1.0, type:colorspace)
** specifies the coordinate system for the data that follows.
** The group value must be 'utgb' (reserved for future extension).
** The system value is the name of a coordinate system, e.g., chromosome, scaffold, supercontig, contig, or clone.
** The species value is a species name, e.g., human, mouse, or medaka.
** The revision value specifies the sequence version (e.g., hg18, mm9).
** The type value must be {b|colorspace} or {b|letterspace}. If the type value is omitted, letterspace is used as the default.
* reference(name, start, end, strand)
** specifies a reference sequence to which the following read objects are aligned.
** The name value is, for example, scaffold1 or chr1.
** The start and end values give the position of the reference sequence in the coordinate system.
** A reference sequence can be written in a multi-line format. The '{b|">"}' symbol indicates that the multi-line format is used. For example,

 -sequence: A010132310011011111212333333333333333333333331212121

is equivalent to:

 -sequence>
A01013231001101111121
233333333333333333333
3331212121

* read(name, start, end, strand, sequence, ...)|
** You can add arbitrary parameters to read objects.
** The '{b|"|"}' symbol indicates that tab-separated data describing the read objects follows.
** The order of the attributes is not significant. You can change the attribute order: for example, read(name, start, strand, QV, sequence).

== Indentation

* Tab characters cannot be used for indentation in Silk format. Use space characters.
* Indentation of coordinate and reference objects must be level 0 (no space).
* Read objects belong to a reference sequence, so their indentation level is 1.

== Example

%silk(version:1.0)
# coordinate specifies the target sequence
-coordinate(group:utgb, system:scaffold, species:medaka, revision:version1.0, type:colorspace)
# specifies a reference sequence
-reference(name:scaffold1, start:1043, end:1200, strand:+)
 -sequence>
A010132310011011111212333333333333333333333331212121
3213123012312031230123102320032123032102312312302130
21312031230132003110001320202020123012310
# read data aligned to the reference sequence
 -read(name, view_start, view_end, strand, sequence, QV*, _[json])|
seq1	1043	1054	+	A010012113	[20.1, 20.5]	{"memo":"seq1 data"}
seq2	1047	1059	+	C01232103011	[24.5, 12.5, 34]
# start data for another reference
-reference(name:scaffold1034, start:0, strand:+)
 -sequence>
C4323423434101323100110111112123333312111
21312031230132003110001320202020123012
 -read(name, start, strand, sequence, QV*)|
seq3	0	+	A010012113	[20.1, 21, 25]
seq4	0	+	C01232103011	[24.5, 23, 35, 15]

== InDel

To specify indel positions to display, use:

* {b|() (parentheses)} for minor insertions (which will be hidden in the track view)
* {b|[] (brackets)} for major insertions (which will be displayed in the track view)
* {b|- (hyphen)} for specifying gaps in the sequence. The gap symbol can be used in both the reference and read sequences.

%silk(version:1.0)
-coordinate(group:utgb, species:medaka, revision:version1.0, type:colorspace)
-reference(name:scaffold1, start:1043, strand:+)
 -sequence>
A01013(31)011011111212333333333333333333333331212121
32131230123120312[301231]320032123032102312312302130
21312031230132003110001320202020123012310
 -read(name, start, strand, sequence)|
seq1	1043	+	A0-100--12113
seq2	1047	+	C01232-103-011

== SOLiD Color Space Table

Each base letter is assigned a 2-bit code:

letter  color
A       00 (0)
C       01 (1)
G       10 (2)
T       11 (3)

The color code of a pair of adjacent bases is the XOR of their 2-bit codes:

seq  code
AA   0
AC   1
AG   2
AT   3
CA   1
CC   0
CG   3
CT   2
GA   2
GC   3
GG   0
GT   1
TA   3
TC   2
TG   1
TT   0
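
The XOR rule in the color space table can be sketched in Python as follows. This is only an illustration, not part of the Silk format or of any UTGB tool: the function names are hypothetical, and the sketch assumes that a colorspace sequence is written as one leading base letter followed by one color digit per adjacent base pair, as the sequences in the examples above appear to be. Gap and insertion symbols are not handled.

```python
# Illustrative sketch of the XOR rule; all names here are hypothetical.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def color(first: str, second: str) -> int:
    """Return the color code (0-3) of two adjacent bases: XOR of their 2-bit codes."""
    return BASE_TO_CODE[first] ^ BASE_TO_CODE[second]


def encode_colorspace(bases: str) -> str:
    """Convert a letterspace sequence to colorspace.

    Assumed layout: the first base letter, then one color digit per adjacent pair.
    """
    if not bases:
        return ""
    colors = (str(color(a, b)) for a, b in zip(bases, bases[1:]))
    return bases[0] + "".join(colors)


def decode_colorspace(encoded: str) -> str:
    """Convert a colorspace sequence (leading base + color digits) back to letters."""
    if not encoded:
        return ""
    code = BASE_TO_CODE[encoded[0]]
    bases = [encoded[0]]
    for digit in encoded[1:]:
        # XOR is its own inverse, so applying the color to the previous base
        # recovers the next base.
        code ^= int(digit)
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


print(encode_colorspace("ACGT"))  # A131
print(decode_colorspace("A131"))  # ACGT
```

Because XOR is symmetric, AC and CA share the same color, which is why the table lists only four distinct codes for the sixteen base pairs.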