-
Notifications
You must be signed in to change notification settings - Fork 0
/
LRGschemadoc
435 lines (434 loc) · 12.4 KB
/
LRGschemadoc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
Documentation of the LRG XML schema version 1.8
The LRG schema is defined in
RelaxNG
format in the file "
LRG_1_8.rnc
". This RelaxNG file is used as the reference
schema when building new LRG XML files, and is also used to validate existing LRG files. The order, hierarchy and
content of the elements within the schema must be preserved in the LRG XML files. Elements not described in the
schema will not be valid in an LRG.
All published LRGs should validate successfully against this schema.
The LRG XML schema is divided into two main sections:
Fixed annotation
The fixed annotation is, as its name suggests, fixed and permanent. Once an LRG is created, it is intended that the fixed
annotation section remains unaltered for the lifetime of an LRG. If significant changes must be made, then a new LRG
will be created. The idea is that this section contains the core information about the LRG. Here is some info about the
key elements in the fixed annotation section:
•
id –
this is the identifier for the LRG, and should be in the format LRG_[number], e.g. LRG_1, LRG_24
•
hgnc_id –
this is the identifier for the corresponding HGNC ID. Usually a LRG is associated with a HGNC
entry
•
sequence_source
– the accession number of the RefSeqGene on which the LRG is based. The genomic
reference sequence of the LRG is identical to this RefSeqGene
•
organism –
all LRGs will be for human at first; this element allows for future expansion into other species
•
source –
this is an instance of the standard source element and contains source information for the sequence,
along with contact details
•
mol_type –
a description of the type of sequence e.g. "dna"
•
creation_date –
date of the creation of the first version of the LRG
•
sequence –
the agreed genomic DNA sequence for the LRG. By default this sequence will have 5000bp 5' and
2000bp 3' of the main transcript sequence. This will be extended if necessary to include all transcripts encoded
by the gene specified in
lrg_locus
or if requested by the community. It should contain no characters other than
[ACGT]
•
transcript –
the fixed annotation section will contain one or more transcripts, each with their own element and
child elements. Each transcript must have a name (e.g. "t1", "t2"), and a start and end coordinate. All
coordinates, unless otherwise stated, are relative to the LRG genomic sequence, and 1-indexed - i.e. start=1
would be the first basepair of the LRG genomic sequence
◦
comment –
Additional information about the transcript (optional)
◦
coordinates –
coordinates of the specified sequence relative to the LRG. The strand of the specified
feature is also included. The gene specified by
lrg_locus
is always transcribed by the positive strand,
specified as strand=1.
◦
cdna –
contains a sequence element as above that holds the unprocessed, spliced cDNA sequence (UTRs
included, introns removed)
◦
coding_region –
this element represents the coding region (or CDS) of the cDNA. The
coding_region
element has a child element which contains the translated protein sequence. Each transcript element can
have zero or more translated proteins. The
coding_region
element can also contain the optional child
element
translation_exception
which will contain any exceptions to the usual codon usage (for example
see LRG_417 - <translation_exception codon="523"><sequence>U</sequence></translation_exception>)
◦
exon –
there then follows one or more exon elements, between each of which will be an intron element.
The RelaxNG schema is designed to unambiguously define exons in a transcript. Each exon is defined by
start and end coordinates on the LRG, rather than any numbering system, which could differ between
laboratories. Exons have LRG specific labels. Each exon element can contain three sets of coordinates,
containing the coordinates relative to the LRG sequence, the cDNA sequence and the peptide sequence.
◦
intron –
these are found in between exon elements. They have only one attribute, phase, which can take a
value of 0, 1 or 2. This represents the location of the exon/intron boundary in terms of its location in the
three basepair codon, according to the following system:
▪
0
- intron falls between codons
▪
1
- intron falls between 1st and 2nd base of codon
▪
2
- intron falls between 2nd and 3rd base of codon
Updatable annotation
The updatable annotation contains one or more annotation set elements, each of which contains a set of annotation from
one source. Currently the updatable section of each LRG contains 3 annotation sections. One contains the mapping
information of the LRG genomic sequence to the current genome assembly, and then one each for Ensembl and NCBI
containing feature annotation and other elements.
•
source –
as above, but this should be the source of the annotation in this annotation_set
•
comment –
Additional information about the annotation set (optional)
•
modification_date –
obvious; this should be changed when the annotation is updated, either automatically or
by hand
•
mapping –
this element contains the mapping by sequence alignment of the LRG sequence to a specified
reference sequence, for example the current genome assembly. The reference sequence is specified by the
attributes
coord_system
,
other_name and other_id.
Multiple instances are allowed such that we can store
mappings to multiple reference sequences. The mapping element has the following attributes:
◦
coord_system
◦
other_name –
Usually the same as the
coord_system
for the transcript mappings
◦
other_id
◦
other_start
and
other_end
- these coordinates represent the most 5' and most 3' coordinates covered by
all of the contained
mapping_span
elements, i.e. the total span of the mapping
◦
mapping_span
–
each mapping span represents a largely contiguous area of alignment between the LRG
sequence and the reference. The mapping element can have one or more child mapping_span elements.
The reason for this is if the LRG maps into several large separate segments. The
mapping_span
element
has the following attributes:
▪
lrg_start
and
lrg_end
– LRG coordinates
▪
other_start
and
other_end
– Coordinates from an other system (e.g. GRCh37, Ensembl transcript,
RefSeqGene transcript)
▪
strand
▪
diff
–
Each
mapping_span
element can then contain one or more diff elements - this allows us to
precisely describe any single- or multiple-basepair differences between LRG and the reference, as
well as any insertions or deletions. The
diff
element has the following attributes:
•
type
- can be mismatch, lrg_ins (representing an insertion in the LRG) or genomic_ins
(representing an insertion in the reference, or to think of it another way, deletion in the LRG)
•
lrg_start
and
lrg_end
- describes the coordinates of the difference in LRG coordinates
•
other_start
and other_
end
- describes the coordinates of the difference on the other system (
e.g.
GRCh37, Ensembl transcript, RefSeqGene transcript)
•
lrg_sequence
- the sequence in the LRG - used for mismatches and lrg_ins instances
•
other_sequence
- the sequence of the difference in the other system (used for mismatches and
insertions)
•
fixed_transcript_annotation
–
this information contains updatable information relating to the transcripts
included in the fixed section, for example exon numbering
◦
other_exon_naming
– this element contains alternate exon naming for each exon within the transcripts
included in the fixed section.
▪
description –
Attribute
defining
the source name of the other exon naming.
▪
url
– URL where more information about the source of the other exon
naming
can be found
(optional).
▪
comment –
Additional information about the other exon naming (optional).
▪
exon
– This element contains several attributes:
•
coordinates –
Exon coordinates in LRG coordinate system
•
label –
Alternative exon naming
◦
alternate_amino_acid_numbering
–
this element contains alternate amino acid numbering for a protein
included in the fixed section
▪
description
– Attribute defining the source name of the alternative amino acid numbering.
▪
url
– URL where more information about the source of the alternative amino acid numbering can be
found (optional).
▪
comment
– Additional information about the alternative amino acid numbering (optional).
▪
align
– This element contains several attributes:
•
lrg_start
and
lrg_end –
Start and end of the protein alignment in protein LRG coordinate
system.
•
start
and
end
– Alternative coordinates in amino acid.
•
lrg_locus
– Main gene symbol associated with the LRG
•
features
–
this element contains the genomic annotations found in the region mapped to by the LRG. It is
distinct from the information found in the fixed section in that it may change, and may contain more genes and
transcripts than the fixed layer. These annotations should not beused as the reference point for further
annotations - only the features described in the fixed section should be used for this
◦
gene
- represents a gene, and has, along with attributes for coordinates and strand (all relative to the LRG
sequence), an attribute "symbol" - this should be the HGNC standard symbol for the gene. Gene, transcript
and protein_product elements can all contain optional elements from the following set:
▪
symbol -
Symbol of the gene (usually, this is an HGNC symbol)
•
synonym –
Other name(s) used to define the gene
▪
coordinates –
In LRG coordinate system
▪
partial
- indicates that the feature only overlaps the region mapped by the LRG partially, and can be
'5-prime' or '3-prime'. 3-prime indicates that the 3-prime end of the transcript lies outside of the LRG,
while 5-prime indicates the 5-prime end
▪
long_name
- a long name for the feature, e.g. collagen, type I, alpha 1
▪
db_xref
- an element representing an identifier for this feature in an external database. These
identifiers can currently only come from a limited set of sources: GeneID, HGNC, MIM, GI, RefSeq',
Ensembl, CCDS, UniProtKB
•
synonym –
Other name(s) liked to the external identifier
▪
transcript
- can be zero or more transcript elements per gene. It contains information about the
protein product if it codes for a protein.
•
coordinates –
In LRG coordinate system
•
long_name
- a long name for the transcript
•
db_xref
- an element representing an identifier for this feature in an external database. These
identifiers can currently only come from a limited set of sources: GeneID, HGNC, MIM, GI,
RefSeq', Ensembl, CCDS, UniProtKB
◦
synonym
– Other name(s) liked to the external identifier
•
exon –
List of the exon names in the source
◦
coordinates –
In LRG coordinate system
•
protein_product
- each transcript element can have zero or more protein_product child elements,
with attributes representing the coding start and end in LRG coordinates
◦
coordinates –
In LRG coordinate system
◦
long_name
- a long name for the protein
◦
db_xref
- an element representing an identifier for this feature in an external database. These
identifiers can currently only come from a limited set of sources: GeneID, HGNC, MIM, GI,
RefSeq', Ensembl, CCDS, UniProtKB
▪
synonym
– Other name(s) liked to the external identifier
Changes from LRG schema 1.7
Pure schema changes:
•
New attribute "
label
" in the "
exon
" tags (fixed section).
•
New optional tag "
hgnc_id
" in the tag "
fixed_annotation
"
•
Changes on the tag "
symbol
":
◦
Only one tag "
symbol
" is allowed per "
gene
"
◦
The symbol name has been moved from the content of the tag "
symbol
" to a new attribute "
name
".
◦
The tag "
synonym
" (0 to many) has been added as a child of the tag "
symbol
".
◦
Moved up the tag "
symbol
" at the top position in the tag "
gene
".
•
Change the pattern of the LRG transcript and protein coordinate systems (i.e. the attribute "
coord_system
" in
the tag "
coordinates
") by removing the character "
_
" between the LRG name and the transcript and protein
name, e.g.:
◦
LRG_1t1
instead of LRG_1_t1
◦
LRG_1p1
instead of LRG_1_p1
•
Added optional tags "
url
" and "
comment
" in both tags "
other_exon_naming
" and
"
alternate_amino_acid_numbering
" (optionals).
•
Remove the tag "
comment
" in the tag "
fixed_transcript_annotation
" (i.e. moved into "
other_exon_naming
"
and "
alternate_amino_acid_numbering
").
Other schema related changes:
•
The
other_exon_naming
and
alternate_amino_acid_numbering
data are now stored exclusively in the
"
Community
" annotation set. If there is no data for at least one of these 2 tags, there won't be a "
Community
"
annotation set for the LRG.