In RNA Seq, all mRNA transcripts are purified from other cellular components. The purified mRNAs are fragmented, or cut, into segments 20 base pairs in length (in this example), the fragments are individually sequenced, and the fragment sequences are mapped back onto the sequence of the entire gene.

The key is that these mRNA transcripts have been spliced (i.e., introns removed) and are ready for translation into proteins. Some or all of these transcripts have been alternatively spliced - that is, they retain different sets of introns. These alternatively spliced transcripts encode different protein products with different - though often related - functions.

Your mission, should you choose to accept it, is to take a set of sequenced fragments and map them back onto the initial gene. The purpose is to see how many different alternatively spliced variants there are, how many introns produce these alternatively spliced variants, and to obtain sequences for the alternatively spliced transcripts.
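For anyone who likes to see the fragmentation step concretely, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sequence, the intron coordinates, and the function names `splice` and `fragments` are invented, and a real experiment fragments transcripts at random rather than producing every possible window.

```python
def splice(gene, introns):
    """Return gene with the given half-open (start, end) intervals removed."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        pieces.append(gene[pos:start])  # keep the stretch before this intron
        pos = end                       # skip over the intron itself
    pieces.append(gene[pos:])           # keep whatever follows the last intron
    return "".join(pieces)


def fragments(transcript, k=20):
    """Return every k-base window of the transcript, left to right."""
    return [transcript[i:i + k] for i in range(len(transcript) - k + 1)]


# Made-up 32-base "gene" with one hypothetical intron at positions 8..11.
toy_gene = "AAAACCCCGGGGTTTTAAAACCCCGGGGTTTT"
toy_transcript = splice(toy_gene, [(8, 12)])
toy_reads = fragments(toy_transcript, k=20)
```

Any read that spans the removed interval no longer appears verbatim in `toy_gene`, which is exactly the signal the challenge asks you to look for.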
## A couple administrative notes:

1. I encourage collaboration (more experienced people, please help those who are out of their depth) and will clarify and guide as asked. I will not confirm correct solutions until everyone working on the problem has either reached a consensus or given up. I learned that lesson on the last challenge.
2. While possibly helpful and certainly instructive, computer programming is not necessary to the solution. It can be done by a human with paper, pencil, and some time and mental effort.
3. If there proves to be group interest in this problem, I intend to post a second part - inferring the abundance of the various alternatively spliced transcript variants from sequence data - and a third part dealing with homologues in RNA Seq data. If interest is there, we can extend the problem to subsets of real data from real RNA Seq experiments, and if interest is really there, we can play a bit with various probabilistic mapping techniques, which are a great deal of fun.
4. Lastly, there is no deadline for posting the solution. Everyone will have their chance; then the solution will be posted and, if there is interest, we'll go on to the other topics.
5. The bio challenge is back by popular request. You asked (some of you long ago, thank you for your patience) to be informed: @TranceNova @kma230 @Frostbite @Preetha @sasogeek . My apologies to any I forgot.
Known sequence of whole gene:

ATGAGAGGCG CTCGCGGCGC CTGGGATTTT CTCTGCGTTC TGCTCCTACT

These are the first fifty bases in the human cKIT gene, ENA|EU826594|EU826594.1. Spaces mark 10-base intervals and are included for ease of counting. They are not biologically significant.

## Fragment transcript sequences

CTCTGCGTTCTGCTCCTACT
GAGCGCCTGTCTGCGTTACT
CCTGGGATTTTCTCTGCGTT
GGGATTTTCTCTGCGTTACT
ATGAGAGCGCCTGTCTGCGT
ATGAGAGCGCCTGGGATTTT
ATGAGAGGCGCTCGCGGCGC
GTCTGCGTTCTGCTCCTACT
GAGAGCGCCTGTCTGCGTTA
GCGGCGCCTGTCTGCGTTCT
AGGCGCTCGCGGCGCCTGTC
CGGCGCCTGTCTGCGTTACT
CTCGCGGCGCCTGGGATTTT
ATGAGAGGCGCTCGCGGCGC
ATGAGAGGCGCTCGCGGCGC
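If you would rather start on a computer than on paper, a minimal Python sketch of a first pass might look like the one below. It only loads the data given above and asks which fragments occur verbatim in the known sequence. The names `GENE`, `READS`, and `exact_hits` are illustrative choices, not part of the challenge, and reads that do not match verbatim still need the real work of alignment.

```python
# The known sequence from the post, with the counting spaces removed.
GENE = "ATGAGAGGCGCTCGCGGCGCCTGGGATTTTCTCTGCGTTCTGCTCCTACT"

# The fragment sequences exactly as listed in the post (duplicates included).
READS = [
    "CTCTGCGTTCTGCTCCTACT",
    "GAGCGCCTGTCTGCGTTACT",
    "CCTGGGATTTTCTCTGCGTT",
    "GGGATTTTCTCTGCGTTACT",
    "ATGAGAGCGCCTGTCTGCGT",
    "ATGAGAGCGCCTGGGATTTT",
    "ATGAGAGGCGCTCGCGGCGC",
    "GTCTGCGTTCTGCTCCTACT",
    "GAGAGCGCCTGTCTGCGTTA",
    "GCGGCGCCTGTCTGCGTTCT",
    "AGGCGCTCGCGGCGCCTGTC",
    "CGGCGCCTGTCTGCGTTACT",
    "CTCGCGGCGCCTGGGATTTT",
    "ATGAGAGGCGCTCGCGGCGC",
    "ATGAGAGGCGCTCGCGGCGC",
]


def exact_hits(gene, reads):
    """Map each read found verbatim in gene to its 0-based start position."""
    hits = {}
    for read in reads:
        pos = gene.find(read)  # str.find returns -1 when there is no match
        if pos != -1:
            hits[read] = pos
    return hits


for read, pos in exact_hits(GENE, READS).items():
    print(pos, read)
```

Reads missing from the output are the interesting ones: they cannot be laid onto the known sequence without skipping some of its bases.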
So to be clear, this is what a solution will answer:

1. How many alternatively spliced transcript variants of the gene are there? What are their sequences?
2. How many introns are in the gene? What are their sequences?
3. Which introns are present in which transcript variants?
There have been a few inquiries, so I'll address some points that are causing a little pain:

1. The known sequence for the whole thing - ATGAGAGGCG CTCGCGGCGC CTGGGATTTT CTCTGCGTTC TGCTCCTACT - is an mRNA sequence! So the reads are NOT complementary to it. They are identical.
2. The problem is about sequence alignment. In cases like this, where you have the whole sequence and are looking for which bits of it are missing (i.e., introns), I strongly suggest you map each fragment back onto the whole sequence rather than try to map them onto each other.
3. Those of you with exams this week and college applications due, do all that first! This is fun over the holidays and very much secondary to the important stuff in your lives.
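Point 2, mapping each fragment back onto the whole sequence, can be sketched in Python as below. This is one possible approach under stated assumptions, not the intended solution method: it assumes reads are error-free, that a read differs from the known sequence only by skipped stretches, and that the placement using the fewest contiguous blocks is the one you want. The names `map_spliced` and `skipped` are hypothetical.

```python
from functools import lru_cache

GENE = "ATGAGAGGCGCTCGCGGCGCCTGGGATTTTCTCTGCGTTCTGCTCCTACT"


def map_spliced(read, gene):
    """Lay read onto gene, in order, using as few contiguous blocks as possible.

    Returns a list of half-open (start, end) gene intervals, or None when the
    read cannot be placed by skipping gene bases alone.
    """
    if not read:
        return []
    n, m = len(read), len(gene)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Fewest blocks covering read[i:], with read[i] sitting on gene[j].
        if read[i] != gene[j]:
            return None
        if i == n - 1:
            return ((j, j + 1),)
        options = []
        for k in range(j + 1, m):
            tail = best(i + 1, k)
            if tail is None:
                continue
            if k == j + 1:
                # Next base is adjacent: grow the current block leftwards.
                options.append(((j, tail[0][1]),) + tail[1:])
            else:
                # Gene bases j+1..k-1 are skipped: start a new block.
                options.append(((j, j + 1),) + tail)
        return min(options, key=len) if options else None

    candidates = [c for c in (best(0, j) for j in range(m)) if c is not None]
    return list(min(candidates, key=len)) if candidates else None


def skipped(blocks):
    """Return the gene intervals that lie between consecutive blocks."""
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(blocks, blocks[1:])]


blocks = map_spliced("ATGAGAGCGCCTGGGATTTT", GENE)
print(blocks, skipped(blocks))
```

One caveat on reading the output: when the bases on either side of a skipped stretch are the same, the boundary can slide by a base or more without changing the read, so the reported coordinates are one of several equivalent placements. Comparing several reads that cross the same junction is still up to you.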
it's been months since i touched anything bio lol but this looks like fun... will try and give this a shot... gotta do some reading though :)
@Frostbite , it seems to have eaten your question. As other people are still trying the problem at this point, solutions shouldn't include complete sequences of the various transcript variants. Instead please post your solutions as info - how many variants present, how many introns, etc.