Gene Families hacked by Microsoft Excel

A few months ago I was playing with some data provided by the Sanger Institute: Copy Number Variation (CNV) data for 417 genes across a panel of 780 cell lines of interest in oncology. The data came in an Excel file where each column represents a gene and each row represents a cell line.

Since I wanted to integrate these data, I wrote a PHP script to get them into the right format. One of the tasks was to replace gene names with more stable gene identifiers such as Entrez Gene IDs or Ensembl Gene IDs. During this integration process I found out that some of the HUGO gene names provided were out of date:

  • ALO17 is now RNF213
  • CEP1 is now CEP110
  • MSF is now SEPT9
  • NBS1 is now NBN
  • SIL is now STIL
  • TRD is now TRD@
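
The renaming step can be sketched in a few lines. This is a minimal illustration in Python, not the PHP script I actually used; the function name `update_symbols` is made up for the example, and the lookup table contains only the six renames listed above.

```python
# Outdated symbol -> current symbol, exactly as listed above.
RENAMED_SYMBOLS = {
    "ALO17": "RNF213",
    "CEP1": "CEP110",
    "MSF": "SEPT9",
    "NBS1": "NBN",
    "SIL": "STIL",
    "TRD": "TRD@",
}


def update_symbols(symbols):
    """Replace outdated symbols; leave every other symbol untouched."""
    return [RENAMED_SYMBOLS.get(symbol, symbol) for symbol in symbols]


if __name__ == "__main__":
    print(update_symbols(["ALO17", "TP53", "NBS1"]))
```

A plain dictionary is enough here because the list of renames is short; for a whole genome you would rather load the mapping from an official source.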

But the more interesting finding was a gene name that I was not able to retrieve on the NCBI Gene website:

  • sept-06

Why? I found out that the initial gene name was SEPT6, but Excel had automatically reformatted it into the date sept-06.

It seems that at least the following two gene families are affected by Excel: the septin gene family (SEPT1, SEPT2, ...) and the febrile convulsions gene family (FEB1, FEB2, ...).
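
A check like the following could flag such cells before they reach a database. It is only a sketch under assumptions: the function name `guess_original_symbol` is hypothetical, the month-to-prefix table covers just the SEPT, FEB and DEC cases, and the way Excel displays the date depends on the locale (I saw "sept-06", other setups may show something like "6-Sep"), so the patterns are a guess at the common shapes rather than a complete list.

```python
import re

# Month text Excel may display -> gene-symbol prefix it probably came from.
# Assumption: only these prefixes are of interest.
MONTH_TO_PREFIX = {"sep": "SEPT", "sept": "SEPT", "feb": "FEB", "dec": "DEC"}

# "sept-06" or "feb-01" (month first), and "6-Sep" or "1-Dec" (number first).
MONTH_FIRST = re.compile(r"^([A-Za-z]{3,4})[-/ ](\d{1,2})$")
NUMBER_FIRST = re.compile(r"^(\d{1,2})[-/ ]([A-Za-z]{3,4})$")


def guess_original_symbol(cell):
    """Return a guessed gene symbol if the cell looks like a mangled date, else None."""
    text = cell.strip()
    match = MONTH_FIRST.match(text)
    if match:
        month, number = match.group(1), match.group(2)
    else:
        match = NUMBER_FIRST.match(text)
        if not match:
            return None
        number, month = match.group(1), match.group(2)
    prefix = MONTH_TO_PREFIX.get(month.lower())
    if prefix is None:
        return None
    # int() drops the leading zero that the date format added ("06" -> 6).
    return f"{prefix}{int(number)}"


if __name__ == "__main__":
    for cell in ["sept-06", "1-Dec", "TP53"]:
        print(cell, "->", guess_original_symbol(cell))
```

The result is only a guess and should be checked by hand: the date form alone cannot tell you whether the original symbol was really a gene of that family.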

What can we learn from this little integration adventure?

The first lesson is to avoid using the .xls format (Excel format) to store or exchange your data, because values can be silently transformed. If you or other scientists still want to use it, remember to switch off Excel's automatic formatting, for example by setting the cells to the Text format before entering gene names.
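
As an illustration of the plain-text alternative, here is a small Python sketch that reads a tab-separated table with the standard `csv` module, which keeps every value as a string and never reinterprets gene names. The table content, the `cell_line` column name and the cell line names `CL-A` and `CL-B` are invented for the example.

```python
import csv
import io

# Hypothetical tab-separated export: one row per cell line, one column per gene.
TSV_TEXT = "cell_line\tSEPT6\tDEC1\nCL-A\t2\t3\nCL-B\t1\t2\n"


def read_cnv_table(handle):
    """Read a tab-separated CNV table; every value stays a plain string."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_cnv_table(io.StringIO(TSV_TEXT))
gene_columns = [name for name in rows[0] if name != "cell_line"]

if __name__ == "__main__":
    print(gene_columns)
    print(rows[0]["SEPT6"])
```

With a real file you would pass `open(path, newline="")` instead of the `io.StringIO` object.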

The second lesson is that it is better to use a stable gene identifier such as the NCBI Gene ID or the Ensembl ID. To learn more about how the identifiers compare, there is a very interesting web page called A guide to Associating Drug Target Names with Sequences for Querying Databases, which recommends using either the NCBI Gene ID (also called Entrez Gene ID) or the Ensembl ID.

Last but not least, a big thanks to the people from the Sanger Institute (Jorge Soares and his team), who as usual are always eager to help, whatever your problem, question or comment.

Last-minute remarks: in today's discussion at Biostar called What are the most common stupid mistakes in bioinformatics?, Chris Evelo told me that the gene DEC1 (Deleted in Esophageal Cancer 1) is also affected by Excel. Moreover, Simon Cockell pointed to an interesting publication called Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

If you have already run into such funny bugs during your integration process, you are welcome to share them.

The Bioinformatics Open Space : Get/Give help with Biostar

This article is the kickoff of a series of articles called "Bioinformatics Open Space". It will mainly talk about the social web resources that can improve your productivity in bioinformatics. By productivity I mean that they will help you fix a problem, keep you informed or give you inspiration for your daily bioinformatics tasks, so that you won't feel alone. Today I will talk about a great resource that will help you solve your bioinformatics questions: "Biostar, Questions and Answers in Bioinformatics".

Once upon a time

Biostar was launched in March 2010 by Istvan Albert, Associate Professor of Bioinformatics, Biochemistry and Molecular Biology at Pennsylvania State University.

Last minute correction from Istvan: BioStar was launched on Sept 30th 2009, but for the first few months it had no regular visitors other than myself. I kept adding a few questions and answers. Giovanni Dall'Olio (user number 24) was BioStar's first regular contributor, he joined on January 18th, 2010. He brought in a lot of enthusiasm and brought us to a turning point. So I would consider this date as its "birthday"...

Last minute note from Giovanni: wow, thank you for remembering that! :-) For me, it all started with this thread on the biopython-dev mailing list: There were a few other alternatives for bioinformatics questions, all inspired by stackoverflow, but biostar was the only one with a bit of traffic already. After opening a few generic questions to attract newbies (what is your favorite programming language, etc.) the site started attracting visitors and became what it is now.

Biostar : Questions and Answers in Bioinformatics

The aim of Biostar is stated in its title. But that's not all: behind it there are two key ingredients, described below, that make Biostar so appealing: the Stack Exchange system and the great Biostar users who take the time to share their bioinformatics knowledge.

In fact I find it so useful that last year (2010) I closed the Google Group Group-4-Bioinformatics that I was administering. The explanation is very simple. First of all, I think it is much better to have only a few places, ideally one, to discuss bioinformatics questions. And since I think that the Biostar ergonomics are much better than those of Google Groups, I decided to close my Google group and suggest that newcomers join other bioinformaticians on Biostar. The success of the Stack Exchange ergonomics is, I believe, the reason that pushed Google to change the design of Google Groups last year.

Great Ergonomics : The Stack Exchange model

So what makes the Stack Exchange system so different from a regular forum?

First of all, the listing of the questions is very informative and easy to scan. In just two lines you get an at-a-glance view of:

  1. The subject : What is the question about?
  2. The rating : How other people evaluate the question. Like all ratings, this depends on the users. But in general a high rating means the question is interesting for a lot of people, either because it solves a similar problem they have or had, or because they simply think it is a good question and are curious about the answers provided.
  3. The number of answers : It shows how many answers have been provided so far to the question. And we are not talking about replies that could include tons of comments like in a usual forum, but real answers. Comments related to a given answer are embedded in a related block. If this number is in green, like in our instance below, it means that one of the answers has been accepted as the best one by the person who asked the question.
  4. The number of views of the question : It gives a sense of the popularity of the question.
  5. The tags : Tags allow people to categorize the questions. They can help you to highlight questions related to a given topic.
  6. The last edit : It tells you when the question was last answered or edited.
Biostar response instance

Another interesting feature of Biostar is that the question you ask or the answer you give can be rated by your peers. One of the aims of this feature is that the best rated answer for a given question is automatically moved just below the question. It makes sense because it is supposed to be the best response according to other people, so it saves time for visitors, who are usually looking for it.

Moreover, there is a kind of credit linked to your profile. Each time someone gives you a positive rating you receive about 10 points of credit. It is like a numerical thank-you for the time you spend trying to help other people. You can see Biostar users ordered by their credit here.

Great users : people who take time to help

A great application is nothing without great content.

Indeed, what would the Google Maps and Google Earth applications be if the Googlers didn't take pictures of all the streets in the world with their strange Google cars? Nothing!

What would an Antibody Database be if scientists didn't take the time to share information about their Western Blot experiments? Nothing!

It is the same with Biostar, and fortunately there are nice folks who take the time to help, like Pierre, Neil, Kadher, Giovanni and many others.

Knowledge is the only treasure that increases on sharing

So if you want to help other bioinformaticians and/or get help, you are more than welcome to join the party. Signing in is pretty straightforward since you can use an OpenID like for

Of course if you are aware of any other resource that can give help to people in the field of Bioinformatics you are welcome to share it through the comments feature.

The Bioinformatics solutions

Cool data visualization using augmented reality animation

Each Friday my colleague Steve sends colleagues his nice "week-end reading" newsletter on interesting topics in bioinformatics or biostatistics. Just before the Christmas vacation it was about an interesting BBC video of Hans Rosling, a professor of global health at Sweden's Karolinska Institute.

It is very funny because it reminded me that this guy is cited in Garr Reynolds's book "Presentation Zen" on page 207 : "Conventional wisdom says never, never stand between the screen and the projector. Generally this is good advice. But as you can see from the photo, Rosling at times defies conventional wisdom and gets involved with the data in an energetic way that engages his audience with the data and his story".

This video is a nice example of how Rosling brings his presentations to life. Moreover, this time he uses augmented reality animation, which takes him one step deeper into the heart of his presentation. In about 4 minutes he shows the evolution of life expectancy against income in 200 countries over the last 200 years.

You can see more cool presentations by Rosling or other good speakers at the TED website. The annual TED (Technology, Entertainment, Design) conference brings together the world's most fascinating thinkers and doers, who are invited to give insanely great talks on stage in only 18 minutes.

Related reading suggestions

My first Bioinformatics Zen Presentation

One year ago I discovered two very interesting books about PowerPoint/Keynote presentations: "Presentation Zen" and "Presentation Zen Design". Through his books Garr Reynolds gives advice on delivering engaging and appealing oral presentations. I found it very interesting because sometimes it is not so easy to keep scientists/biologists interested in the bioinformatics application we are presenting. In June 2010 I started to adapt an old Excel plug-in dedicated to helping the scientists of our pharmacology group analyze their Meso Scale raw data. Meso Scale assay technology provides a rapid and convenient method for measuring one or more protein targets within a single small-volume sample. Their 96-well plates supply a platform for the development of sandwich immunoassays.

When I was preparing my presentation of the plug-in I decided to spend some of my spare time finding ideas, pictures and sentences in order to produce a zen presentation that would keep researchers awake even if it took place right after lunch!

So this article shows the pictures that helped me to illustrate most of the messages/ideas I wanted to convey to the future users of the application. Hopefully it will give you inspiration for your future bioinformatics presentations. At the end you will find some good book(s) and iPhone application(s) related to the presentation zen philosophy.

Introduction slide 1 - why did I take care of the plug-in development : The fake reason

Urban legend

The message

To start my presentation I wanted to establish a connection between the audience and me. So I used two slides to explain why I took care of the plug-in implementation instead of people from our IT department. And to bring some fun I provided two explanations, a fake one and a real one. The fake one is related to a colleague and friend of mine who initiated this mission. Since he will work in Cambridge, Massachusetts next year (wish you the best Ronan), I decided to invent a joke about Chinese fortune cookies and their embedded messages. I said that Ronan was once in a Chinese restaurant and got two fortune cookies. The first message was that he would soon be linked to Boston. That will be true in a few days, which gave credibility to my story. And the second message was about my involvement in the plug-in development.

The picture

The two fortune cookies illustrate my story. If I remember well it was a free picture available at iStockphoto. Sometimes you look for pictures to illustrate your ideas. But this time it was the picture that brought me the story.

Introduction slide 2 - why did I take care of the plug-in development : The real reason

excel plug-in next exit

The message

For the second slide I wanted to tell the truth. So I explained that a few years ago I had already implemented a similar plug-in for our biomarker group. Starting from this version instead of starting from scratch would require less development time. Then, through discussions, comments and suggestions, I would be able to provide a new version that fits the users' needs in a short period of time. So it would save meetings, time and money for everybody.

The picture

I bought the original picture at iStockphoto. I wanted to symbolize that developing the plug-in internally was the best way to get something implemented as soon as possible.

The User Guide

Uniform results

The message

To say that the user guide is easy to read and contains a lot of nice pictures that explain every step of using the plug-in.

The picture

I took a picture of my kids reading the user guide in order to show how simple it is to read. I have to admit that I still get user questions that could be answered by the yellow tips or red remarks in the user guide. But at least there is interaction between the developer and the users, and that is fine because it brings new ideas for improvement.

Very easy to use

easy to use

The message

To illustrate the fact that the application can be mastered within minutes.

The picture

I used a freely available picture from iStockphoto showing a senior couple in front of a computer. They seem to enjoy the Excel plug-in.

Speed-up your analysis

Speed up analysis

The message

To highlight that by using the plug-in, users decrease the time for their analysis from 6-7 hours to about 1 hour.

The picture

I bought a picture at iStockphoto with two turtles. Turtles are famous for being slow. Here, one turtle is looking at the other one, which wears a rocket on its back that gives it speed. So I used Photoshop to label the rocket "Excel plug-in".

Avoid the copy/paste boring operation

copy paste operation

The message

Analyzing the Meso Scale raw data with their 96-well plate format can be very boring. Moreover, you can introduce a lot of errors through the copy and paste operations that you have to perform in order to associate information like drug names, concentrations, administration times, dilutions, sample amounts, number of groups, number of samples per group, and so on.

The picture

Once again I bought a picture at iStockphoto (even if, in retrospect, I should have taken a picture of one of my kids). I wanted to symbolize the copy/paste operation with a kid using glue. This manual activity reinforces the feeling of a boring operation from an adult's point of view.

Generating uniform results among different users

Uniform results

The message

The point is to demonstrate that whoever generates the result file, it will respect a common format. Of course, there will be some differences, such as the colors or the information shown in the graphics and tables, because of users' personal preferences. But in general someone who opens the result file will be familiar with the format and the positions of the data, even if the file was generated by a colleague.

The picture

This picture is from my last summer vacation in Sweden. You can find these houses at Pixbo near Göteborg, where my friend Lars lives with his family. As you can see, all the houses look very similar even if there are some differences, like the color of the walls. I thought it was a good comparison with the generated files.

Constant improvement

constant improvement

The message

During the six months of development of the new version of the plug-in I improved or added a lot of features thanks to users' comments, suggestions and questions. I wanted to express the fact that the ergonomics of the plug-in were a lot better than in the initial version I launched 4 years ago.

The picture

One more time the origin of the picture is iStockphoto. The evolution of the wheel over centuries is a good analogy with the improvement of the plug-in over time.


Multi function

The message

During the testing of the plug-in some users tried to analyze experimental raw data from other protein detection systems that use the 96-well plate format. They just used a home-made template where they pasted their 12 columns x 8 rows of data. Then they launched the plug-in and got the same result as the one they used to produce in a less automatic way. In the end it turned out that it was possible to process raw data from Luminex or Tecan technology. The question is: what else?
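
To make the 12 columns x 8 rows layout concrete, here is a small Python sketch that flattens such a grid into labelled wells. It is not the plug-in's code: the function name `plate_to_wells` is hypothetical, and I assume the usual plate labelling with rows A to H and columns 1 to 12.

```python
import string

ROWS = string.ascii_uppercase[:8]  # "ABCDEFGH"
N_COLS = 12


def plate_to_wells(grid):
    """Flatten an 8 x 12 grid into {well_id: value}, from 'A1' to 'H12'."""
    if len(grid) != len(ROWS) or any(len(row) != N_COLS for row in grid):
        raise ValueError("expected 8 rows of 12 values")
    return {
        f"{ROWS[r]}{c + 1}": grid[r][c]
        for r in range(len(ROWS))
        for c in range(N_COLS)
    }


if __name__ == "__main__":
    # Dummy readings: the well index 0..95 stands in for a real signal.
    demo = [[r * N_COLS + c for c in range(N_COLS)] for r in range(len(ROWS))]
    wells = plate_to_wells(demo)
    print(wells["A1"], wells["H12"], len(wells))
```

Once every value carries its well label, it no longer matters which instrument produced the grid, which is why a simple pasted template was enough for the other technologies.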

The picture

Multi function

"What else?" is the very famous slogan (at least in France) of the coffee machine brand Nespresso, which has George Clooney as its ambassador. So I decided to make a fake advertisement where George is promoting the Meso Scale plug-in. I did it because this presentation was for internal purposes only, and I don't think that Nespresso will sue me for that slide. I just used Google image search to find a picture that would fit my needs and used Photoshop to do the customization. As a bonus I appended below the first picture I wanted to use. It is a Swiss army knife, which also symbolizes the multi-task purpose of the plug-in.

Make the 96-well plates look more interesting

Multi function

The message

At the beginning of the demonstration of the plug-in I wanted to show the 96-well plates in a less boring way than a simple 12 x 8 Excel table. So I thought that turning it into a game would be more interesting for future users, and I chose Connect Four (Puissance 4 in French).

The picture

I took a basic illustration that I modified in Photoshop to get the 96-well plate format with its 12 columns by 8 rows. Then I added some red and yellow tokens that I labeled to distinguish the different treatments: C for control, 1 for treatment 1, 2 for treatment 2, and so on.


The message

Since the developer and the users of the plug-in are in the same department, I wanted to insist on the importance of reporting any problem or asking any question instead of being stuck at some step of the analysis. It is very important to take advantage of the fact that the "help-desk" is not anonymous and sits right beside the users.

The picture

I took this picture from iStockphoto. The man is desperate with his computer, and it seems that he has waited too long before calling the right people for help. This is the situation that I don't want to see with the plug-in's users: waiting too long before asking for help.

Thank You

The message

I wanted to say thank you to all the users who helped me to improve the plug-in through their comments, questions, criticisms and suggestions. They also tested the plug-in in order to see whether I had missed a user behavior that would lead to a bug.

The picture

I wanted to show more or less anonymous people. Since these people are in the same department I also wanted to highlight their team spirit. The aim was also to symbolize that the plug-in was tested and challenged in rough conditions. Very naturally I chose this iStockphoto picture of players from the same American football team. As you can see, there is plenty of room to write the names at the bottom of the picture.

Conclusion : Lesson learned from this first Zen Presentation

It was a lot of fun to prepare this new kind of presentation. It took some of my spare time and some bucks, but it was well worth it. Indeed, people didn't sleep at all and were surprised by most of the big pictures that popped up like advertisements. And most importantly, it seems that they took in all the ideas I wanted to share.

From a technical point of view, I added a few words to each picture to explain the message I wanted to deliver. In general they were on a rectangular layer with a semi-transparent white background located at the bottom of the slide. I used the Garamond font in black. But this font is not installed by default on every computer, so I had to install it on the computer I used for my presentation in order to avoid a bad surprise.

Recently I gave another presentation about an internal Antibody Database. I also used some nice pictures to illustrate some key points of such a social and scientific web application. It will be the subject of an upcoming article.

If you want to share your best images and their related key messages, feel free to add them through the comments feature.

Bonus 1 : Books suggestion

In my library

I learned a lot of good things from these two books. Their aims are slightly different.

The first one (Presentation Zen) teaches you the techniques for delivering simple and effective presentations. The main pieces of advice I got from it are :

  • Sketch the presentation on Post-its using pen and paper. This method frees your mind from any technical constraints so that you can focus only on your presentation.
  • Avoid producing a "slideument". You know, the slide with three graphs and three paragraphs for which a Word document would be better suited.
  • One slide with one big picture plus a few words representing a main idea, together with your voice, is more effective than a slide with ten bullets.

In my Christmas list

I don't have these books yet, but I hope to get them soon. "slide:ology" got nice reviews on Amazon and I am very curious to see whether it delivers the same message as Garr Reynolds. And "The Naked Presenter" is Garr's latest book, so I am sure this one is more focused on the speaker and may provide some techniques to connect quickly with your audience.

Bonus 2 : iPhone applications suggestion


On this topic there is PresenterPro, which is freely available. It reminds you of some good principles for delivering nice presentations, in the same spirit as Garr Reynolds.

The Body Browser By Google

Google is providing a new lab called the Body Browser. This new browser uses WebGL technology for 3D rendering (instead of Flash), so you can explore the human body as you would explore the Earth with Google Earth. You can show or hide several layers such as skin, bones, muscles etc. Then you can click on objects to get their respective labels. If you want to give this new tool a try you need a recent browser compatible with WebGL, such as Chrome 9 or Firefox 4.


The starting point is very similar to Google Earth. On the left you have the zoom and orientation controls and on the right you have the object at its maximal size. There is also a search box in the top right corner and a slider below the zoom.

Body Browser overview

The Slider feature

At the bottom of the slider there is a button that you can switch from left to right in order to change how the slider works.

Body Browser slider

Left position: You can move the slider from top to bottom. The body layers are then hidden in the following order:

  • Skin
  • Muscles
  • Bones
  • Organs
  • Blood system
  • Nervous system

Here I just displayed the nervous system.

Body Browser slider

Right position: You can hide each of the layers separately. Below I masked all the layers except the blood system.

Body Browser slider

The Search feature

In the search box I tried kidney and it automatically zoomed to the labelled organ. Unfortunately there is no link to an information web page such as Wikipedia.

Body Browser search

The search box has a suggestion feature. So if you start to type femur you will get several suggestions that give you an idea of what kind of object you can look for with the Body Browser.

Body Browser search

What's next ?

This new Google lab is very interesting. I don't know if it will be possible to switch to a male body, since there are differences between male and female anatomy.

And another question is: will we get a new lab with a Google Genome Browser?

New design

Six years ago, I started this website for fun and also in order to improve my HTML, CSS, SQL, PHP and JavaScript skills. I also wanted to play with Google services like Google Analytics, AdSense, FriendConnect and Search. During this period my experience in the ergonomics and design of websites and web applications was improving. And at the beginning of 2010 I was shocked by how ugly the interface of the site was.

Hunting for a nice design

Then I decided to take some time in order to improve it. I looked around to identify a good design. I wanted it to be simple and nice at the same time. I browsed a lot of free design template websites but I never found the One.

And as often happens, it was when I was not hunting for my precious that I finally discovered the simple and nice design I was looking for with : Simple, practical, nice, with an easy to read font. Its structure : a nice illustration for the header, a practical horizontal menu and a main section with its sidebar for extra features and options. I was ready to work on it. I liked it so much that I also used it for a website dedicated to kinesitherapy.

Since I am not a graphic designer at all, I browsed the iStockphoto website and downloaded this nice train and its books, which I customized with some references to bioinformatics.

So what's new and what's next

Now you have easy access to all the Bioinformatics sections through the horizontal menu. For jobs, companies, laboratories, degrees and journals resources you have filters to select the ones you are looking for.

All the subsections of the previous version are now treated as tags in order to make it easier for me to maintain the resources section.

Depending on my spare time, I will try to deliver news and articles with a broad scope but always related in some way to bioinformatics. Here is a short list of potential upcoming articles :

  • Get Ensembl IDs using Ensembl public MySQL Servers
  • Entrez Gene ID is the strongest link
  • Molsoft allows you to embed structure into website and presentation
  • Bioinformatics Open Space : Get inspiration
  • Bioinformatics Open Space : Get and Give help
  • My first try to deliver a Bioinformatics Zen presentation

I will try to make use of the Google Friend features in order to allow members to save their favorite jobs, journals or resources.

That's all folks, and I wish you a happy Thanksgiving.