A few months ago I was playing with some data provided by the Sanger Institute: Copy Number Variation (CNV) data for 417 genes across a panel of 780 cell lines of interest in oncology. The data came as an Excel file in which each column represents a gene and each row represents a cell line.
Since I wanted to integrate these data, I wrote a PHP script to get them into the right format. One of the tasks was to replace gene names with more stable gene identifiers such as Entrez Gene IDs or Ensembl Gene IDs. During this integration process I found out that some of the HUGO gene names provided were out of date:
- ALO17 is now RNF213
- CEP1 is now CEP110
- MSF is now SEPT9
- NBS1 is now NBN
- SIL is now STIL
- TRD is now TRD@
But the more interesting thing was a gene name that I could not find on the NCBI Gene website at all.
Why? It turned out that the original gene name was SEPT6, but Excel had automatically reformatted it into the date sept-06.
Two gene families seem to be affected by Excel in this way: the septin gene family (SEPT1, SEPT2, ...) and the febrile convulsions gene family (FEB1, FEB2, ...).
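A small check can flag header cells that look like mangled gene names before they reach the database. The Python sketch below is a guess at how this could be done, not a complete solution. The function name `guess_original_symbol` is hypothetical, and the patterns only cover month-then-number forms such as sept-06 and number-then-month forms such as 6-Sep. The latter is an assumption about how other Excel locales display the date. Any hit is only a candidate and still needs to be checked by hand.

```python
import re

# Month abbreviations Excel may display, mapped to the gene-symbol prefix
# they could have come from. Only the families discussed in this post.
MONTH_TO_PREFIX = {"sep": "SEPT", "sept": "SEPT", "feb": "FEB", "dec": "DEC"}

# "sept-06" or "Sep-06" style cells (month first, then number).
MONTH_FIRST = re.compile(r"^([A-Za-z]{3,4})[-/ ](\d{1,2})$")
# "6-Sep" or "1-Dec" style cells (number first, then month).
NUMBER_FIRST = re.compile(r"^(\d{1,2})[-/ ]([A-Za-z]{3,4})$")


def guess_original_symbol(cell):
    """Return a candidate gene symbol for a date-like cell, or None."""
    text = cell.strip()
    match = MONTH_FIRST.match(text)
    if match:
        month, number = match.group(1), match.group(2)
    else:
        match = NUMBER_FIRST.match(text)
        if not match:
            return None
        number, month = match.group(1), match.group(2)
    prefix = MONTH_TO_PREFIX.get(month.lower())
    if prefix is None:
        return None
    # int() drops the leading zero: "06" -> 6, giving "SEPT6".
    return f"{prefix}{int(number)}"


if __name__ == "__main__":
    for cell in ["sept-06", "1-Dec", "TP53"]:
        print(cell, "->", guess_original_symbol(cell))
```

Real gene symbols such as SEPT6 or TP53 contain no separator between letters and digits, so they are left alone. Only cells that already look like dates are reported.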
What can we learn from this little integration adventure?
The first lesson is to avoid using the .xls (Excel) format to store or exchange your data, because values can be silently transformed. But if you or other scientists want to use it anyway, then remember to switch off Excel's automatic text formatting.
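One alternative is to exchange the table as plain tab-separated text and read it with a parser that keeps every cell as a string. The Python sketch below assumes a layout like the one described above, with one column per gene and one row per cell line. The function name `read_cnv_table`, the header label `cell_line`, the cell line name and the copy number values are all invented for the example.

```python
import csv
import io


def read_cnv_table(handle):
    """Read a tab-separated CNV table, keeping every cell as text.

    The first row holds the gene symbols (after a leading label column),
    and each following row holds one cell line.
    Returns (genes, table) where table[cell_line][gene] is a string.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genes = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = dict(zip(genes, row[1:]))
    return genes, table


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = "cell_line\tSEPT6\tDEC1\nLINE_A\t2\t3\n"
    genes, table = read_cnv_table(io.StringIO(sample))
    print(genes)
    print(table["LINE_A"]["SEPT6"])
```

The `csv` module never tries to guess types, so SEPT6 stays SEPT6. Converting the copy numbers to integers is then an explicit step instead of something the tool decides on its own.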
The second is that it is better to use a stable gene identifier such as the NCBI Gene ID or the Ensembl ID. To learn more about how the identifiers compare, there is a very interesting web page called A guide to Associating Drug Target Names with Sequences for Querying Databases, which recommends using either the NCBI Gene ID (also called Entrez Gene ID) or the Ensembl ID.
Last but not least, a big thanks to the people from the Sanger Institute (Jorge Soares and his team), who as usual were eager to help, whatever the problem, question or comment.
Last-minute remarks: following today's discussion at Biostar called What are the most common stupid mistakes in bioinformatics?, Chris Evelo told me that the gene DEC1 (Deleted in Esophageal Cancer 1) is also affected by Excel. Moreover, Simon Cockell pointed to an interesting publication called Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
If you have already run into such funny bugs during your own integration work, you are welcome to share them.