News, Events, Blogs

Could fake data undermine the age of open data in life science research?

Blog 15 August 2017

In the past year, the credibility of news, facts, and data published on the web has been severely damaged. The US election in 2016 brought into focus just how vulnerable everyday web users are to fake content being published - content that looks authoritative, is widely shared and re-shared, and sounds plausible, but is ultimately entirely invented. Oxford Dictionaries, perhaps the fount of all truth, even went as far as to name 'post-truth', often seen as a bedfellow of fake news, as its word of the year for 2016.

In the life sciences, openness of data has become a byword for good scientific research. Public and private organisations involved in developing new drugs, treatments and general health care products rely on open data as an important weapon in their arsenal when developing hypotheses. Indeed, many working in academic sectors rely heavily on it to conduct their work.

Publishing data openly is now mandated for many publicly funded research projects, such as those funded by the National Institutes of Health in the US and the UK government research body, BBSRC. The general intention is that open data should increase the chances of novel discoveries and provide an effective way of communicating and validating experimental evidence to this end.

There is an obvious disconnect between these two worlds. In one world, openly published data - facts, articles, etc. - is frequently questioned as to its legitimacy and there is a growing trend towards shutting down particular outlets. In the other, open life science data is considered valuable and there is a desire to publish ever greater amounts. This disconnect may exist for many reasons, most of which I am not qualified to opine on. Perhaps scientists, especially those working in academia, are altruistic to a point, and so we trust them. It may also be that peer recognition is a much sought-after reward, and that certainly requires publishing results and data. The two are not unrelated. Personally speaking, I have not received funding from government grants for quite a long time now and yet I still review papers. I still publish papers. That's because I feel it's my duty as a member of a community to do my bit.

Trust is exploitable

That trust is a key currency in the open data world should be a concern. The trust in openly published scientific data is implicit and this, in cyber security terms, represents a major vulnerability. This trust can of course have many faces. There is trust that comes from peer-review. There is also trust granted because the publishers are from noteworthy organisations, such as large universities. Even then though, there is precedent for fake scientific data. In 2012 Dutch psychologist Diederik Stapel was found to have published fabricated data in 30 peer-reviewed papers. In the same year, anesthesiology researcher Yoshitaka Fujii was found to have fabricated 183 scientific papers.

Just this month, a major paper on gene editing was retracted but not before the damage had been done. As Nature reported:


It is hard to overstate the impact of the Han paper following its publication last year, especially in China, where the paper originated. Coverage in the Chinese media was extensive, with headlines heralding the discovery of an entirely new gene editing system. The NgAgo report was easily the most widely covered paper in China last year; according to media monitor Meltwater, nearly 4,000 Chinese news stories cited the Han paper in just the first two months after publication.

We are of course vulnerable to honest mistakes, but fortunately these can often be picked up in peer review; pointing out an error to an author can often bring mutual enlightenment. Importantly, though, we are also vulnerable when data is deliberately faked, and it is the latter that may be harder to spot when there is motivation to hide it.

It is not hard to envisage a scenario in which open data becomes vulnerable to malicious practice on a far wider scale than we have seen so far, deliberately generated en masse and published with misleading claims and evidence. Clearly, this has happened already in isolated but high-profile incidents. Creating data from existing data is trivially easy and altering some crucial parts of this data - such as some of the metadata describing the diseases, or phenotypes, or perhaps altering a few base pairs in a sequence read, or a few measurement readings of significance - is similarly trivial. Importantly, much of this could be automated which opens the door to those involved in far more malicious forms of exploits - cyber warfare.

Recent cyber attacks have exposed just how vulnerable we have become to malicious content publishers and those wishing to infiltrate private data with fake data. Cyber security was described by the UK government as one of its top priorities in the coming parliament. The damage done by attacks, such as the attack on the NHS in May, is extensive in both financial and human cost.

Compounding the problem, perversely, is the rate of advancement in science in enabling ever higher volumes of data to be generated at a cheaper cost. For organisations involved in publishing data, analysing this and confirming findings is a task that simply will not scale in the medium term. There are countless published articles about the difficulty in reproducing results in legitimate scientific studies (see http://www.nature.com/news/reproducibility-1.17552). That legitimate studies are published which are hard to confirm hardens the idea that there is a vulnerability in the system.

Damaging implications, but not life-threatening

The implications are potentially damaging but are not likely to lead to drugs that stop working or treatments that make people ill; the systems for testing data are extremely rigorous once a project reaches any further stage of research, and they require data to be confirmed de novo. Downloaded data is not going to suffice. It would be incorrect to state this will lead directly to damage to human health. It won't. The impact is more on the cost of getting to the early stages of research in the first place - lost months chasing erroneous targets which have leaked into a system which relies on trust.

Over time, an accumulation of fake data will begin to poison the well and determining what is fake and what is not will be hard. At this point, researchers will begin to trust only certain types of data from certain submitters - specific labs in specific organisations. The system will live on, but the democratisation of research that open data is bringing about will have been badly damaged.

Why?

An obvious question at this point would be: why would anyone do this? I'm barely qualified to comment on the motivation of hackers, so I will refer you to other articles on that [12]. There are already countless cases of hacking for the pleasure of hacking. Some see it as a constructive learning experience for both sides. Some use exploits for political gain, destabilising functioning systems. Some use exploits for financial gain, for instance, maintaining records of data that has been tampered with, in this case the 'fake data', and selling that list back to consumers such as large organisations. What we do know is that if there are vulnerabilities, exploits can occur.

Sleepwalking into this situation could be avoided, however. There are already constructs in place which can be utilised, not least the teams of data curators that help QC data, though this is impacted by reductions in scientific funding and increasing volumes of data. Algorithmic approaches may also help, although spotting fake data that could, in theory, be biologically plausible is very challenging for an algorithm. Such an algorithm should also not dismiss data as fake simply because it is novel or rare; not having seen a finding before is not sufficient as a flag.

There are constructs in other fields where trust is a key currency that could play an important role here. Social media platforms, such as Twitter, have long granted the 'blue verified badge' (often called the 'blue tick') to noteworthy users - not to imply that what they say is true of course, but to verify that they are who they claim to be and that they are what Twitter calls of 'public interest'. Organisations that publish open data generated by the public, such as NIH, EMBL, Nature Group, could coordinate with institutes and companies to verify that submitters are who they claim to be, ensure they use a 'verified' institute email, and so on. This is a relatively low overhead, though it is not entirely free of cost. It would at least provide a clear audit of who published the data.
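As a rough illustration of the 'verified' institute email idea, the check below accepts a submitter only if their address belongs to a domain on an agreed list. This is a minimal sketch under stated assumptions: the domain list, the function name `is_verified_submitter` and the example addresses are all hypothetical, and a real scheme would also need to confirm that the submitter controls the address.

```python
# Hypothetical list of institute domains agreed between data publishers
# and institutes; these names are placeholders, not real organisations.
VERIFIED_DOMAINS = {"example-university.ac.uk", "example-institute.org"}


def is_verified_submitter(email: str, verified_domains: set) -> bool:
    """Return True if the email's domain is a verified domain or a subdomain of one."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return False  # not a usable address
    domain = domain.lower()
    return any(
        domain == verified or domain.endswith("." + verified)
        for verified in verified_domains
    )
```

Matching on the whole domain, rather than looking for the institute name anywhere in the address, means a lookalike such as `example-university.ac.uk.evil.com` is rejected.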

Verifying audit trails and transactions is also one of the strengths of Blockchain, the technology behind Bitcoin. This could be employed to provide verifiable trails back to the originator of a data set as well as whoever has touched the data since, and indeed if the data set has been altered since that original publication. This would require something of a paradigm shift in open data - that the publisher no longer has to be one of the big central organisations and could be anyone, though it would require a coordination effort to agree on how that verification takes place. Fortunately, getting together to develop standards is not alien to the life science community.
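The core of such a verifiable trail can be sketched without a full blockchain. The example below is a simplified, single-party hash chain, not a distributed ledger and not any particular blockchain product: each entry records who touched the data, a SHA-256 digest of the data set at that point, and the hash of the previous entry. The function names, field names and actors are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _entry_hash(body: dict) -> str:
    # Serialise with sorted keys so the same body always hashes the same way.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))


def append_entry(chain: list, actor: str, action: str, dataset: bytes) -> dict:
    """Append a provenance entry that is linked to the previous entry by its hash."""
    body = {
        "actor": actor,
        "action": action,
        "data_hash": sha256_hex(dataset),
        "prev_hash": chain[-1]["entry_hash"] if chain else GENESIS,
    }
    entry = dict(body, entry_hash=_entry_hash(body))
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Check that no entry has been edited, removed or reordered."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev_hash or _entry_hash(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


def matches_latest(chain: list, dataset: bytes) -> bool:
    """Check that a downloaded data set is the one recorded in the latest entry."""
    return bool(chain) and chain[-1]["data_hash"] == sha256_hex(dataset)
```

With a trail like this, changing a single base in a sequence changes the digest, so `matches_latest` fails, and rewriting an earlier entry breaks every later link, so `verify_chain` fails. What it cannot do on its own is tell you whether the data was genuine when first published; that still rests on knowing who the originator is.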

Forewarned is forearmed

So why am I writing this? Because I generated some fake data recently and in the process of testing I lost which was the fake bit. We needed some for testing an algorithm we are developing and I needed both real data and data that was a bit like the real data, but slightly different. So I set about programmatically changing subtle parts of the data sets to generate this new real + fake data set. It took less than an hour and after that time it was hard to tell them apart. Did I publish it? No, absolutely not. Could I have published it? Quite possibly. So I write this not as a ludicrous scare story that we need to stop using open data immediately, but more as a 'we should probably think about this a bit'. Forewarned is forearmed kind of thing. And this sort of thing does happen.

In July 2017, fake data was deliberately published by Nasdaq as the exchange closed, solely for testing purposes and not for wider consumption. There were many warnings that this was going to happen. Despite this, various major outlets, including Bloomberg and Google, published this data as real. So anyone can be fooled, even when they are told about it in advance.

Caveat emptor.


For further information, please contact:
Tony Stephenson
Chief Operating Officer, FactBio
T: 07899 796655
E: tony@factbio.com


FactBio To Sponsor ICBO 2017
FactBio CEO will also give keynote speech

News 9 August 2017

FactBio is sponsoring ICBO 2017 (International Conference on Biomedical Ontology) in Newcastle, UK.

The global event, which is now an established feature in the ontology world, brings together representatives of all major communities involved in the development and application of ontologies in biomedicine and related areas. Together, they address issues pertaining to coordinated development of ontological resources, as well as their optimal use in applications.

In addition to FactBio's sponsorship of the event, Dr James Malone, CEO of FactBio, will give a keynote speech on the future of bio-ontologies and data curation. The talk will cover topics including the impact of machine learning and crowdsourcing on building ontologies and the role ontologies play in security and fake data detection.

Dr James Malone, CEO of FactBio said: "With the growth of ontologies in managing biological data, ICBO is now a crucial event for ensuring the best practices are shared globally. We are delighted to sponsor such a prestigious event, and look forward to hearing the many excellent presentations and discussions."


FactBio Launches Kusp Version 2
Upgraded platform now includes collaboration features and better tools

News 5 July 2017

FactBio has launched Version 2 of Kusp (Knowledge Sharing Platform), its flagship product.

The improved platform boasts a number of new features, which have been developed in response to user feedback. These include a collaborative element and improved annotation rules.

A major new feature is the improved collaboration tool which allows users to share data they load into Kusp with other users or groups of users. Collaboration features allow sharing of BioBuckets between individuals and between groups of people along with setting permission levels such as 'viewer' or 'editor' of a BioBucket. Alongside this, there is a groups feature to group datasets together into related categories, which can then be shared with collaborators.

The improved annotation rules allow users to add context specific rules. This could include specific rules related to individual companies and organisations, and allow users to tailor matches to specific column headers. This includes partial matching, which allows users to accelerate auto-annotation, enabling individual cells to be matched based on parts of their values. This helps to cut through the noise often found with data to detect the value of interest, and accelerate the annotation process. In addition, Kusp now has stopwords, which enable users to tell Kusp to ignore specific values not intended to be annotated.
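To make partial matching and stopwords concrete, here is a minimal sketch of how a single cell might be annotated. It is not Kusp's implementation or API: the term dictionary, the `EX:` identifiers, the stopword list and the function name `annotate_cell` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical label -> identifier dictionary; the EX: IDs are placeholders,
# not identifiers from a real ontology.
TERMS = {
    "liver": "EX:0001",
    "breast cancer": "EX:0002",
    "cancer": "EX:0003",
}

# Values the user has asked the annotator to ignore.
STOPWORDS = {"n/a", "unknown", "none"}


def annotate_cell(value: str, terms: dict, stopwords: set) -> Optional[str]:
    """Return an identifier for a cell value, or None if nothing matches."""
    cleaned = value.strip().lower()
    if not cleaned or cleaned in stopwords:
        return None  # stopword or empty cell: do not annotate
    if cleaned in terms:
        return terms[cleaned]  # exact match on the whole cell
    # Partial match: try the longest labels first so that "breast cancer"
    # wins over "cancer", and only match on whole-word boundaries.
    for label in sorted(terms, key=len, reverse=True):
        if re.search(r"\b" + re.escape(label) + r"\b", cleaned):
            return terms[label]
    return None
```

A noisy cell such as `Sample 12 - liver tissue (frozen)` is annotated through the partial match on `liver`, while `delivery` is not, because the match has to fall on word boundaries.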

Finally, Kusp now has faster uploads and downloads of data, which will improve annotation speeds. In addition, FactBio is now working to improve access to Kusp through an API.

Dr James Malone, CEO of FactBio said: "These new features have been developed in response to user feedback. As we expand the number of users for Kusp, we continue to develop the platform to ensure that Kusp becomes the 'go-to' platform for data annotation. In addition to the current improvements included in version 2, we have a series of new developments in the pipeline which will be announced later this year."


FactBio Moves to New Offices and Expands Team

News 27 March 2017

FactBio, a developer of novel bioinformatics software with a focus on improving knowledge management and data sharing, has moved to new offices in Cambridge.

Alongside the move to the new offices in the Innovation Centre on the Cambridge Science Park, FactBio has hired additional staff to support the development of Kusp, its data curation and data discovery platform.

The move to the new offices follows a strong start to the year which has included securing a number of large pharmaceutical companies as clients.

Dr James Malone, CEO of FactBio said: "The new offices and additional staff will enable FactBio to expand. With a growing portfolio of clients and the on-going addition of new features to Kusp, we are now well placed for future growth. The user base of Kusp continues to extend and with the additional staff we will be able to accelerate development and improve the utility of the platform."

FactBio’s new address is: Innovation Centre, Unit 23, Cambridge Science Park, Milton Road, Cambridge, CB4 0EY.


FactBio Secures Distribution Agreement with Filgen

News 5 January 2017

FactBio has secured a distribution agreement with Filgen, a Japan-based distributor of life sciences products.

The agreement will see Filgen granted exclusive distribution rights in the Japanese market for two years to sell the full range of FactBio products, including Kusp, FactBio’s novel data curation and data discovery platform. Financial terms were not disclosed.

Dr James Malone, CEO of FactBio said: "I am delighted to have secured this agreement with a leading distributor such as Filgen. They have built an excellent reputation in selling life sciences products and will be an excellent partner as we commercialise our products."

The deal with Filgen builds on the September 2016 launch of FactBio’s Kusp platform. In addition to this, the company has launched new products to support Kusp, including resource mapping and improved annotation features for Kusp.


FactBio Announces Improvements to the Kusp Data Curation Platform
New features in Kusp 1.1 include resource mapping, and increased numbers of ontology classes, including plant

News 1 November 2016

FactBio has launched Kusp 1.1, an update to its data curation platform which adds a number of key new features.

The new features include a resource mapping module, improved search and annotation algorithms, an improved interface, and faster uploading of data. In addition, the range of ontology classes in Kusp has now been expanded to over 200,000.

The Kusp Resource Mapping module allows users to map reference entities, such as those in ontologies and biological databases, to other reference entities, enabling users to better integrate and interpret their data. Companies and organisations may use different or internal descriptions to describe their own data and by using the Kusp Resource Mapping module, they can connect and align all of their internal data with public data. Kusp now has over 300,000 resource and ontology mappings.
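The idea of mapping reference entities to one another can be sketched as a small lookup over pairs of identifiers. This is an illustration only and not how the Kusp Resource Mapping module is built: the identifiers are placeholders, and treating every mapping as symmetric and transitive is a simplifying assumption (real mappings can be broader or narrower rather than exact).

```python
from collections import defaultdict

# Hypothetical mappings between internal codes and public reference entities.
MAPPINGS = [
    ("internal:TISSUE_07", "EX_ANATOMY:0001"),
    ("internal:DIS_112", "EX_DISEASE:0042"),
    ("EX_DISEASE:0042", "EX_OTHER_DB:D9"),
]


def build_index(mappings):
    """Index each mapping in both directions (assumes mappings are symmetric)."""
    index = defaultdict(set)
    for left, right in mappings:
        index[left].add(right)
        index[right].add(left)
    return index


def equivalents(entity, index):
    """Return every entity reachable from `entity` by following mappings."""
    seen = {entity}
    stack = [entity]
    while stack:
        current = stack.pop()
        for neighbour in index.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    seen.discard(entity)
    return seen
```

Here an internal disease code resolves both to the public term it was mapped to directly and to the entry in a second resource that the public term is itself mapped to, which is the kind of alignment between internal and public data described above.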

The user interface has also been improved, and users can now personalise their experience. This includes allowing users to specify which ontologies and databases they see when they search and annotate, an improved history function, which allows users to see their last five BioBuckets searched, and a feature which shows file upload progress.

Dr James Malone, CEO of FactBio said: "Kusp 1.1 has a number of significant improvements at the request of our users. Through our customer research, we identified a need for the ability to do ontology mapping, which will allow our users to better integrate their own internal annotations with external standardised ontologies. Through this, and a number of other additions, Kusp is now even better placed to support its users, and improve data standards globally."


FactBio Announces Kusp Early Access Programme
KEAP will allow users to propose new features for Kusp

News 3 October 2016

FactBio, a developer of novel bioinformatics software with a focus on improving knowledge management and data sharing, has launched KEAP (Kusp Early Access Programme).

KEAP will allow early adopters of Kusp to work alongside FactBio developers to develop new functions which meet their emerging requirements. This will enable early adopters to have an important role in setting the future direction of Kusp, and to tailor the platform to their specific needs.

Early adopters of KEAP will have direct access to FactBio’s development team and will be able to propose new features and additions to the platform depending on their specific needs. This could include, for example, adding new capacities to the API, supporting new import and export formats and developing new interface modes. Once the FactBio development team has the customer requirements, it will prioritise them and work to add them to the platform.

Dr James Malone, CEO of FactBio said: "Curation of biomedical data using standards is a hugely valuable step, enabling reuse, sharing, and integration of a data set. Kusp aims to make the process of curation as simple and as rapid as possible. It improves accuracy whilst reducing the cost of what has traditionally been an expensive but important task. Given the volume of data now available, aligning towards community standards, such as in reference ontologies, in cost-effective ways is increasingly important for exploiting the data we have now and in the future."


FactBio Launches First Product
New Platform Will Improve Data Curation and Data Discovery

News 12 September 2016

FactBio, a developer of novel bioinformatics software with a focus on improving knowledge management and data sharing, has launched its first product, Kusp (Knowledge Sharing Platform).

Kusp is a platform, developed specifically for the life sciences industry, which improves data curation and data discovery using community standards, to improve the use and integration of data. The platform allows users to simplify and accelerate data curation through automation, and the application of advanced data science techniques.

Kusp offers access to a range of reference ontologies and biomedical identifiers for rapid annotation of a user’s biological entities, with a high level of accuracy. The platform is also able to learn new patterns of annotation based on a user’s inputs, improving future performance by learning over time.

The platform can be configured for a range of users. For larger companies, the platform can be deployed behind a firewall, while single users and small to medium enterprises can use a secure cloud-based service. Alongside this, Kusp has been designed to integrate with users' data at the highest level to ensure data silos are avoided.

Dr James Malone, CEO of FactBio said: "Curation of biomedical data using standards is a hugely valuable step, enabling reuse, sharing, and integration of a data set. Kusp aims to make the process of curation as simple and as rapid as possible. It improves accuracy whilst reducing the cost of what has traditionally been an expensive but important task. Given the volume of data now available, aligning towards community standards, such as in reference ontologies, in cost-effective ways is increasingly important for exploiting the data we have now and in the future."


New Research Provides Ten Rules for Selecting a Bio-ontology

News 16 February 2016

New research published in PLOS Computational Biology has identified ten rules for selecting a bio-ontology.

The rules cover a range of issues, including ensuring that the bio-ontology is about a specific domain of knowledge, that it is current, that its textual definitions are written for domain experts and that it is under active development. However, the authors also go on to add that in certain circumstances, a bio-ontology may not be needed at all.

The research, which comes from scientists at FactBio, the University of Manchester and the European Bioinformatics Institute, sets out a series of rules to allow researchers to better understand how to choose the correct bio-ontology. The rules are:

  • The ontology should be about a specific domain of knowledge
  • The ontology should reflect current understanding of biological systems
  • The ontology classes and relationships should persist
  • Classes should contain textual definitions
  • Textual definitions should be written for domain experts
  • The ontology should be developed by the community but not incapacitated by it
  • The ontology should be under active development
  • Previous versions should be available
  • Open data requires open ontologies
  • Sometimes an ontology is not needed at all

Dr James Malone, CEO of FactBio, and first author of the paper said: "Bio-ontologies represent an important tool for describing and sharing data. Selecting the most appropriate ontology can be a big challenge for researchers, especially those new to the area. We hope this set of rules, while not exhaustive, will help the scientist in deciding how to best choose a bio-ontology that is fit for their needs."

Professor Robert Stevens, from the University of Manchester, and co-author of the paper added: "The primary message of the paper is think about your requirements. There is a desire in the community to share and integrate data which highlights the value of using a bio-ontology. By using the rules described in the paper it is possible to identify if an ontology will be of use to a scientist and help bring some clarity to the frequently asked question of how one selects a bio-ontology."


FactBio Joins Pistoia Alliance

News 22 January 2016

FactBio, a developer of novel bioinformatics software with a focus on improving knowledge management and data sharing, has joined the Pistoia Alliance, a global organisation supporting life sciences R&D.

Dr James Malone, CEO of FactBio said: "The Pistoia Alliance has made excellent progress in supporting life sciences R&D and many of its projects are now having an impact globally. By joining the Pistoia Alliance, we hope to help in tackling some of the challenges faced by life sciences R&D."

FactBio was established in 2015 to develop novel bioinformatics software to improve life sciences research. In particular the company will focus on the development of Kusp (Knowledge Sharing Platform), a knowledge management system which will allow researchers to select a series of BioBuckets and use these to track entities of interest to them, and receive updates of new developments. The entities could include genes, pathways, proteins, or even people and publications as required. Kusp will also be fully integrated into social media, allowing researchers to share their discoveries with the global community.


FactBio to Sponsor SWAT4LS International Conference

News 5 December 2015

FactBio, a developer of novel bioinformatics software with a focus on improving knowledge management and data sharing, is sponsoring SWAT4LS 2015, an international conference on semantic web applications and tools for life sciences.

In addition to providing support in organising the conference, which is happening between the 7th and 10th December in Cambridge, FactBio will also be sponsoring a prize for the best poster at the conference, which will be presented on 9th December.

Dr James Malone, CEO of FactBio said: "We are very pleased to be able to sponsor SWAT4LS. The conference is now one of the main events for semantic web applications and very much supports the products we are now building at FactBio."


New Bioinformatics Company Launched
FactBio To Develop New Products for Knowledge Management

News 2 December 2015

FactBio, a developer of novel bioinformatics software with a focus on improving knowledge management and data sharing, has been established.

FactBio has been established to develop novel bioinformatics software to improve life sciences research. In particular the company will focus on the development of Kusp (Knowledge Sharing Platform), a knowledge management system which will allow researchers to select a series of BioBuckets and use these to track entities of interest to them, and receive updates of new developments. The entities could include genes, pathways, proteins, or even people and publications as required. Kusp will also be fully integrated into social media, allowing researchers to share their discoveries with the global community.

Users of Kusp will be able to use a number of BioBuckets to track entities of interest for free, but for users who want to keep abreast of a wider range there will be an additional charge. Packages will also be available for enterprise users.

Alongside Kusp, which will be launched in early 2016, FactBio has plans to develop a range of additional bioinformatics software products. In addition, FactBio is working with a number of life science companies on a consultancy basis and is running a series of training events for bioinformaticians.

Dr James Malone, CEO of FactBio said: "With the increasing amount of data now available to researchers, there is a need for new and improved ways to monitor an increasing range of interests. Through Kusp, researchers can automate keeping track of their many interests, whether they are individual genes, pathways, drugs, research papers or even people. In addition, FactBio will also be developing advanced analysis software, drawing on our machine learning background."

The company has been established by Dr James Malone, an experienced bioinformatician who will become CEO, and Tony Stephenson as Chief Operating Officer, alongside Simon Jupp as Technology Consultant.

Dr Malone and Mr Jupp have particular experience in projects such as Centre for Therapeutic Target Validation, EBI Linked Data Platform, Orphanet Rare Disease, and Experimental Factor Ontology.

