When scientific citations go out of control: Uncovering ‘hidden references’

July 10, 2024

The solitary researcher—separated from the world and the rest of the larger scientific community—is a classic but misguided image. Research, in reality, is based on constant exchange within the scientific community: First you understand the work of others, and then you share your findings.

Reading and writing articles published in academic journals and presented at conferences is a fundamental part of being a researcher. When researchers write an academic paper, they should reference the work of their colleagues to provide context, detail sources of inspiration, and explain differences in approaches and results. Positive citations from other researchers are a key measure of the visibility and impact of a researcher's own work.

So what happens when this citation system is manipulated? A recent paper by our academic detective team (including information scientists, a computer scientist, and a mathematician) published in the Journal of the Association for Information Science and Technology outlined a sneaky method for artificially inflating citation counts through metadata manipulation: sneaky references.

Covert manipulation

People are becoming more aware of scientific publications and how they work, including their potential flaws. Last year alone, more than 10,000 scientific papers were retracted. The problems around citation games, including the damage they do to the scientific community and its credibility, are well documented.

Citations of scientific works follow a standard referencing system: each reference states at least the title of the cited publication, the names of its authors, the year of publication, the name of the journal or conference, and the page numbers. These details are stored as metadata that is not directly visible in the text of the article. In addition, each scientific publication is assigned a digital object identifier, or DOI, a unique identifier for that work.

References in a scientific publication allow authors to justify their methodological choices or present results of past studies and highlight the iterative and collaborative nature of science.

However, we discovered by chance that when some malicious actors submit articles to scientific databases, they add extra references that do not appear in the text but are present in the metadata of the articles. The result? Citation counts for some researchers or journals skyrocketed, even though those references were never cited by the authors in their articles.

A chance discovery

The investigation began when Guillaume Cabanac, a professor at the University of Toulouse, wrote a post on PubPeer, a website dedicated to post-publication peer review where scientists discuss and analyze publications. In the post, he detailed how he noticed a discrepancy: A Hindawi journal article, which he suspected was fake because it contained strange wording, had many more citations than downloads, which is very unusual.

The post caught the attention of a few detectives who are now the authors of the JASIST paper. We used a scientific search engine to search for articles that cited the original paper. Google Scholar found none, but Crossref and Dimensions found references. What’s the difference? Google Scholar relies heavily on the main text of the paper to extract references that appear in the bibliography section, while Crossref and Dimensions use metadata provided by publishers.

A new type of fraud

To understand the extent of manipulation, we examined three scientific journals published by the Technoscience Academy, which was the publisher of articles containing questionable citations.

Our investigation consisted of three phases:

  1. We listed the references that are visibly present in the HTML or PDF versions of the articles.

  2. When we compared these lists with the metadata recorded by Crossref, we discovered extra references that were added to the metadata but did not appear in the articles.

  3. We found further discrepancies when we checked Dimensions, a bibliometric platform that uses Crossref as a metadata source.
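The core of steps 1 and 2 is a set comparison between two lists of references. As a minimal sketch: the Crossref REST API endpoint (`https://api.crossref.org/works/{doi}`) is real and returns a `reference` list in its metadata, but the helper names and workflow below are our illustrative assumptions, not the code used in the study.

```python
import json
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/{doi}"


def fetch_crossref_references(doi: str) -> list[str]:
    """Fetch the reference DOIs that Crossref records for a given work."""
    with urllib.request.urlopen(CROSSREF_API.format(doi=doi)) as resp:
        message = json.load(resp)["message"]
    # Each entry in "reference" may carry a resolved "DOI" field;
    # entries without one are skipped in this simplified sketch.
    return [ref["DOI"].lower() for ref in message.get("reference", []) if "DOI" in ref]


def find_sneaked_references(visible: list[str], metadata: list[str]) -> set[str]:
    """References present in the metadata but absent from the article text."""
    return set(metadata) - set(visible)


def find_missing_references(visible: list[str], metadata: list[str]) -> set[str]:
    """Legitimate references in the article text that the metadata dropped."""
    return set(visible) - set(metadata)
```

In this sketch, `visible` would come from parsing the bibliography of the HTML or PDF version of an article; any DOI appearing only in `metadata` is a candidate "sneaked" reference.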

In journals published by the Technoscience Academy, at least 9% of the recorded references were "stealth references." These additional references existed only in the metadata, skewing citation counts and giving an unfair advantage to certain authors. Some legitimate references were also missing from the metadata entirely.

Moreover, when we analyzed the covert references, we found that they disproportionately benefited a small number of researchers. For example, a single researcher associated with Technoscience Academy gained more than 3,000 additional covert citations. Some journals from the same publisher gained several hundred each.

We wanted external validation of our results, so we published our work as a preprint, informed both Crossref and Dimensions of our findings, and provided them with a link to the preprint. Dimensions acknowledged the illegitimate citations and confirmed that its database mirrored Crossref's data. Crossref also confirmed the additional references to Retraction Watch, emphasizing that this was the first time it had become aware of such a problem in its database. The publisher took action to correct the problem based on Crossref's investigation.

Conclusions and possible solutions

Why is this discovery important? Citation counts greatly affect research funding, academic promotions, and institutional rankings. Manipulating citations can lead to unfair decisions based on inaccurate data. Even more worrying, this discovery raises questions about the integrity of scientific impact measurement systems—a concern that researchers have been highlighting for years. These systems can be manipulated to promote unhealthy competition among researchers, encouraging them to take shortcuts to publish faster or get more citations.

We recommend several measures to combat this practice:

  • Rigorous validation of metadata by publishers and agencies such as Crossref.

  • Independent audits to ensure data reliability.

  • Increased transparency in the management of references and citations.

This study is, to our knowledge, the first to report this kind of citation-metadata manipulation and to discuss its impact on the evaluation of researchers. It highlights once again that over-reliance on metrics to evaluate researchers, their work, and their impact can be inherently flawed and inaccurate.

Such over-reliance is likely to encourage questionable research practices, such as hypothesizing after the results are known (HARKing); splitting a single dataset into several papers (known as salami slicing); data manipulation; and plagiarism. It also inhibits transparency, which is key to more robust and productive research. While the problematic citation metadata and covert references have now seemingly been fixed, the corrections may have come too late, as is often the case with scientific corrections.

This article was published in collaboration with Binaire, a blog dedicated to understanding digital issues.

This article was originally published in French.

This article is republished from The Conversation, a nonprofit, independent news organization that brings you facts and trusted analysis to help you understand our complex world. By Lonni Besançon, Linköping University, and Guillaume Cabanac, Toulouse Institute for Informatics Research.

Lonni Besançon receives funding from the Marcus and Amalia Wallenberg Foundation.

Guillaume Cabanac is funded by the European Research Council (ERC) and the Institut Universitaire de France (IUF). He is the Director of the Problematic Paper Screener, a public platform that uses metadata from Digital Science and PubPeer through free agreements.

Thierry Viéville does not work for, consult, own shares in, or receive funding from any company or organization that would benefit from this article, and has disclosed no affiliations beyond his academic appointment.
