16 October 2023 Blogs, Academic, Faculty, Librarian

Researchers text mine newspapers to reveal new insights

Text and Data Mining methods like Natural Language Processing allow researchers to extract insights from unstructured data and provide new avenues for exploration

By Matt Tobey, special to the ProQuest blog

A team of researchers from the University of Illinois Urbana-Champaign, Northeastern University, and Washington University in St. Louis is working together to peel back the layers of American history. Armed with algorithms and a grant from the Mellon Foundation, they're text mining newspapers from the 19th and 20th centuries for the project "The Virality of Racial Terror in US Newspapers, 1863-1921." The team’s findings promise to shift our understanding of how racial violence has been reported and, crucially, how it correlates with contemporary narratives.

“We can study how the reporting about these sorts of incidents changes over time and how the characterization shifts over that 60-year period,” the project’s lead researcher Professor Ryan Cordell explained to the University of Illinois Urbana-Champaign News Bureau.

Harnessing the power of machine learning

Levi Boxell and Jacob Conway, economics doctoral researchers at Stanford University, set out to understand media bias by examining the role journalists play in shaping content, and to assess whether journalists or the news outlets themselves have more influence on the political slant of news articles. To test this, they developed a precise measure of media slant and collected a vast dataset of articles from U.S. Newsstream, a large news database from ProQuest, part of Clarivate. They fine-tuned a machine learning model to predict the political affiliations of politicians based on their tweeted articles and applied it to millions of news articles using ProQuest TDM Studio. The findings revealed a trend towards more liberal content, especially after the 2016 election, and indicated that journalists' political affiliations did influence the content they produced.

“By providing comprehensive, full-text data from hundreds of newspapers across multiple years, TDM Studio enabled us to track journalists as they changed outlets and measure media slant with a high degree of accuracy,” said Dr. Boxell.

The era of text mining

Data is abundant, but so are the challenges of sifting through it. Navigating the digital realm of data can often feel like drinking from a fire hose. In the face of information saturation, Text and Data Mining (TDM) methods like Natural Language Processing (NLP) have become indispensable tools, allowing researchers to extract valuable insights from unstructured text data and providing new avenues for academic exploration.

Despite the promise of TDM, barriers remain, particularly the costs and complexities of gaining access to premium content like current newspapers.

Why newspapers matter in the digital age

The vast, often untapped, ocean of newspaper archives presents fertile ground for academic research. Newspapers provide not just depth but also a breadth of perspectives, capturing the zeitgeist of their times. It’s no surprise then that newspapers have become prime resources for text mining efforts. With text mining, researchers can quantify and analyze data from large collections of newspapers that until recently seemed impossible to effectively sift through.
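As a rough illustration of what "quantifying" a newspaper collection can mean, the short Python sketch below counts how often chosen words appear in each decade of a tiny corpus. It is only a minimal sketch under stated assumptions: the sample articles, the function names, and the decade grouping are all hypothetical, and it does not represent TDM Studio or the method used by any project described here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample records (publication year, article text).
# A real study would load thousands of articles from a licensed corpus.
ARTICLES = [
    (1868, "The railroad reached the county seat. The railroad office opened Monday."),
    (1869, "Voters met before the election to discuss the railroad."),
    (1894, "The election results were printed on the front page."),
    (1901, "Election officials certified the election count."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_decade(articles):
    """Return a dict mapping each decade to a Counter of word frequencies."""
    counts = defaultdict(Counter)
    for year, text in articles:
        decade = (year // 10) * 10  # e.g. 1868 -> 1860
        counts[decade].update(tokenize(text))
    return dict(counts)


counts = term_counts_by_decade(ARTICLES)
for decade in sorted(counts):
    # A Counter returns 0 for words that never appear in that decade.
    print(decade, "railroad:", counts[decade]["railroad"],
          "election:", counts[decade]["election"])
```

Comparing counts like these across decades is one simple way to see how coverage of a subject rises or falls over time; real projects typically add steps such as OCR cleanup and normalizing by the number of articles per period.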

While Professor Cordell's project focuses on historical and social issues, researchers from any academic discipline can text mine newspapers to uncover new insights.

In a timely response to the regulatory actions taken during the COVID-19 pandemic, researcher Zhoudan Xie of George Washington University partnered with ProQuest, part of Clarivate, to analyze public opinion in U.S. news articles. Using TDM Studio, ProQuest's specialized text and data mining platform, and ProQuest’s database of local, national and international news outlets, Xie scrutinized over a million articles published between January and April 2020. The research identified 16 major topics that garnered media attention, from "quarantine and reopening" to "oil price," giving researchers and policymakers a panoramic view of reporting on public sentiment.

Academic librarians as TDM advocates

In light of the growing demand for text and data mining, academic librarians have a pivotal role to play as facilitators for this powerful tool. At the University of Toronto, Data Librarian Kara Handren's experiences illuminate the challenges in accessing and using TDM resources.

“A researcher may have secured funding and a well-thought-out research process, including a trained model, but would be held up for months due to the time required to negotiate and purchase access to a particular corpus. These roadblocks were particularly visible in the case of newspaper articles, which were frequently requested,” said Handren.

With the adoption of TDM Studio, Handren noted the university has managed to overcome these dreaded licensing bottlenecks. By leveraging a tool that doesn't require additional licensing negotiations, the library has streamlined the entire process, allowing researchers to focus on what matters the most: research.

“Patrons seeking to mine content across a large breadth of newspaper articles... can be directed to TDM Studio without requiring any additional licensing negotiations,” Handren explained.

The adoption of TDM Studio has also had a cascading effect on the quality of research across the University of Toronto. From undergraduates examining media's representation of robotics to Ph.D. students scrutinizing regional newspapers on government and gang activity, the versatility of applications demonstrates the cross-disciplinary nature of TDM Studio.

“TDM Studio removed barriers and encourages researchers in non-traditional areas to explore text mining avenues within their fields of research,” Handren noted.

Advancing research with TDM

By utilizing TDM tools like TDM Studio, researchers can navigate through the dense forest of information, removing barriers that once impeded academic inquiry. From groundbreaking research on racial terror in the U.S. to real-time analyses of public sentiment during the COVID-19 pandemic, text mining newspapers is revolutionizing the way we interact with and understand the evolving narratives of society.

While the technology and tools for text mining may be intricate, the role of academic librarians in this ecosystem cannot be overstated. They play a critical role in democratizing access to advanced research methodologies. In removing licensing bottlenecks and providing expert guidance, they not only make TDM more accessible but also open new doors for cross-disciplinary research.

Researchers who embrace TDM as a vital tool for academic inquiry are not just adopting a technology; they’re becoming part of a broader, more inclusive narrative that values data-driven insights and interdisciplinary collaboration.

By seizing the opportunities presented by TDM, both researchers and academic librarians are not just keeping pace with the times but are actively helping to shape the future of academic research.

Request a free 30-day trial of ProQuest TDM Studio and take your research to the next level.

Matt Tobey

Matt Tobey is an award-winning writer and creative director whose work has been published by The New York Times, NPR, and The Onion.
