A*STAR Tunes Targeted AI Model to Surpass Global Models in Multilingual Regional News Summaries

Revolutionizing Local News Processing with Specialized AI in Singapore

In the fast-paced world of digital media, keeping up with regional news across multiple languages is a significant challenge for journalists, researchers, and educators alike. Singapore's Agency for Science, Technology and Research (A*STAR), through its Institute for Infocomm Research (I2R), has developed CLUST-McMs, a targeted artificial intelligence model for summarizing multilingual regional news. The work shows that smaller, fine-tuned models can outperform much larger general-purpose models such as GPT-4 at capturing local nuance and preserving factual accuracy.

The development stems from recognizing limitations in general-purpose large language models, which often prioritize repetitive information over subtle cultural details or timely local events. For Singapore, a multilingual hub where English, Mandarin, Malay, and Tamil coexist, such technology holds immense promise for higher education institutions training the next generation of media professionals and researchers.

The Core Challenges in Multilingual News Summarization

Summarizing news from Southeast Asia involves navigating diverse languages, colloquial varieties such as Singlish, and context-specific events such as regional elections or policy changes. Global models frequently hallucinate facts, mix up timelines, or overlook entities unique to local contexts, leading to biased or incomplete overviews. This is particularly problematic in academic settings, where precise analysis is crucial for journalism students at institutions like the National University of Singapore (NUS) and Nanyang Technological University (NTU).

Researchers at A*STAR I2R identified these pain points through extensive testing on real-world datasets, highlighting the need for models that act like knowledgeable local editors—discerning key facts, filtering noise, and preserving cultural fidelity.

Introducing CLUST-McMs: A Two-Stage AI Pipeline

CLUST-McMs, short for CLUST-Multi-lingual, Cross-lingual, and Multi-document Summarization, represents a sophisticated two-stage pipeline tailored for event-centric news clustering and summarization. Developed by Longyin Zhang, Bowei Zou, and Ai Ti Aw from A*STAR's Aural and Language Intelligence (ALI) department, the model integrates dynamic clustering with data sharpening techniques.

The first stage focuses on grouping articles by specific events rather than vague topics. For instance, articles on a new Singapore law would cluster together based on triggers like 'passage of bill' or associated who-what-when details. The second stage refines inputs by balancing information density and diversity, ensuring summaries are concise yet comprehensive.

Figure: Diagram of the CLUST-McMs AI pipeline for news summarization

Event-Centric Clustering: Precision in Grouping News

Traditional topic modeling falls short for news, because broad categories like 'politics' dilute focus. CLUST-McMs employs a dynamic clustering algorithm (DyClu) that iteratively adjusts thresholds to form tight event clusters. It uses multilingual Sentence-BERT embeddings enhanced with language model-generated main event (ME) descriptions, including attributes like participants, locations, and outcomes.
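To make the idea of iteratively adjusted thresholds concrete, here is a minimal Python sketch. It is not the DyClu algorithm itself, whose details the article does not give. It assumes a greedy centroid-based grouping over cosine similarity and a hypothetical tightness target, raising the threshold until every cluster is tight enough. The function names, the starting threshold, the step size, and the tightness target are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def greedy_cluster(embeddings, threshold):
    """Assign each vector to its most similar centroid, or open a new cluster."""
    clusters = []  # each cluster is a list of article indices
    for i, vec in enumerate(embeddings):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, centroid([embeddings[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

def tightness(embeddings, members):
    """Lowest similarity between any member and its cluster centroid."""
    c = centroid([embeddings[j] for j in members])
    return min(cosine(embeddings[j], c) for j in members)

def dynamic_cluster(embeddings, start=0.5, step=0.05,
                    min_tightness=0.9, max_threshold=0.95):
    """Raise the threshold until all clusters meet the (assumed) tightness target."""
    threshold = start
    clusters = greedy_cluster(embeddings, threshold)
    while threshold < max_threshold and any(
        tightness(embeddings, c) < min_tightness for c in clusters
    ):
        threshold += step
        clusters = greedy_cluster(embeddings, threshold)
    return clusters, threshold
```

In practice the input vectors would come from a sentence-embedding model applied to each article (or to its main event description); the toy two-dimensional vectors used to try this sketch stand in for those embeddings.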

On the SEASUMM-v1 dataset—curated from Southeast Asian sources in English, Chinese, Malay, and Indonesian—the approach achieved a Normalized Mutual Information (NMI) score of 93.68%, far surpassing baselines. This precision aids university researchers analyzing regional trends, enabling deeper dives into Singapore-Malaysia relations or ASEAN summits.
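For readers unfamiliar with the metric, the following Python sketch shows one common way to compute Normalized Mutual Information between gold event labels and predicted cluster labels. It assumes normalization by the arithmetic mean of the two label entropies; the article does not say which normalization the study used, so treat that choice as an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(true_labels, pred_labels):
    """Mutual information between two labelings of the same items."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
    return mi

def nmi(true_labels, pred_labels):
    """NMI normalized by the mean of the two entropies (an assumed convention)."""
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both labelings put everything in a single cluster
    return mutual_information(true_labels, pred_labels) / ((h_true + h_pred) / 2)
```

A score of 1.0 means the predicted clusters match the gold events exactly up to renaming, while a score near 0 means the clustering carries no information about the events. On this scale the reported 93.68% corresponds to a value of about 0.94.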

Data Sharpening and Localization: Elevating Summary Quality

Data sharpening optimizes the input by sampling sentences proportionally from clusters so as to maximize a score that combines normalized information volume and entropy. This mitigates position bias in language models, where sentences that appear early in the input dominate the output. A localization module fine-tunes models via temporal question-answering (TQA) tasks on local news, ensuring citations stick to source facts and timelines.
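The article does not give the exact scoring formula, so the Python sketch below is only one plausible reading of it. It assumes that "information volume" is the fraction of a token budget that is used, that "entropy" is the normalized Shannon entropy of how the selected sentences are spread across source articles, and that the two are mixed with a hypothetical weight alpha. Sentences are picked at evenly spaced positions rather than from the lead, as a simple stand-in for countering position bias.

```python
import math

def evenly_spaced(sentences, k):
    """Pick k sentences spread across the document instead of only the lead."""
    if k <= 0:
        return []
    if k >= len(sentences):
        return list(sentences)
    step = len(sentences) / k
    return [sentences[int(i * step)] for i in range(k)]

def normalized_entropy(counts):
    """Shannon entropy of the counts, scaled to [0, 1] by log(number of sources)."""
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(counts))

def sharpen(documents, token_budget, ratios=(0.2, 0.4, 0.6, 0.8, 1.0), alpha=0.5):
    """Try several sampling ratios and keep the best-scoring one within budget.

    Returns (score, ratio, selected_sentences_per_document), or None if no
    ratio fits the budget. The scoring rule is an assumption for illustration.
    """
    best = None
    for ratio in ratios:
        picked = [evenly_spaced(doc, max(1, round(ratio * len(doc))))
                  for doc in documents]
        tokens = sum(len(s.split()) for doc in picked for s in doc)
        if tokens > token_budget:
            continue
        volume = tokens / token_budget                        # normalized volume
        diversity = normalized_entropy([len(doc) for doc in picked])
        score = alpha * volume + (1 - alpha) * diversity
        if best is None or score > best[0]:
            best = (score, ratio, picked)
    return best
```

The design point this illustrates is the trade-off itself: taking more sentences raises the volume term, but drawing them unevenly from the articles in a cluster lowers the entropy term, so the best ratio is not always the largest one that fits.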

The pipeline fine-tunes SeaLLM-v2 and Qwen2.5-Instruct (both 7B parameters), using LoRA (low-rank adaptation) for efficiency. Results show marked improvements in event coverage (F1: 58.97%) and entity faithfulness (57.29% accuracy), significantly outperforming GPT-4. For details, see the full study.
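To show why LoRA is the efficient choice here, the Python sketch below works through the standard low-rank update, in which a frozen weight matrix W is augmented by a trainable product B·A scaled by alpha/rank. The layer size (4096 by 4096) and rank (8) are hypothetical values chosen for illustration; the article does not report the configuration used in the study.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, rank):
    """A is rank x d_in and B is d_out x rank, so rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Weights updated when fine-tuning the whole d_out x d_in matrix."""
    return d_in * d_out

# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
trainable = lora_trainable_params(4096, 4096, 8)   # 65,536 weights
frozen = full_params(4096, 4096)                   # 16,777,216 weights
fraction = trainable / frozen                      # 0.00390625, about 0.39%
```

Under these assumed sizes, the adapter trains well under one percent of the weights in that layer, which is what makes fine-tuning 7B-parameter models practical on modest hardware.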

Superior Performance on Southeast Asian Benchmarks

Tested on SEASUMM-v1 (9,075 articles, 152 clusters) and GLOBESUMM, CLUST-McMs delivered a ROUGE-L score of 36.42 on local data, edging out GPT-4, while its clearest gains came in fidelity metrics. Custom evaluations such as Eve-Cov (event coverage) and Ent-Faith (entity accuracy) underscore its edge in long-tail localization, that is, handling rare local events that global models mishandle.

  • ROUGE-1: 55.98 (vs. GPT-4: 56.45)
  • ROUGE-2: 30.88 (vs. GPT-4: 30.13)
  • Event Coverage F1: 58.97 (vs. GPT-4: 23.45)

These gains stem from targeted training on 400K TQA instances from Singapore and SEA news.
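The ROUGE figures listed above measure lexical overlap between a generated summary and a reference summary. As a rough illustration, the Python sketch below computes ROUGE-N and ROUGE-L F-scores on lowercase whitespace tokens. It is a simplified version: standard ROUGE toolkits add options such as stemming and their own tokenization, and published scores are usually shown multiplied by 100, so numbers from this sketch will not exactly match theirs.

```python
from collections import Counter

def f_score(overlap, cand_len, ref_len):
    """Harmonic mean of precision and recall for an overlap count."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N F-score based on clipped n-gram overlap."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return f_score(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L F-score based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f_score(lcs_length(cand, ref), len(cand), len(ref))
```

Because these scores only count shared words, two systems can be close on ROUGE while differing widely on whether the events and entities in the summary are correct, which is why the study also reports event coverage and entity faithfulness.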

Integration with Singapore's National AI Ecosystem

This work aligns with Singapore's National AI Strategy 2.0 and the Multimodal Large Language Model Programme, which includes MERaLiON, a SEA-tuned LLM effort led by Ai Ti Aw. MERaLiON supports speech summarization and code-switching, complementing CLUST-McMs for audiovisual news. A*STAR I2R's efforts bolster the Smart Nation initiative, enhancing media literacy in higher education. More detail is available in A*STAR's research highlights.

Collaborations Between A*STAR and Singapore Universities

Longyin Zhang has guided students from NUS and NTU in data analysis, fostering talent in NLP. A*STAR I2R partners with NUS on analytics projects, such as collaborations with DBS Bank, and with NTU on hybrid AI programs with CNRS. These ties translate research into curricula, equipping journalism and computing students with tools for local news AI. For instance, the NUS-hosted AI Singapore programme pursues similar multilingual capabilities.

Figure: A*STAR collaboration with NUS and NTU on AI research

Implications for Higher Education and Journalism Training

In Singapore's universities, CLUST-McMs enables advanced courses in computational journalism, where students analyze SEA news clusters for bias detection or trend forecasting. It supports AI literacy goals under EdTech Masterplan 2030, training future professionals to leverage localized models. Faculty can use it for research on media ethics, ensuring summaries respect cultural sensitivities in multilingual classrooms.

Broader Impacts and Challenges Ahead

Beyond academia, the model aids newsrooms in rapid synthesis, combating information overload amid Singapore's vibrant media landscape. Challenges include scaling to real-time processing and ethical deployment to avoid amplifying biases. A*STAR's focus on faithfulness addresses this, promoting trustworthy AI in education.

Future Outlook: Multimodal and Beyond

Future expansions target multimodal inputs like video news, building on MERaLiON's speech capabilities. As Singapore invests S$1B in AI research (2026), expect deeper university-A*STAR synergies, positioning local talent at the forefront of regional AI innovation. Longyin Zhang notes: "The AI community needs to shift from scaling to cultural awareness."

About the author

Jarrod Kanizay

Academic Jobs In House Author

Frequently Asked Questions

What is CLUST-McMs?

CLUST-McMs is A*STAR's two-stage AI pipeline for event-centric multilingual news clustering and summarization, fine-tuned for Southeast Asian contexts.

How does CLUST-McMs outperform GPT-4?

It achieves higher event coverage (F1 58.97%) and entity faithfulness (57.29%) through data sharpening and localization, as tested on the SEASUMM-v1 dataset.

What languages does the model support?

Primarily English, Chinese, Malay, and Indonesian from SEA news sources, with cross-lingual summarization into English.

Who developed CLUST-McMs?

Longyin Zhang, Bowei Zou, and Ai Ti Aw at A*STAR I2R's ALI department.

How does it connect to MERaLiON?

It aligns with Singapore's national LLM programme, under which Ai Ti Aw leads MERaLiON, and complements that programme's multimodal SEA AI capabilities.

What are the implications for Singapore universities?

It supports NLP courses at NUS and NTU, journalism training, and AI literacy goals under EdTech Masterplan 2030.

What is the SEASUMM-v1 dataset?

A first cross-lingual benchmark comprising 9,075 SEA articles in four languages, organized into 152 clusters for evaluation.

What future expansions are planned?

Multimodal support for audiovisual news, and real-time processing.

What is A*STAR's role in Singapore AI?

It leads research through I2R and collaborates with universities on Smart Nation initiatives.

What are the benefits for local journalism education?

It enables precise analysis of regional events and trains students in the ethical use of AI for media.

What are the key metrics of success?

A ROUGE-L score of 36.42, and superior Eve-Cov and Ent-Faith results compared with GPT-4.