Hanh Tran

Hello, I'm Hanh Tran

I am an NLP Engineer at Arkhn.

I received my Ph.D. through the cotutelle program between La Rochelle University, France, and the Jožef Stefan Institute, Slovenia, supervised by Prof. Antoine Doucet and Asst. Prof. Senja Pollak. Previously, I worked as a Data Scientist at Samsung SDSV.

My research interests are natural language processing, information extraction, low-resource languages, generative AI, and large-scale language models.

NEWS

PUBLICATIONS

For a complete list of publications, please refer to my Google Scholar page.

SEKE: Specialised Experts for Keyword Extraction

Matej Martinc, Hanh Thi Hong Tran, Senja Pollak, Boshko Koloski

Findings of the Association for Computational Linguistics: EMNLP 2025

We propose a novel supervised keyword extraction approach based on the mixture of experts (MoE) technique. SEKE uses DeBERTa as the backbone model and builds on the MoE framework, in which experts attend to individual tokens, by integrating it with a bidirectional long short-term memory (BiLSTM) network, allowing successful extraction even on smaller corpora, where specialisation is harder due to a lack of training data. The MoE framework also provides insight into the inner workings of individual experts, enhancing the explainability of the approach.
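The gating idea behind the MoE framework can be illustrated with a minimal sketch: a gating network assigns each token a soft distribution over experts, and the token's keyword score is the gate-weighted mix of per-expert scores. This is a toy illustration with made-up weights, not the SEKE implementation (which uses a DeBERTa backbone and a BiLSTM); all names and shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_token_scores(token_embs, gate_w, expert_ws):
    """Score each token for keyword-ness with a mixture of experts.

    token_embs: (n_tokens, dim) contextual token embeddings.
    gate_w:     (dim, n_experts) gating-network weights.
    expert_ws:  (n_experts, dim) one linear scorer per expert.
    """
    gates = softmax(token_embs @ gate_w)       # (n_tokens, n_experts), rows sum to 1
    expert_scores = token_embs @ expert_ws.T   # each expert scores every token
    # Final score: gate-weighted combination of expert opinions per token.
    return (gates * expert_scores).sum(axis=1), gates

dim, n_experts, n_tokens = 8, 3, 5
embs = rng.normal(size=(n_tokens, dim))
scores, gates = moe_token_scores(embs,
                                 rng.normal(size=(dim, n_experts)),
                                 rng.normal(size=(n_experts, dim)))
```

Inspecting `gates` is what gives the explainability mentioned above: it shows which expert each token was routed to.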

LlamATE: Automated terminology extraction using large-scale generative language models

Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Antoine Doucet, Senja Pollak

Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication, 2025

We present LlamATE, a framework to verify the impact of domain specificity on ATE when using in-context learning prompts with open-source LLM-based chat models, namely Llama-2-Chat. We evaluate how well the LLM-based chat models perform with different levels of domain-related information on the ACTER datasets in the dominant language of NLP research, i.e., with in-domain and cross-domain demonstrations, with and without domain enunciation.
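The in-context learning setup can be sketched as assembling a prompt from demonstration pairs, optionally enunciating the domain in the instruction. This is a simplified illustration, not the LlamATE prompt template; the function name, wording, and example sentences are all hypothetical.

```python
def build_ate_prompt(demos, target_sentence, domain=None):
    """Assemble an in-context learning prompt for term extraction.

    demos:  list of (sentence, [terms]) demonstration pairs.
    domain: optional domain name; when given, it is enunciated
            in the instruction (cf. "with domain enunciation").
    """
    if domain:
        instruction = f"Extract all {domain} terms from the sentence."
    else:
        instruction = "Extract all domain-specific terms from the sentence."
    parts = [instruction]
    for sent, terms in demos:
        parts.append(f"Sentence: {sent}\nTerms: {', '.join(terms)}")
    # Leave the final "Terms:" slot empty for the model to complete.
    parts.append(f"Sentence: {target_sentence}\nTerms:")
    return "\n\n".join(parts)

prompt = build_ate_prompt(
    [("Heart failure is treated with beta blockers.",
      ["heart failure", "beta blockers"])],
    "The patient developed atrial fibrillation.",
    domain="medical")
```

Swapping in-domain demonstrations for cross-domain ones, or dropping `domain`, yields the prompt conditions compared in the paper.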

LIAS: Layout Information-Based Article Separation in Historical Newspapers

Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet

International Conference on Theory and Practice of Digital Libraries (TPDL 2024)

We propose LIAS, a method based on layout information, and conduct experiments on historical newspapers. The method first identifies the separator lines of the newspaper, analyzes the layout information to reconstruct the information flow of the document, performs segmentation based on the semantic relationship of each text block in the information flow, and ultimately achieves article separation.

LIT: Label-Informed Transformers on Token-Based Classification

Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet

International Conference on Theory and Practice of Digital Libraries (TPDL 2024)

We propose LIT, an end-to-end pipeline architecture that integrates the transformer's encoder-decoder mechanism with additional label semantics for token classification tasks.

Leveraging Open Large Language Models for Historical Named Entity Recognition

Carlos-Emiliano González-Gallardo, Hanh Thi Hong Tran, Ahmed Hamdi, Antoine Doucet

International Conference on Theory and Practice of Digital Libraries (TPDL 2024)

(Best Paper Award)

We develop methods to detect semantically ambiguous and complex entities in the short, low-context settings of Complex NER using three different prompt-based approaches.

Is Prompting What Term Extraction Needs?

Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Julien Delaunay, Antoine Doucet, Senja Pollak

International Conference on Text, Speech, and Dialogue (TSD 2024)

We evaluate the applicability of open- and closed-source LLMs to the ATE task against two benchmarks that treat ATE as a sequence-labeling task (iobATE) and a seq2seq task (templATE), respectively.

Global-SEG: Text Semantic Segmentation Based on Global Semantic Pair Relations

Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet

International Conference on Document Analysis and Recognition (ICDAR 2024)

We propose Global-SEG, which utilizes global semantic pair relations from both token- and sentence-level language models for text semantic segmentation.