Generation of World’s Largest Multi-Modal Multi-Lingual Datasets and their Analysis for Automatic Summarization and Keyword Generation

Verma, Yash (2022) Generation of World’s Largest Multi-Modal Multi-Lingual Datasets and their Analysis for Automatic Summarization and Keyword Generation. Masters thesis, Indian Institute of Science Education and Research Kolkata.

Text (MS dissertation of Yash Verma (16MS154))
16MS154_Thesis_file.pdf - Submitted Version
Restricted to Repository staff only

Significant advances in techniques such as encoder-decoder models now make it possible to represent information across multiple modalities. Many downstream tasks in information retrieval and natural language processing stand to benefit from this; however, improving multi-modal approaches and evaluating their performance requires large-scale multi-modal data with adequate diversity. Multi-lingual modelling is used for a variety of tasks, including multi-modal summarization, text generation, translation, and keyword extraction. Summarization is essential for condensing a long document into a compact form. Keyword extraction is another language processing task, essential for indicating what a document is about; many downstream tasks, including clustering, recommendation, search, and classification, depend on it. Keyword extraction approaches require an exhaustive dataset for development and evaluation; unfortunately, the community currently lacks large-scale multi-lingual datasets. In this work, we discuss the generation of the world’s largest multi-modal multi-lingual summarization and keyword extraction dataset, consisting of a million instances for summarization and over half a million instances for keyword extraction, spanning 20 languages. We run experiments with several baselines to assess the dataset’s quality. Given its scale, its diversity in languages, topics, and historical periods, and its focus on under-studied languages, we believe the proposed dataset will help advance the field of automatic keyword extraction and text summarization.

Item Type: Thesis (Masters)
Additional Information: Supervisor: Dr. Sriparna Saha (IIT Patna); Co-Supervisor: Dr. Dwaipayan Roy
Uncontrolled Keywords: Automatic Summarization; Keyword Generation; Multi-modal Multi-lingual Keyword Extraction; Multi-Lingual Multi-Modal Summarization; Natural Language Processing
Subjects: Q Science > QA Mathematics
Divisions: Department of Mathematics and Statistics
Depositing User: IISER Kolkata Librarian
Date Deposited: 31 Aug 2022 11:20
Last Modified: 31 Aug 2022 11:20
