METHOD OF MULTIMODAL WORD EMBEDDINGS FUSION IN A LOW-RESOURCE SETTING

Roman SHAPTALA; Gennadiy KYSELOV

doi:10.31891/2219-9365-2023-73-1-23

Authors

Roman SHAPTALA, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», https://orcid.org/0000-0002-4367-5775
Gennadiy KYSELOV, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», https://orcid.org/0000-0003-2682-3593

DOI:

https://doi.org/10.31891/2219-9365-2023-73-1-23

Keywords:

machine learning, natural language processing, mathematical modelling, neural networks, word embeddings, string distance

Abstract

This paper presents a method for fusing multimodal word embeddings in a low-resource setting. Unlike other word embedding fusion methods, it takes into account the limitations of a low-resource environment and allows word embeddings from different sources, such as documents and dictionaries, to be combined. The method relies on string distance calculations instead of complete syntactic and morphological models, which are often impossible to build for low-resource languages. It can be used at intermediate stages of building natural language processing and machine learning systems that solve practical problems such as machine translation or document classification.
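As an illustration of how string distance can stand in for morphological analysis when combining sources, the sketch below matches an inflected word from a document against its closest dictionary headword and joins the two vectors. The abstract does not specify the fusion operator or the matching rule, so the concatenation, the zero-padding fallback, the `max_distance` threshold, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Optional

Vector = List[float]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nearest_entry(word: str, vocabulary, max_distance: int = 2) -> Optional[str]:
    """Return the closest vocabulary entry within max_distance, else None.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    best, best_dist = None, max_distance + 1
    for candidate in vocabulary:
        dist = levenshtein(word, candidate)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


def fuse(word: str, doc_emb: Dict[str, Vector],
         dict_emb: Dict[str, Vector], dict_dim: int) -> Vector:
    """Concatenate the document vector with the nearest dictionary vector.

    Concatenation is an assumed fusion operator; if no dictionary entry is
    close enough, the dictionary part is filled with zeros.
    """
    match = nearest_entry(word, dict_emb)
    dict_vec = dict_emb[match] if match is not None else [0.0] * dict_dim
    return doc_emb[word] + dict_vec


# Toy data: an inflected form in the documents, a headword in the dictionary.
doc_emb = {"петиції": [0.1, 0.2], "трамвай": [0.3, 0.4]}
dict_emb = {"петиція": [0.5, 0.6, 0.7], "місто": [0.8, 0.9, 1.0]}

fused = fuse("петиції", doc_emb, dict_emb, dict_dim=3)
unmatched = fuse("трамвай", doc_emb, dict_emb, dict_dim=3)
```

In this toy setup the inflected form is one substitution away from its headword, so the two vectors are joined without any lemmatizer. A word with no sufficiently close dictionary entry keeps only its document vector, padded with zeros.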

Additionally, we present an analysis of various methods of multimodal word embedding fusion in a low-resource setting. The paper describes the advantages, disadvantages, and limitations of each approach for the task of building a unified vector representation of text combined with data from additional sources. As an example of a low-resource environment, our study demonstrates the efficacy of the described methods on the task of classifying Kyiv City petitions written in Ukrainian.

The abundance of string distance functions makes choosing one difficult. We propose a set of recommendations for low-resource settings, as well as a methodology for selecting the best function for a given practical task. The string distance functions analyzed include Levenshtein distance, Jaccard similarity, block distance, Hamming distance, and Dice’s coefficient. Our results demonstrate that Levenshtein distance is more effective than the others in this context, providing insights into which methods are best suited for low-resource analysis of string data. These findings have practical implications for various fields, including natural language processing, text mining, and information retrieval.
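For readers unfamiliar with the five functions named above, the following minimal sketch gives one common textbook definition of each. The abstract does not say which units the set-based measures operate on, so computing Jaccard similarity, Dice's coefficient, and block distance over character bigrams is an assumption of this sketch, as are the function names.

```python
from collections import Counter


def bigrams(s: str) -> Counter:
    """Multiset of character bigrams (an assumed tokenization)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def jaccard_similarity(a: str, b: str) -> float:
    """Shared bigrams divided by all distinct bigrams."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dice_coefficient(a: str, b: str) -> float:
    """Twice the shared bigrams divided by the sum of both set sizes."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def block_distance(a: str, b: str) -> int:
    """L1 (city-block) distance between bigram count vectors."""
    ca, cb = bigrams(a), bigrams(b)
    return sum(abs(ca[k] - cb[k]) for k in set(ca) | set(cb))


pair = ("night", "nacht")
scores = {
    "levenshtein": levenshtein(*pair),
    "hamming": hamming(*pair),
    "jaccard": jaccard_similarity(*pair),
    "dice": dice_coefficient(*pair),
    "block": block_distance(*pair),
}
```

Levenshtein, Hamming, and block distance are distances (lower means more similar), while Jaccard and Dice are similarities (higher means more similar), so any comparison across them has to account for the direction of each score. Hamming distance is also undefined for strings of different lengths, which limits its use on inflected word forms.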
