Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

They also revealed poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the lowest number of errors. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysing financial news can help predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.

Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input. At the output, each document is assessed for its degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’

However, many words can appear in texts covering various topics. For example, the word ‘work’ is often used in texts about industrial production or the labour market. However, when used in the phrase ‘scientific work,’ it categorises the text as pertaining to ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.
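The probability matrices mentioned above are often written in the following form in the topic modelling literature. This is standard textbook notation rather than a formula taken from the paper: the symbols $\phi_{wt}$, $\theta_{td}$ and $T$ are introduced here for illustration. The probability of meeting word $w$ in document $d$ is treated as a mixture over $T$ topics:

```latex
p(w \mid d) \;=\; \sum_{t=1}^{T} p(w \mid t)\, p(t \mid d) \;=\; \sum_{t=1}^{T} \phi_{wt}\, \theta_{td}
```

Here $\phi_{wt}$ is the probability of word $w$ within topic $t$ (the word-topic matrix), and $\theta_{td}$ is the share of topic $t$ in document $d$ (the topic-document matrix). A word such as ‘work’ can have a noticeable probability in several topics at once, and the document's other words determine which topic dominates.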

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts. 

Any phrase or text, such as this news item, can be represented as a sequence of numbers, that is, a vector in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that measuring distances and detecting similarities between vectors is straightforward, which makes it possible to compare two or more texts. If the embeddings describing the texts are sufficiently similar, the texts likely belong to the same category or cluster, that is, a specific topic.
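One common way to compare two embeddings is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are invented for the example (real embeddings usually have hundreds of dimensions), and the article does not say which similarity measure the studied models use.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for illustration only.
science_text = [0.9, 0.1, 0.2]
research_text = [0.8, 0.2, 0.1]
sports_text = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(science_text, research_text)
sim_unrelated = cosine_similarity(science_text, sports_text)
print(f"science vs research: {sim_related:.3f}")
print(f"science vs sports:   {sim_unrelated:.3f}")
```

Texts whose vectors point in nearly the same direction score close to 1 and are candidates for the same topic, while texts on unrelated subjects score much lower.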

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models (ETM, GLDAW, GSM, WTM-GMM and W-LDA), which are based on different mathematical principles:

  • ETM is a model proposed by the prominent mathematician David M. Blei, who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
  • Two models—GSM and WTM-GMM—are neural topic models.
  • W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
  • GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring a certain amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model.

The researchers evaluated the models on stability (number of errors), coherence (establishing connections), and Rényi entropy (measuring the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These datasets were chosen because all their texts had already been assigned tags, which made it possible to evaluate how well the algorithms identified the topics.
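For readers unfamiliar with Rényi entropy, the sketch below computes the textbook definition for a discrete probability distribution, $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^{\alpha}$. It is not the authors' procedure: the article does not describe how the paper derives the distribution from a topic model, and the function name `renyi_entropy` and the sample distributions are assumptions for illustration.

```python
import math

def renyi_entropy(probs, order=2.0):
    """Textbook Renyi entropy of a discrete distribution, in nats (order > 0, order != 1)."""
    if order <= 0 or order == 1:
        raise ValueError("order must be positive and different from 1")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")
    power_sum = sum(p ** order for p in probs if p > 0)
    return math.log(power_sum) / (1.0 - order)

# Hypothetical distributions: one maximally "chaotic", one concentrated.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]

h_uniform = renyi_entropy(uniform)
h_peaked = renyi_entropy(peaked)
print(f"uniform: {h_uniform:.3f}, peaked: {h_peaked:.3f}")
```

The uniform distribution has the highest entropy (the most chaos), while the concentrated one has lower entropy, which matches the intuition in the quote above that more chaos means less information.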

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts.

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.
