Hi, I'm Vilém/Vilda, a PhD student at ETH Zürich, Switzerland, supervised by Mrinmaya Sachan and Menna El-Assady. I have a passion for natural language processing research, especially:

Serious publications

RELIC: Investigating Large Language Model Responses using Self-Consistency (CHI 2024)
paper

Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, Mennatallah El-Assady
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. To tackle this challenge, we propose an interactive system that helps users obtain insights into the reliability of the generated text. Our approach is based on the idea that the self-consistency of multiple samples generated by the same LLM relates to its confidence in individual claims in the generated texts. Using this idea, we design RELIC, an interactive system that enables users to investigate and verify semantic-level variations in multiple long-form responses. This allows users to recognize potentially inaccurate information in the generated text and make necessary corrections. From a user study with ten participants, we demonstrate that our approach helps users better verify the reliability of the generated text. We further summarize the design implications and lessons learned from this research for inspiring future studies on reliable human-LLM interactions.
WMT 2023 Shared Task on Machine Translation with Terminologies (EMNLP 2023)
paper, link, dataset

Kirill Semenov, Vilém Zouhar, Tom Kocmi, Dongdong Zhang, Wangchunshu Zhou, Yuchen Eleanor Jiang
The WMT 2023 Terminology Shared Task investigates progress in machine translation of texts with specialized vocabulary. The participants were given the source text and segment-level terminology dictionaries for three language pairs: Chinese→English, English→Czech, and German→English. We evaluate 21 submissions from 7 teams on two main criteria: general translation quality and the effectiveness of translating specialized terminology. Systems took varied approaches: incorporating terminology at inference time, or weakly supervised training that uses terminology access. While incorporating terminology dictionaries leads to improvement in the translation quality, incorporating an equal amount of information from the reference leads to similar results. This challenges the position of terminologies being the crux of meaning in translation, though it can also be explained by inadequate metrics that are not terminology-centric.
A Diachronic Perspective on User Trust in AI under Uncertainty (EMNLP 2023)
paper, demo, video, code

Shehzaad Dhuliawala,= Vilém Zouhar,= Mennatallah El-Assady, Mrinmaya Sachan
In human-AI collaboration, users typically form a mental model of the AI system, which captures the user’s beliefs about when the system performs well and when it does not. The construction of this mental model is guided by both the system’s veracity as well as the system output presented to the user e.g., the system’s confidence and an explanation for the prediction. However, modern NLP systems are seldom calibrated and are often confidently incorrect about their predictions, which violates users’ mental model and erodes their trust. In this work, we design a study where users bet on the correctness of an NLP system, and use it to study the evolution of user trust as a response to these trust-eroding events and how the user trust is rebuilt as a function of time after these events. We find that even a few highly inaccurate confidence estimation instances are enough to damage users’ trust in the system and performance, which does not easily recover over time. We further find that users are more forgiving to the NLP system if it is unconfidently correct rather than confidently incorrect, even though, from a game-theoretic perspective, their payoff is equivalent. Finally, we find that each user can entertain multiple mental models of the system based on the type of the question. These results highlight the importance of confidence calibration in developing user-centered NLP applications to avoid damaging user trust and compromising the collaboration performance.
Enhancing Textbooks with Visuals from the Web for Improved Learning (EMNLP 2023)
paper, video, code

Janvijay Singh, Vilém Zouhar, Mrinmaya Sachan
Textbooks are one of the main mediums for delivering high-quality education to students. In particular, explanatory and illustrative visuals play a key role in retention, comprehension and general transfer of knowledge. However, many textbooks lack these interesting visuals to support student learning. In this paper, we investigate the effectiveness of vision-language models to automatically enhance textbooks with images from the web. We collect a dataset of e-textbooks in the math, science, social science and business domains. We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks, which we frame as a matching optimization problem. Through a crowd-sourced evaluation, we verify that (1) while the original textbook images are rated higher, automatically assigned ones are not far behind, and (2) the precise formulation of the optimization problem matters. We release the dataset of textbooks with an associated image bank to inspire further research in this intersectional area of computer vision and NLP for education.
Re-visiting Automated Topic Model Evaluation with Large Language Models (EMNLP 2023)
paper, video, code

Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan, Elliott Ash
Topic models help us make sense of large text collections. Automatically evaluating their output and determining the optimal number of topics are both longstanding challenges, with no effective automated solutions to date. This paper proposes using large language models (LLMs) for these tasks. We find that LLMs appropriately assess the resulting topics, correlating more strongly with human judgments than existing automated metrics. However, the setup of the evaluation task is crucial: LLMs perform better on coherence ratings of word sets than on intrusion detection. We find that LLMs can also guide us towards a reasonable number of topics. In actual applications, topic models are typically used to answer a research question related to a collection of texts. We can incorporate this research question in the prompt to the LLM, which helps estimate the optimal number of topics.
Tokenization and the Noiseless Channel (ACL 2023)
paper, tool, video

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell
Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) that low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82 in comparison to just -0.30 for compressed length.
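The efficiency ratio described in this abstract can be illustrated with a few lines of Python. This is only a sketch under stated assumptions, not the paper's released tool: the function names are hypothetical, the vocabulary is taken to be the set of observed tokens (at least two distinct ones), and `alpha=2.5` in the usage below is just an illustrative setting of the Rényi order.

```python
import math
from collections import Counter

def renyi_entropy(probs, alpha):
    """Rényi entropy in nats; alpha = 1 is treated as the Shannon limit."""
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)

def efficiency(tokens, alpha):
    """Entropy of the unigram token distribution divided by its maximum, log |V|.

    Assumption: the vocabulary V is the set of token types seen in `tokens`
    and contains at least two types (otherwise log |V| would be zero).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return renyi_entropy(probs, alpha) / math.log(len(counts))

# Toy token streams (made up for illustration).
uniform = ["a", "b", "c", "d"]
skewed = ["a"] * 97 + ["b", "c", "d"]

shannon_eff = efficiency(skewed, alpha=1.0)
renyi_eff = efficiency(skewed, alpha=2.5)
```

On the skewed stream, the Rényi-based value is lower than the Shannon-based one, which reflects the stronger penalty on a dominant high-frequency token; on the uniform stream both are 1.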
A Formal Perspective on Byte-Pair Encoding (ACL 2023)
paper, video, code

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a (1/σ)(1 − e^(−σ))-approximation of an optimal merge sequence, where σ is the total backward curvature with respect to the optimal merge sequence. Empirically the lower bound of the approximation is ~0.37. We provide a faster implementation of BPE which improves the runtime complexity from O(NM) to O(N log M), where N is the sequence length and M is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
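For readers unfamiliar with the greedy procedure the abstract refers to, here is a minimal sketch of the naive O(NM) variant: each of the M merge steps rescans the whole sequence. It is not the faster implementation from the paper, and the function names and the tie-breaking rule (first pair encountered) are assumptions made for illustration.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace non-overlapping occurrences of `pair` in `seq`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def greedy_bpe(text, num_merges):
    """Naive greedy BPE over a single character sequence.

    Each step counts adjacent symbol pairs, picks the most frequent one
    (ties go to the pair seen first), and merges it everywhere.
    """
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break  # fewer than two symbols left, nothing to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        seq = merge_pair(seq, best)
    return merges, seq

merges, seq = greedy_bpe("abababab", num_merges=2)
```

With this toy input the first merge is `("a", "b")` and the second is `("ab", "ab")`, leaving the sequence `["abab", "abab"]`.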
Evaluating Optimal Reference Translations (JNLE, to appear 2024)
paper, dataset, code

Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar
The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are neither suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned and comparable quality levels have been reached by MT alone in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this article, we propose a methodology for creating more reliable document-level human reference translations, called 'optimal reference translations,' with the simple aim to raise the bar of what should be deemed 'human translation quality.' We evaluate the obtained document-level optimal reference translations in comparison with 'standard' ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies (In review 2024)
paper, tool, demo, code

Tom Kocmi, Vilém Zouhar, Christian Federmann, Matt Post
Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the 'dynamic range' of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask what point difference X in metric Y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values with regard to test set size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
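Pairwise system accuracy, as commonly used in MT metric evaluation, is the share of system pairs that a metric orders the same way as human judgments. The sketch below shows that basic computation only; the function name, the treatment of ties as disagreements, and the toy scores are assumptions for illustration and do not reproduce the paper's delta-bucketed analysis or its ToShip23 data.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs where metric and human score differences share a sign.

    Both arguments map system name -> system-level score and must cover the
    same systems (at least two). Ties in either score count as disagreement,
    which is one possible convention among several.
    """
    agree, total = 0, 0
    for a, b in combinations(metric_scores, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / total

# Toy scores, made up for illustration only.
metric = {"A": 30.0, "B": 28.0, "C": 25.0}
human = {"A": 80.0, "B": 82.0, "C": 70.0}
accuracy = pairwise_accuracy(metric, human)
```

Here the metric and humans disagree only on the A/B pair, so two of the three pairs agree.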
Quality and Quantity of Machine Translation References for Automated Metrics (In review 2024)
paper, code

Vilém Zouhar, Ondřej Bojar
Automatic machine translation metrics often use human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment level. Having up to 7 references per segment and taking their average helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.
PWESuite: Phonetic Word Embeddings and Tasks They Facilitate (In review 2024)
paper, code

Vilém Zouhar, Kalvin Chang, Chenxuan Cui, Nathaniel Carlson, Nathaniel Robinson, Mrinmaya Sachan, David Mortensen
Word embeddings that map words into a fixed-dimensional vector space are the backbone of modern NLP. Most word embedding methods encode semantic information. However, phonetic information, which is important for some tasks, is often overlooked. In this work, we develop several novel methods which leverage articulatory features to build phonetically informed word embeddings, and present a set of phonetic word embeddings to encourage their community development, evaluation and use. While several methods for learning phonetic word embeddings already exist, there is a lack of consistency in evaluating their effectiveness. Thus, we also propose several ways to evaluate both intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and extrinsic performances, such as rhyme and cognate detection and sound analogies. We hope that our suite of tasks will promote reproducibility and provide direction for future research on phonetic word embeddings.
Scaling the Authoring of AutoTutors with Large Language Models (In review 2024)
paper

Sankalan Pal Chowdhury, Vilém Zouhar, Mrinmaya Sachan
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using LLMs to author Intelligent Tutoring Systems. A common pitfall of LLMs is their straying from desired pedagogical strategies such as leaking the answer to the student, and in general, providing no guarantees. We posit that while LLMs with certain guardrails can take the place of subject experts, the overall pedagogical design still needs to be handcrafted for the best learning results. Based on this principle, we create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a pre-defined finite state transducer. This approach retains the structure and the pedagogy of traditional tutoring systems that have been developed over the years by learning scientists but brings in additional flexibility of LLM-based approaches. Through a human evaluation study on two datasets based on math word problems, we show that our hybrid approach achieves a better overall tutoring score than an instructed, but otherwise free-form, GPT-4. MWPTutor is completely modular and opens up the scope for the community to improve its performance by improving individual modules or using different teaching strategies that it can follow.
Poor Man's Quality Estimation: Predicting Ref.-Based MT Metrics Without Reference (EACL 2023)
paper, video, code

Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jiang, Mrinmaya Sachan
Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements, yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME) where one predicts the automated metric scores also without the reference. We show that even without access to the reference, our model can estimate automated metrics (ρ = 60% for BLEU, ρ = 51% for other metrics) at the sentence level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training from scratch (ρ = 20%).
Sentence Ambiguity, Grammaticality and Complexity Probes (BlackboxNLP 2022)
paper, code

Sunit Bhattacharya,= Vilém Zouhar,= Ondřej Bojar
It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity. We present results of automatic classification of these traits and compare their viability and patterns across representation types. We demonstrate that template-based datasets with surface-level artifacts should not be used for probing, that careful comparisons with baselines should be done, and that t-SNE plots should not be used to determine the presence of a feature among dense vector representations. We also show how features might be highly localized in the layers for these models and get lost in the upper layers.
Shrinking Knowledge Base Size: Dimension Reduction, Splitting & Filtering (Master thesis 2022)
link, code

Vilém Zouhar
Recently neural network based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base which requires significant memory and compute resources, especially when scaled up. On HotpotQA we explore various filtering & splitting criteria. Primarily, we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100× compression with 75%, and (2) 24× compression with 92% original retrieval performance.
Knowledge Base Index Compression via Dimensionality and Precision Reduction (SpaNLP 2022)
paper, video, code

Vilém Zouhar, Marius Mosbach, Miaoran Zhang, Dietrich Klakow
Recently neural network based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB) which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100× compression with 75%, and (2) 24× compression with 92% original retrieval performance.
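The recipe in this abstract (center and normalize, reduce dimensions with PCA, center and normalize again, then keep 1 bit per dimension) can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the paper's code: the index is random toy data, the sizes (64 to 16 dimensions) are arbitrary, the helper names are hypothetical, and the sign of each coordinate is used as the 1-bit code.

```python
import numpy as np

def center_and_normalize(x, mean):
    """Subtract `mean`, then scale every row to unit L2 norm."""
    x = x - mean
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def fit_pca(x, dim):
    """Return a (features x dim) projection onto the top principal directions."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T

# Toy stand-in for a dense retrieval index: 1000 vectors, 64 float32 dims.
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64)).astype(np.float32)

# Pre-processing: center and normalize before the reduction.
x = center_and_normalize(index, index.mean(axis=0))

# Dimension reduction with PCA (64 -> 16 dimensions).
projection = fit_pca(x, dim=16)
reduced = x @ projection

# Post-processing: center and normalize again after the reduction.
reduced = center_and_normalize(reduced, reduced.mean(axis=0))

# Precision reduction: keep only the sign, 1 bit per dimension, packed into bytes.
bits = np.packbits(reduced > 0, axis=1)
ratio = index.nbytes // bits.nbytes
```

In this toy setup each vector shrinks from 256 bytes to 2 bytes. Queries would need the same means and projection applied before comparison, and retrieval quality would have to be measured separately on real data.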
Neural Machine Translation Quality and Post-Editing Performance (EMNLP 2021)
paper, video, code

Vilém Zouhar, Ondřej Bojar, Martin Popel, Aleš Tamchyna
We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and also got adopted by most translation companies. Through an experimental study involving over 30 professional translators for English -> Czech translation, we examine the relationship between NMT performance and post-editing time and quality. Across all models, we found that better MT systems indeed lead to fewer changes in the sentences in this industry setting. The relation between system quality and post-editing time is however not straightforward and, contrary to the results on phrase-based MT, BLEU is definitely not a stable predictor of the time or final output quality.
Providing Backtranslation Improves Users Confidence in MT, Not Quality (NAACL 2021)
paper, video, code

V. Zouhar, M. Novák, M. Žilinec, O. Bojar, M. Obregón, R. L. Hill, F. Blain, M. Fomicheva, L. Specia, L. Yankovskaya
Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.
Artefact Retrieval: Overview of NLP Models with Knowledge Base Access (AKBC CSKB 2021)
paper

Vilém Zouhar, Marius Mosbach, Debanjali Biswas, Dietrich Klakow
Many NLP models gain performance by having access to a knowledge base. A lot of research has been devoted to devising and improving the way the knowledge base is accessed and incorporated into the model, resulting in a number of mechanisms and pipelines. Despite the diversity of proposed mechanisms, there are patterns in the designs of such systems. In this paper, we systematically describe the typology of *artefacts* (items retrieved from a knowledge base), retrieval mechanisms and the way these artefacts are *fused* into the model. This further allows us to uncover combinations of design decisions that had not yet been tried. Most of the focus is given to language models, though we also show how question answering, fact-checking and knowledgeable dialogue models fit into this system as well. Having an abstract model which can describe the architecture of specific models also helps with transferring these architectures between multiple NLP tasks.
Sampling and Filtering of Neural Machine Translation Distillation Data (NAACL SRW 2021)
paper, code

Vilém Zouhar
In most neural machine translation distillation or stealing scenarios, the highest-scoring hypothesis of the target model (teacher) is used to train a new model (student). If reference translations are also available, then better hypotheses (with respect to the references) can be oversampled and poor hypotheses either removed or undersampled. This paper explores the sampling method landscape (pruning, hypothesis oversampling and undersampling, deduplication and their combination) with English to Czech and English to German MT models using standard MT evaluation metrics. We show that careful oversampling and combination with the original data leads to better performance when compared to training only on the original or synthesized data or their direct combination.
Leveraging Neural Machine Translation for Word Alignment (PBML 116)
paper, code

Vilém Zouhar, Daria Pylypenko
The most common tools for word-alignment rely on a large amount of parallel sentences, which are then usually processed according to one of the IBM model algorithms. The training data is, however, the same as for machine translation (MT) systems, especially for neural MT (NMT), which itself is able to produce word-alignments using the trained attention heads. This is convenient because word-alignment is theoretically a viable byproduct of any attention-based NMT, which is also able to provide decoder scores for a translated sentence pair. We summarize different approaches on how word-alignment can be extracted from alignment scores and then explore ways in which scores can be extracted from NMT, focusing on inferring the word-alignment scores based on output sentence and token probabilities. We compare this to the extraction of alignment scores from attention. We conclude with aggregating all of the sources of alignment scores into a simple feed-forward network which achieves the best results when combined alignment extractors are used.
WMT20 Document-Level Markable Error Exploration (WMT 2020)
paper, code

Vilém Zouhar, Tereza Vojtěchová, Ondřej Bojar
Even though sentence-centric metrics are used widely in machine translation evaluation, document-level performance is at least equally important for professional usage. In this paper, we bring attention to detailed document-level evaluation focused on markables (expressions bearing most of the document meaning) and the negative impact of various markable error phenomena on the translation. For an annotation experiment of two phases, we chose Czech and English documents translated by systems submitted to WMT20 News Translation Task. These documents are from the News, Audit and Lease domains. We show that the quality and also the kind of errors varies significantly among the domains. This systematic variance is in contrast to the automatic evaluation results. We inspect which specific markables are problematic for MT systems and conclude with an analysis of the effect of markable error types on the MT performance measured by humans and automatic evaluation tools.
Extending Ptakopět for MT User Interaction Experiments (PBML 115)
paper, code

Vilém Zouhar, Michal Novák
The problems of outbound translation, machine translation user confidence and user interaction are not yet fully explored. The goal of the online modular system Ptakopět is to provide tools for studying these phenomena. Ptakopět is a proof-of-concept system for examining user interaction with enhanced machine translation. It can be used either for actual translation or running experiments on human annotators. In this article, we aim to describe its main components and to show how to use Ptakopět for further research. We also share tips for running experiments and setting up a similar online annotation environment. Ptakopět was already used for outbound machine translation experiments, and we cover the results of the latest experiment in a demonstration to show the research potential of this tool. We show quantitatively that even though backward translation improves machine-translation user experience, it mainly increases users’ confidence and not the translation quality.
Outbound Translation User Interface Ptakopět: A Pilot Study (LREC 2020)
paper, code

Vilém Zouhar, Ondřej Bojar
It is not uncommon for Internet users to have to produce a text in a foreign language they have very little knowledge of and are unable to verify the translation quality. We call the task “outbound translation” and explore it by introducing an open-source modular system Ptakopět. Its main purpose is to inspect human interaction with MT systems enhanced with additional subsystems, such as backward translation and quality estimation. We follow up with an experiment on (Czech) human annotators tasked to produce questions in a language they do not speak (German), with the help of Ptakopět. We focus on three real-world use cases (communication with IT support, describing administrative issues and asking encyclopedic questions) from which we gain insight into different strategies users take when faced with outbound translation tasks. Round trip translation is known to be unreliable for evaluating MT systems but our experimental evaluation documents that it works very well for users, at least on MT systems of mid-range quality.
Enabling Outbound Machine Translation (Bachelor thesis 2020)
link, code

Vilém Zouhar
It is not uncommon for Internet users to have to produce text in a foreign language they have very little knowledge of and are unable to verify the translation quality. We call the task “outbound translation” and explore it by introducing an open-source modular system Ptakopět. Its main purpose is to inspect human interaction with machine translation systems enhanced by additional subsystems, such as backward translation and quality estimation. We follow up with an experiment on (Czech) human annotators tasked to produce questions in a language they do not speak (German), with the help of Ptakopět. We focus on three real-world use cases (communication with IT support, describing administrative issues and asking encyclopedic questions) from which we gain insight into different strategies users take when faced with outbound translation tasks. Round trip translation is known to be unreliable for evaluating MT systems but our experimental evaluation documents that it works very well for users, at least on MT systems of mid-range quality.

Less-serious projects

Metaphor Preservation in Machine Translation and Paraphrasing (2023)
paper, code


Metaphors play a crucial role in human communication. Improving the handling of metaphors in NLP will enhance the quality and accuracy of cross-lingual communication, benefiting various applications such as multilingual chatbots, localization, and cross-cultural understanding. This paper reports an evaluation that focuses on the analysis of metaphor presence and preservation in machine-translated and paraphrased texts. The results suggest that textual language models do not have access to the metaphorical meaning and do not fully understand this literary device. They are not sensitive to the subtle differences between various paraphrases but can be used for the rudimentary analysis of machine translation output, which varies greatly with respect to metaphor preservation.
Ryanize bib (2023)
tool, code


Tool to check for common BibTeX best practice violations
Poetry, Songs, Literature, Legalese and Translationese (2023)
paper, code


Although non-trivial to measure, natural texts come in varying complexities. As a result, multiple domains and genres can be compared based on their complexities. In this study, focused on measuring sentence complexity, I use automated methods of complexity estimation to compare poetry, natural prose, literary prose and machine and human translation. The conclusion is that old poetry and old literature are more complex than their modern counterparts, as measured by language model complexity, Flesch Reading Ease and syntactic depth. Furthermore, we observe that machine translations are faithful to human references in terms of sentence complexity, which is a positive result for the translation industry. Most importantly, this paper discusses the reason for different complexities across varying text domains, which is framed as "form (complexity) follows function and aesthetics with least effort."
Stolen Subwords (2023)
paper, code


In learning-based functionality stealing, the attacker is trying to build a local model based on the victim's outputs. The attacker has to make choices regarding the local model's architecture, optimization method and, specifically for NLP models, subword vocabulary, such as BPE. On the machine translation task, we explore (1) whether the choice of the vocabulary plays a role in model stealing scenarios and (2) if it is possible to extract the victim's vocabulary. We find that the vocabulary itself does not have a large effect on the local model's performance. Given gray-box model access, it is possible to collect the victim's vocabulary by collecting the outputs (detokenized subwords on the output). The finding that vocabulary choice has only a minimal effect is also important more broadly for black-box knowledge distillation.
Multimodal Shannon Game with Images Preprint 2022
paper demo code

Vilém Zouhar=, Sunit Bhattacharya=, Ondřej Bojar
The Shannon game has long been used as a thought experiment in linguistics and NLP, asking participants to guess the next letter in a sentence based on its preceding context. We extend the game by introducing an optional extra modality in the form of image information. To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2). We show that the addition of image information improves both self-reported confidence and accuracy for both humans and the LM. Certain word classes, such as nouns and determiners, benefit more from the additional modality information. The priming effect in both humans and the LM becomes more apparent as the context size (extra modality information + sentence context) increases. These findings highlight the potential of multimodal information in improving language understanding and modeling.
ÚFAL Bilingual scientific abstracts corpus 2022
code

Rudolf Rosa, Vilém Zouhar
This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English) and its translation (English or Czech) in the internal Biblio system. The data was filtered for duplicates and missing entries, ensuring that every record is bilingual. Additionally, records of published papers which are indexed by SemanticScholar contain the respective link. The dataset was created from a September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
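Since the corpus is stored as JSONL, reading it could look like the following minimal sketch. The field names `cs` and `en` and the sample records are assumptions made for illustration only; the actual schema of the dataset is not described here.

```python
import json


def read_jsonl(lines):
    """Parse one JSON record per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Hypothetical records; the real field names may differ.
sample = [
    '{"cs": "Ahoj", "en": "Hello"}',
    "",
    '{"cs": "Svět", "en": "World"}',
]
records = read_jsonl(sample)
print(len(records))
```

With a file on disk, the same function could be called on an open file handle, because iterating over a text file yields its lines.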
Stroop Effect in Multi-Modal Sight Translation Preprint 2022
paper code

Sunit Bhattacharya, Vilém Zouhar, Věra Kloudová, Ondřej Bojar
This study investigates the human translation process from English to Czech in a multi-modal scenario (images) using reaction times. We make a distinction between ambiguous and unambiguous sentences where in the former, more information would be needed in order to make a proper translation (e.g. gender of the subject). Simultaneously, we also provide visual aid to help in disambiguation, which is necessary for the ambiguous sentences. We confirm that ambiguous sentences take longer to translate and the provision of disambiguating visual aid slows the translation process. When provided with an unrelated visual aid, humans are able to recognize and spend less time on it but still significantly more than in other conditions. These findings are a clear manifestation of the Stroop effect (longer processing times for incongruent combinations).
Machine Translate 2022
link code


Open resources and community for machine translation
Random Strum Pattern Generator 2022
tool


Little tool to generate practice guitar strum patterns.
Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models Preprint 2021
paper code

Vilém Zouhar, Marius Mosbach, Dietrich Klakow
Although masked language models are highly performant and widely adopted by NLP practitioners, they cannot be easily used for autoregressive language modelling (next word prediction and sequence probability estimation). We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion (e.g. concatenation) to obtain a richer context representation for language modelling. We find that fusion helps reliably in lowering the perplexity (16.74 → 15.80), which is even preserved after a transfer to a dataset from a different domain than the training data. We also evaluate the best-performing fusion model by correlating its next word surprisal estimates with human reading times. Contradicting our expectation, and despite the improvement in perplexity overall, the correlation remains the same as for the baseline model. Lastly, while we focus on language models pre-trained on text as the sources for the fusion, our approach could possibly be extended to fuse any information represented as a fixed-size vector into an autoregressive language model. This includes, e.g., sentence-external information retrieved from a knowledge base or representations of multi-modal encoders.
EMMT: An eye-tracking, EEG and audio corpus for multi-modal reading and translation Preprint 2021
paper code

Sunit Bhattacharya, Vilém Zouhar, Věra Kloudová, Ondřej Bojar
We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants. The objective was to collect cognitive signals as responses of participants engaged in a number of language-intensive tasks involving different text-image stimuli settings when translating from English to Czech. Each participant was exposed to 32 text-image stimuli pairs and asked to (1) read the English sentence, (2) translate it into Czech, (3) consult the image, (4) translate again, either updating or repeating the previous translation. The text stimuli consisted of 200 unique sentences with 616 unique words coupled with 200 unique images as the visual stimuli. The recordings were collected over a two-week period and all the participants included in the study were Czech natives with strong English skills. Due to the nature of the tasks involved in the study and the relatively large number of participants involved, the corpus is well suited for research in Translation Process Studies and Cognitive Sciences, among other disciplines.
Deep Molecule QSPR 2021
code

Nikola Kalábová, Vilém Zouhar
The goal of Deep Molecule QSPR (Quantitative Structure-Property Relationships) is to predict several key temperature points of a molecule. The input is a graph of the given molecule and the output is a single number: the predicted boiling or melting point. The model uses graph-informed feature extraction, whose output is fed either to simple feed-forward neural networks, which achieve strong performance, or to a simple linear regression model, which allows for a degree of explainability. The novel contributions include the feature extraction itself (various atom weighting and structural functions), applicability to a wide range of molecule classes and the combination with a neural network to gain better performance compared to widely used linear regression models.
Statistical Natural Language Processing Tutorials Teaching material 2021
link

Vilém Zouhar, Awantee Deshpande, Julius Steuer
Fact Learning with Adaptive Color Palette: Effect of Stimuli-Independent Hints 2021
code

Vilém Zouhar, Leander van Boven, Tianyi Li, Anjali Nair
This paper focuses on fact learning and potential improvements via color feedback. In one setting, the users see a color based on the difficulty estimated by SlimStampen. In another setting, each stimulus is mapped to a random but constant color. We find that a key property of the task is the high individual variance, which prevents statistically significant conclusions. The results, however, suggest that certain conditions can increase learning speed, though this improvement is not retained during testing. The results also change when viewed from the perspective of test accuracy or number of learned words.
Hyperparameters of RNN Architectures for POS Tagging using Surface-Level BERT Embeddings 2020
paper code


Contextual embeddings from pretrained BERT models have been useful in a variety of natural language processing tasks. This paper focuses on one of the most basic of those tasks, Part of Speech tagging, and compares several simple recurrent neural network models (vanilla RNN, GRU, LSTM) and their hyperparameters (hidden state size, recurrent layers and dropout, bidirectional, dense layers). While stacking recurrent layers or densely connected output layers negatively affects the performance, adding bidirectionality and increasing hidden state size improve it significantly.
SlowAlign 2020
code


Word alignment is a well-established task, which has found its use mostly in phrase-based machine translation (PBMT). This report presents SlowAlign, a system combining multiple hard alignment extracting strategies, which are determined by a small number of parameters. The main functionalities of SlowAlign are (1) heuristic parameter estimation in a supervised fashion using grid search, (2) combination of multiple soft alignments and (3) data-less alignment based on diagonal alignment, Levenshtein distance and blurring.
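For readers unfamiliar with the Levenshtein distance mentioned above, here is a minimal sketch of the standard dynamic-programming computation. It only illustrates the distance itself; how SlowAlign turns it into alignment scores is not described here, and the function name is an assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                 # delete char_a
                curr[j - 1] + 1,             # insert char_b
                prev[j - 1] + substitution,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

A low distance between a source word and a target word (for example, shared names or cognates) is a plausible signal that the two words are aligned, which is why string distances are useful when no training data is available.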
Slow Align Displayer 2020
link code


Creates quick graphs given word alignment.
MosQEto 2019
link code

Vilém Zouhar, Ondřej Měkota
Word-level quality estimation (QE) of machine translation is a task aiming to identify badly translated words and spaces between words. We propose MosQEto, a framework for experiment replication of QE systems. We also experiment with several methods that try to improve the quality estimation of systems implemented in OpenKiwi by smartly preprocessing and synthesizing training data.
Dorfromantik solver 2023
tool code


A tool to help with a small game.
Call for Menza 2019
code


Aggregator of daily menus around Charles University. Unmaintained since I moved out of Prague.
SMAKE 2019
code

Vilém Zouhar, Petr Houška
Simple Markable And Keyword Extraction.
TNTranslator 2019
code


Translation inspector for n-best list navigator.
A Collection of Machine Learning Exercises Teaching material 2018
link

Martin Holub, Barbora Vidová Hladká, Vilém Zouhar
ZimaDB 2018
code

Vilém Zouhar, Petr Chmel
SQLite-like database implementation from scratch.
ASM Hell 2017
demo code


Learn basic assembly instructions through a game (LD41 submission).
Prolog KNN 2017
code


Implementation of kNN in Prolog, a language least suited for this.

Miscellaneous



I'm currently advised by Mrinmaya Sachan and Menna El-Assady. Previously, during my bachelor's and master's, I was advised by Dietrich Klakow and Ondřej Bojar. I had the privilege to supervise Yijie Tong, Haokun He, Abhinav Kumar, and David Gu.

In my free time I'm interested in veganism, videogames and literature.