Retrieval-Augmented Generation for Health Information Retrieval
This thesis explores current research directions in Retrieval-Augmented Generation (RAG) for Health Information Retrieval, focusing on the integration of Large Language Models (LLMs) to generate accurate and contextually relevant responses to medical queries.
Objectives and tasks include:
- Review of RAG methodologies: Analyze existing RAG architectures, including Naive RAG, Advanced RAG, and Modular RAG, and their applications in healthcare contexts (a minimal pipeline sketch follows this list).
- Dataset evaluation: Examine the characteristics of datasets used in RAG-based medical question answering, noting the predominance of English-language datasets and the implications for model performance and generalization.
- Model performance assessment: Evaluate the effectiveness of proprietary models like GPT-3.5/4 in RAG applications within healthcare, identifying strengths and limitations.
- Development of evaluation frameworks: Propose standardized evaluation metrics and frameworks to assess the performance of RAG-enhanced LLMs in medical question answering tasks.
- Ethical considerations: Investigate the ethical implications of using RAG in healthcare, including issues related to data privacy, model transparency, and bias.
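As a concrete reference point for the first objective, the following is a minimal sketch of a Naive RAG pipeline (dense retrieval followed by conditioned generation). It assumes the sentence-transformers and transformers libraries; the corpus, model names, and prompt template are illustrative placeholders, not the setup to be used in the thesis.

```python
# Minimal sketch of a Naive RAG pipeline for medical QA.
# The corpus, model names, and prompt are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

corpus = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID).",
    "The HPV vaccine helps prevent cervical cancer.",
]

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative retriever
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

def retrieve(query, k=2):
    """Dense retrieval step: return the k passages most similar to the query."""
    query_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    return [corpus[h["corpus_id"]] for h in hits]

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # illustrative LLM

def answer(query):
    """Generation step: condition the LLM on the retrieved context."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\nQuestion: {query}\nAnswer:"
    )
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(answer("Which drug is commonly used first for type 2 diabetes?"))
```

Advanced and Modular RAG variants extend or replace individual stages of this loop (for example, query rewriting, re-ranking, or iterative retrieval), which is part of what the methodological review will cover.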
The expected outcomes may include a comprehensive review of RAG methodologies and their application in health information retrieval, an assessment of dataset characteristics and their impact on model performance, the development of standardized evaluation frameworks for RAG-based medical question answering systems, and insights into the ethical considerations of deploying RAG-enhanced LLMs in healthcare settings.
Students interested in this thesis topic are welcome to contact Marco Viviani (marco.viviani@unimib.it) for further information.
Analyzing Economic Flow Concentration in the Ethereum Ecosystem
This thesis, developed in collaboration with Dr. Davide Mancino from the University of Milano-Bicocca, aims to reconstruct and analyze the network of economic flows within the Ethereum ecosystem to study concentration, fairness, and efficiency under the Proposer-Builder Separation (PBS) regime. Ethereum is one of the most widely used blockchain platforms, where each transaction generates complex economic interactions among multiple actors. With the introduction of PBS, these interactions have become even more intricate, especially in the context of Maximal Extractable Value (MEV), where builders and validators play a central role in transaction ordering and block construction.
The work involves:
- Extracting and processing Ethereum transaction data (including logs and internal traces) for a specific time window (e.g., February 2024), converting values to USD and distinguishing Externally Owned Accounts (EOAs) from smart contracts.
- Reconstructing a directed, weighted “who-pays-whom” graph, separating PBS payments (builder → validator) from ordinary transfers.
- Applying graph-based community detection techniques (e.g., Leiden algorithm) and network analysis metrics (centrality, in/out strength, concentration) to identify dominant actors, clusters, and recurring value-extraction patterns.
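The sketch below illustrates, under simplifying assumptions, how the last two tasks could be wired together: a directed, weighted "who-pays-whom" graph built from already-extracted (payer, payee, USD value) records, Leiden community detection via the python-igraph and leidenalg packages, and in/out strength as a basic concentration indicator. The RPC endpoint, record layout, and toy values are placeholders, and PBS payments are assumed to be flagged separately upstream.

```python
# Sketch of the graph-construction and community-detection steps.
# Assumes transfers are already extracted as (payer, payee, usd_value) tuples;
# the RPC endpoint and toy edge list are placeholders.
import igraph as ig
import leidenalg as la
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://example-rpc.invalid"))  # placeholder endpoint

def is_contract(address):
    """Distinguish smart contracts from EOAs: contracts have non-empty bytecode."""
    return len(w3.eth.get_code(Web3.to_checksum_address(address))) > 0

# Toy edge list: (payer, payee, value in USD).
edges = [
    ("0xA", "0xB", 120.0),
    ("0xB", "0xC", 45.5),
    ("0xA", "0xC", 300.0),
]

# Directed, weighted "who-pays-whom" graph.
g = ig.Graph.TupleList(edges, directed=True, weights=True)

# Leiden community detection on the weighted graph.
partition = la.find_partition(g, la.ModularityVertexPartition, weights="weight")

# Basic concentration indicators: weighted in/out strength per node.
in_strength = g.strength(mode="in", weights="weight")
out_strength = g.strength(mode="out", weights="weight")

for name, s_in, s_out in zip(g.vs["name"], in_strength, out_strength):
    print(name, "in:", s_in, "out:", s_out)
print("communities:", list(partition))
```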
The expected outcome includes a reproducible analysis pipeline, quantitative indicators of economic power concentration, and visual representations such as Sankey diagrams, heatmaps, and stacked bar charts.
Students interested in this thesis topic are welcome to contact both Marco Viviani (marco.viviani@unimib.it) and Davide Mancino (davide.mancino@unimib.it) for further information.
Privacy and LLMs
This thesis, carried out in collaboration with Prof. Giovanni Livraga from the University of Milan (“La Statale”), explores the intersection between privacy preservation and Large Language Models (LLMs). As LLMs are increasingly integrated into applications that process sensitive or personal information, understanding and mitigating privacy risks has become a crucial research challenge.
The work may focus on one or more of the following directions:
- Analysis of privacy threats in the training and deployment of LLMs (e.g., data memorization, model inversion attacks, or information leakage).
- Evaluation of privacy-preserving strategies, such as differential privacy, data sanitization, or synthetic data generation.
- Investigation of how LLMs handle sensitive content in user interactions and how to ensure compliance with privacy regulations (e.g., GDPR).
The thesis will combine a literature study with experimental analysis, using open-source LLMs and publicly available datasets to test privacy-related behaviors and protection mechanisms.
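As one example of the kind of experimental analysis mentioned above, the sketch below probes the "data memorization" threat from the first direction: an open-source model is prompted with the prefix of a hypothetical sensitive record, and its greedy completion is checked for a verbatim reproduction of the remainder. The model choice and canary text are illustrative placeholders, not claims about any specific model.

```python
# Minimal memorization probe: prompt an open-source LLM with the prefix of a
# hypothetical sensitive record and check whether the model reproduces the
# continuation verbatim. Model name and canary text are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small open model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prefix = "Patient record 4411: name John Doe, diagnosis"
secret_continuation = "type 2 diabetes, SSN 123-45-6789"  # hypothetical canary

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,  # greedy decoding: the model's most likely continuation
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# A verbatim match of the canary would suggest memorization/leakage.
print("completion:", completion)
print("canary reproduced:", secret_continuation in completion)
```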
Students interested in this thesis topic are welcome to contact Marco Viviani (marco.viviani@unimib.it) for further information.
Social and Economic Dynamics of Memecoins
This thesis, developed in collaboration with Prof. Barbara Guidi from the University of Pisa, investigates memecoins as a social and economic phenomenon, integrating analyses of social media discourse and blockchain transactions. Building upon previous work on NFTs and Twitter data, this research focuses on Reddit as a primary platform for discussion and community formation around memecoins. The study aims to explore the interplay between online discussions, public sentiment, and market activity.
The thesis may include the following components:
- Data collection from Reddit discussions about selected memecoins and extraction of corresponding blockchain transaction data.
- Natural Language Processing (NLP) analyses, such as sentiment analysis, topic modeling, and trend detection over time.
- Correlation analysis between social dynamics (e.g., sentiment, engagement, emerging topics) and economic signals (e.g., trading volume, price fluctuations).
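A minimal sketch of the correlation component, under the assumption that Reddit comments and daily price data have already been collected: comment-level sentiment is computed with an off-the-shelf transformers pipeline, aggregated per day, and correlated with closing prices. All data below is toy data, and the default sentiment model is used only for illustration.

```python
# Sketch of the sentiment-vs-market correlation step on toy data.
import pandas as pd
from transformers import pipeline

comments = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-03"],
    "text": [
        "DOGE to the moon!",
        "This coin is a scam, stay away.",
        "Holding strong, community is great.",
        "Volume is dying, I'm out.",
    ],
})
prices = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-02", "2024-03-03"],
    "close_usd": [0.14, 0.16, 0.13],
})

sentiment = pipeline("sentiment-analysis")  # default model, illustration only
scores = sentiment(comments["text"].tolist())
# Map POSITIVE/NEGATIVE labels to a signed score in [-1, 1].
comments["sentiment"] = [
    s["score"] if s["label"] == "POSITIVE" else -s["score"] for s in scores
]

# Aggregate sentiment per day and join with daily closing prices.
daily = comments.groupby("date", as_index=False)["sentiment"].mean()
merged = daily.merge(prices, on="date")

# Pearson correlation between daily mean sentiment and closing price.
print(merged[["sentiment", "close_usd"]].corr(method="pearson"))
```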
This work offers an opportunity to combine computational social science, blockchain analytics, and text mining to understand how online communities influence and reflect cryptocurrency market behaviors.
Students interested in this thesis topic are welcome to contact Marco Viviani (marco.viviani@unimib.it) for further information.
Centralization of Social Capital in Blockchain Online Social Media
This thesis, developed in collaboration with Prof. Barbara Guidi from the University of Pisa, explores the “rich-get-richer” effect within the context of blockchain-based social networks. This effect, well known in economics and traditional social systems, describes the tendency for wealth—or influence—to become increasingly concentrated among a few individuals over time. While this phenomenon has been extensively studied in conventional online platforms and economic networks, its presence and dynamics within decentralized social media ecosystems remain largely unexplored.
The goal of this work is to:
- Define a model to quantify an individual’s social capital in blockchain-based social media, considering factors such as posts, comments, followers, interactions, and other forms of feedback.
- Apply the model to empirical data to assess whether social capital tends to become concentrated among a limited number of users over time.
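Purely as an illustration of these two steps (and not as the model to be developed in the thesis), the sketch below scores users with an arbitrary weighted sum of activity and feedback counts and summarizes concentration with a Gini coefficient over the resulting distribution.

```python
# Illustrative sketch only: a toy social-capital score (weighted sum of
# activity and feedback counts) and a Gini coefficient as a simple
# concentration indicator. Weights and data are arbitrary placeholders.
import numpy as np

# Per-user counts: posts, comments, followers, reactions received.
users = {
    "alice": {"posts": 120, "comments": 300, "followers": 900, "reactions": 2500},
    "bob":   {"posts": 10,  "comments": 40,  "followers": 50,  "reactions": 80},
    "carol": {"posts": 3,   "comments": 12,  "followers": 15,  "reactions": 20},
}
weights = {"posts": 1.0, "comments": 0.5, "followers": 2.0, "reactions": 0.1}

def social_capital(counts):
    """Toy score: weighted sum of activity and feedback indicators."""
    return sum(weights[k] * v for k, v in counts.items())

def gini(values):
    """Gini coefficient of a non-negative distribution (0 = equal, ~1 = concentrated)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

scores = [social_capital(c) for c in users.values()]
print({u: round(social_capital(c), 1) for u, c in users.items()})
print("Gini of social capital:", round(gini(scores), 3))
```

The Gini coefficient is a standard inequality measure from economics, which fits the "rich-get-richer" framing; other concentration indicators (e.g., top-k shares) could be reported alongside it.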
The project will combine social network analysis, blockchain data analytics, and computational modeling to better understand inequality and centralization dynamics in decentralized social ecosystems.
Students interested in this thesis topic are welcome to contact Marco Viviani (marco.viviani@unimib.it) for further information.
Reproducing State-of-the-Art Experiments on Non-Monotonic Reasoning in LLMs
This thesis, developed in collaboration with Dr. Luca Herranz-Celotti from the Université Paris Cité, focuses on reproducing and extending recent experiments on non-monotonic reasoning in Large Language Models (LLMs). While most reasoning benchmarks evaluate monotonic inference—where conclusions remain valid when new information is added—human reasoning is often non-monotonic (or defeasible), meaning that new evidence can modify or retract previous conclusions.
Objectives and tasks include:
- Reproducing state-of-the-art experiments on non-monotonic reasoning using existing benchmarks such as LogicBench and SymTex.
- Running and adapting training/evaluation scripts from the original repositories, or loading the benchmarks through Hugging Face dataset loaders.
- Comparing reproduced results with those reported in the literature, using metrics such as accuracy, calibration, and targeted error analysis.
- Extending the evaluation with additional datasets from recent ACL, NAACL, ICLR, NeurIPS, and ICML papers, covering both synthetic and crowd-sourced data.
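Since the comparison relies on metrics such as accuracy and calibration, the following self-contained sketch shows how accuracy and Expected Calibration Error (ECE) can be computed from model predictions and confidences. The values are toy stand-ins for LLM outputs on a benchmark such as LogicBench; dataset loading and model inference are intentionally omitted.

```python
# Self-contained sketch of two comparison metrics: accuracy and Expected
# Calibration Error (ECE). Predictions and confidences are toy values.
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted gap between mean accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Toy model outputs: predicted label, gold label, and model confidence.
y_pred = ["yes", "no", "yes", "yes", "no"]
y_true = ["yes", "no", "no", "yes", "yes"]
conf   = [0.95, 0.80, 0.70, 0.60, 0.55]

correct = [p == t for p, t in zip(y_pred, y_true)]
print("accuracy:", accuracy(y_true, y_pred))
print("ECE:", round(expected_calibration_error(conf, correct), 3))
```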
The expected outcomes include a reproducible experimental pipeline, comparative analyses of LLM performance in non-monotonic reasoning, and insights into their limitations in handling dynamic and contradictory information.
Students interested in this thesis topic are welcome to contact both Marco Viviani (marco.viviani@unimib.it) and Luca Herranz-Celotti (luca.celottiherranz@unimib.it) for further information.
Reproducing State-of-the-Art Experiments in Automated Theorem Proving with Lean
This thesis, developed in collaboration with Dr. Luca Herranz-Celotti from the Université Paris Cité, focuses on reproducing and extending state-of-the-art experiments in automated theorem proving using the Lean proof assistant. Recent advances integrate Large Language Models (LLMs) and machine learning to teach models how to formally prove mathematical theorems, ensuring logical correctness through formal languages like Lean. This research has theoretical and practical relevance, including applications in software verification, mathematics education, and human-machine collaboration for automated theorem discovery.
Objectives and tasks include:
- Reproduce existing SOTA experiments in Lean, including challenging domains such as formalized complex analysis.
- Use community-provided datasets and toolkits, such as LEAN-GitHub, ProofNet, and LeanDojo, for training and evaluating proof-generating models.
- Align reproduced experiments with published SOTA claims, identifying differences due to dataset versions or execution environments.
- Evaluate model performance on formal proof generation tasks using metrics and tools from the original benchmarks.
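For readers unfamiliar with Lean, the following minimal Lean 4 example shows the kind of formally checkable statement and proof these benchmarks target; it uses only core Nat lemmas and is not taken from any of the datasets listed above.

```lean
-- Minimal Lean 4 illustration: commutativity of addition on natural numbers,
-- proved by appealing to the core lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- The same statement proved with a tactic, closer to the step-by-step
-- tactic proofs that LLM-based provers are trained to generate.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```

In the benchmark setting, such handwritten proofs are replaced by model-generated proof terms or tactic sequences, which Lean then checks for logical correctness.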
The expected outcomes include a reproducible experimental pipeline for automated theorem proving in Lean, with analyses of model performance, strengths, and limitations relative to current SOTA methods.
Students interested in this thesis topic are welcome to contact both Marco Viviani (marco.viviani@unimib.it) and Luca Herranz-Celotti (luca.celottiherranz@unimib.it) for further information.
