Giulia Preti

Researcher @ CENTAI

«That quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.»
“The Hitchhiker's Guide to the Galaxy” by Douglas Adams

All hail!

I am Giulia Preti.

I am a Researcher in the fields of Graph Mining and Complex Network Analysis at CENTAI, Turin (Italy).

I was a Post Doc in the area of Learning and Algorithms for Data Analytics at ISI Foundation, Turin (Italy), under the coordination of Francesco Bonchi.

I was a member of the DbTrento research group, and I worked on techniques for mining relevant structures in dynamic and heterogeneous datasets.

I got my PhD in Information and Communication Technology at the University of Trento (Italy), under the supervision of Prof. Yannis Velegrakis.

I was also the teaching assistant for the Computability and Computational Complexity course from 2015 to 2018.

I got my master's degree in Computer Science and my bachelor's degree in mathematics, both of which pursued at the University of Trento.


[Download CV]

Research Interests:

My research focuses on graphs, a versatile data model that has been increasingly used to represent a large plethora of data, from biology to social networks, and from computer networks to smart cities. In particular, I consider weighted graphs and dynamic graphs.

Weighted graphs are graphs whose nodes and edges are labeled with weights indicating their relevance or quality. Moreover, in applications aiming at offering personalized products and services to each user rather than ``one size fits all'' solutions, each element of the graph naturally carries multiple weights, one for each user. My goal is to identify structures that appear frequently in the graph and whose appearances are characterized by large weights, and hence are relevant for the user, under the assumption that larger weights indicate higher interest.

Dynamic graphs are graphs that change over time, meaning that their nodes and edges can undergo both structural and attribute changes. They are generally modeled as sequences of static graphs called snapshots. In this context, I am interested in detecting groups of edges that evolve in a convergent manner, meaning that they display a positive correlation in their behavior. These groups of correlated edges, especially when they involve topologically close edges, can represent regions of interest in the network.

More recently, I started investigating pattern mining techniques and null models for complex networks such as hypergraphs and simplicial complexes, which are higher-order generalizations of graphs able to model k-ary relations among entities. Hypergraphs are collections of hyperedges, which are edges that connect more than two vertices. Conversely, a simplicial complex is a collection of polytopes such as triangles and tetrahedra, which are called simplices.

During my PhD studies, I also worked on entity resolution in highly heterogeneous and temporal databases, defined as collections of records characterized by different schemata and timestamps indicating the date of creation. The reconciliation of the records in this kind of situation requires specialized similarity functions that take into consideration both the heterogeneity and the dynamism of the data. In my work, I proposed a suitable time-aware schema-agnostic similarity measure and a framework that uses this measure to identify maximal groups of similar temporal records.

Publications:

Hyper-distance Oracles in Hypergraphs.


The VLDB Journal, 2023

MaNIACS: Approximate Mining of Frequent Patterns through Sampling.


TIST, 2023
10.1145/3587254

ALICE and the Caterpillar: A More Descriptive Null Model for Assessing Data Mining Results.


ICDM, 2022
(PDF)

FreSCo: Mining Frequent Patterns in Simplicial Complexes.


The Web Conference, 2022
(PDF)

MaNIACS: Approximate Mining of Frequent Patterns through Sampling.


KDD, 2021
(PDF)

Discovering Dense Correlated Subgraphs in Dynamic Networks.


PAKDD, 2021
(PDF)

STruD: Truss Decomposition of Simplicial Complexes.


The Web Conference, 2021
(PDF)

Mining Dense Subgraphs with Similar Edges.


ECML-PKDD, 2020
(PDF)

ExCoDE: a Tool for Discovering and Visualizing Regions of Correlation in Dynamic Networks.


ICDM Workshops, 2019
(PDF) (poster)

Mining Patterns in Graphs with Multiple Weights.


Distributed and Parallel Databases Journal, 1-39, 2019
https://doi.org/10.1007/s10619-019-07259-w

Beyond Frequencies: Graph Pattern Mining in Multi-weighted Graphs.


EDBT, 2018
(PDF) (PPT)

Projects and Code:

HypED

Overview

HypED is an algorithm to answer point-to-point s-distance queries in hypergraphs, approximately. The algorithm can answer three types of queries: vertex-to-hyperedge, vertex-to-vertex, and hyperedge-to-hyperedge. This is achieved by constructing a distance oracle, which can be stored on disk for future usage.

The distance oracle stores distances from landmark hyperedges to reachable hyperedges, so that the distance between two hyperedges can be approximated via triangle inequalities. The algorithm requires in input an integer L used to compute the desired oracle size O = L x |E|, where |E| is the number of hyperedges in the hypergraph.

Code

The code of this project is publicly available on GitHub.

ALICE

Overview

ALICE is a suite of two Markov-Chain Monte-Carlo algorithms for sampling datasets from our novel null model, based on a carefully defined set of states and efficient operations to move between them.

This null model preserves the Bipartite Joint Degree Matrix of the bipartite graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to the item supports and the transaction lengths.

ALICE-A is based on Restricted Swap Operations (RSOs) on biadjacency matrices, which preserve the BJDM. ALICE-B adapts the CURVEBALL approach to RSOs, to essentially perform multiple RSOs at every step, thus leading to faster mixing.

Code

The code of this project is publicly available on GitHub.

FreSCo

Overview

FreSCo is an algorithm to find frequent patterns in simplicial complexes. A pattern, or simplet is defined as a subcomplex, and its frequency is determined by its occurrences in the complex.

The algorithm generalizes the Minimum Node Image-based (MNI) frequency measure to the complex setting and then returns the set of patterns whose frequency is greater than a given threshold.

Code

The code of this project is publicly available on GitHub.

MaNIACS

Overview

MaNIACS is a sampling-based randomized algorithm for computing approximations of the collection of the subgraph patterns that are frequent in a single vertex-labeled graph, according to the Minimum Node Image-based (MNI) frequency measure.

The output of MaNIACS comes with strong probabilistic guarantees. The quality of the approximation is obtained using the empirical Vapnik-Chervonenkis (VC) dimension, a key concept from statistical learning theory. In particular, given a failure probability, a frequency threshold, and a sample size, with at least such probability over the choice of the sample of such size, the output of MaNIACS contains each pattern of size k with relative MNI frequency greater than the threshold and with estimated frequency within epsilon from the relative MNI frequency.

MaNIACS leverages properties of the frequency function to aggressively prune the pattern search space, and thus to reduce the time spent in exploring subspaces containing no frequent patterns. The framework includes both an exact and an approximate mining algorithm.

Code

The code of this project is publicly available on GitHub.

STruD

Overview

STruD is a framework to perform the simplicial truss decomposition of simplicial complexes. A simplicial complex is a generalization of a graph: a collection of n-ary relationships (instead of binary as the edges of a graph), named simplices.

STruD generalizes the graph notion of truss decomposition to complexes and includes algorithms to find (i) the simplicial truss decomposition of a simplicial complex, (ii) the top-n simplices with maximum trussness, (iii) the k-truss of simplices of size greater or equal to a given q, and (iv) the standard truss decomposition of the 1-skeleton of a simplicial complex.

Code

The code of this project is publicly available on GitHub.

ReSuM

Overview

ReSuM is a framework to mine relevant patterns from large-weight and multi-weighted graphs. Assuming that the importance of a pattern is determined not only by its frequency in the graph, but also by the edge weights of its appearances, we propose four scoring functions to compute the relevance of the patterns. These functions satisfy the apriori property, and thus can rely on efficient pruning strategies.

The framework includes an exact and approximate mining algorithm. The first is characterized by intelligent storage and computation of the pattern scores, while the second is based on the aggregation of similar weighting functions to allow scalability and avoid redundant computations.

Code

The code of this project is publicly available on GitHub, while the datasets used in the experiments may be provided upon request.

ExCoDE

Overview

ExCoDE is a general framework to mine diverse dense correlated subgraphs from dynamic networks. The correlation of a subgraph is computed in terms of the minimum pairwise Pearson correlation between its edges. The density of a subgraph is computed either as the minimum average degree among the snapshots of the networks where the subgraph is active, or as the average average degree among the snapshots of the networks where the subgraph is active. The similarity between different subgraphs is measured as the Jaccard similarity between the corresponding sets of edges.

Code

The demo of this project is available on GitHub, together with several datasets. Additional information may be provided upon request.