Speaker identification meets graphs

· 7 min read

In social network analysis, a conventional approach relies heavily on available metadata, allowing to match a virtual entity (social network account) to a real-world entity (person, company) in the network. However, a single person using multiple accounts for any reason obviously breaks the connection, forming multiple virtual entities in the network. Or multiple people can share their account, forming a single virtual entity in the network. If these cases are not taken into account, they can affect reliability of social network analysis significantly without any warning, possibly leading to misinformed decisions and further bad consequences.

A possible solution how to overcome this limitation is to identify real-world entities using a pattern matching of chosen behaviour characteristics, for example extracted from textual or voice content.

In speaker identification systems, audio records are processed to filter out background noise and to extract human voice biometric characteristics, forming a voiceprint. Voiceprints can be matched to determine how likely two voiceprints belong to the same speaker. Voiceprint matching is already widely used for user authentication or identification purposes. Speaker verification is a 1:1 match, where the voiceprint is matched to a chosen stored voiceprint. Speaker identification is a 1:N match, where the voiceprint is matched against N stored voiceprints.

Let’s raise the bar again. Think about having N voiceprints of many accounts, only knowing voiceprint similarities and what voiceprints appeared in the same conversation. The goal is to detect unique speakers across accounts, and what speakers frequently talk to each other. Is it possible?

Background

Phonexia is a company focused on voice biometrics and speech analysis, with a long history of success, proven by their presence at conferences and with clients from more than 60 countries. GraphAware and Phonexia met at a conference, where we actually had our booths next to each other and we quickly realized a potential of putting our minds together to build something new.

The input for our experiments is a matrix of N:N signed floating point numbers, where each number represents voiceprint similarity, indicating how likely the two voiceprints match the same speaker. Speaking in graph terminology, this matrix can be considered as a weighted adjacency matrix of a complete graph, where for our case any value (including 0) means that the two nodes are connected by an edge with weight equal to the value. Another part of the input are pairs of voiceprints, as they appeared in the same conversation.

A naïve approach to detect unique speakers could be to filter out all edges with value below a threshold, detect connected components, and assume that each component forms a single speaker. However, it depends on all values being correct, which is unrealistic expectation in biometrics system. In the diagram below, filtering out negative edges produces only one speaker because of a very high, but false positive value. This effect of incorrect values could propagate through the graph indefinitely, possibly causing only one speaker being detected for the entire graph in the worst case. Instead, by considering community effect of multiple negative links in the diagram, an accurate result identifying two speakers is expected.

Community detection with probabilistic data

This is why the naïve approach doesn’t work with probabilistic data. All available data needs to be used, so that negative values can have a chance to apply its negative effect. In addition, graph structure provides us an ability to apply community effect, so that multiple negative values can overcome occasional false positive values.

Multi-layered Community Detection

Community detection is a class of many graph algorithms usually used in social network analysis, to be able to identify what clusters people form, based on their social links. However, these algorithms typically works on a graph with positive links only, because such data is more easily available.

It is less known that it is also possible to run community detection on a multi-layered graph [1][2], where each layer has the same nodes (or a subset of them), but different edges. These layers can be formed either by edge labels describing various types of relationships, or by snapshots of data changing over time (temporal networks). When running a multi-layered community detection algorithm, it is also possible to assign weights to layers, including negative weights. This is a critical feature for our next steps here.

Multi-layered graph

Basic analysis of our data reveals that most of the voiceprint similarity matches are really negative matches, only a very few of the matches actually designate positive matches. They can be split into two sets, positive and negative. Note that in the histogram, matches don’t need to be split at 0 as in the example in the previous section, but at any threshold, to achieve the best results with given data.

Our threshold is found experimentally by calibrating it according to training data. In a real-world application with dynamic data, the threshold would be configurable by users in the UI, to let them affect algorithm results for their needs, to find the best balance between false acceptance and false rejection rates.

Histogram

There are not many libraries with support for multi-layered graphs yet, a rare case is louvain-igraph, an implementation of the well-known Louvain community detection algorithm in Python language. After splitting all edges into a positive set above the threshold and a negative set below the threshold, Louvain algorithm can be run on both sets together as layers of a multi-layered graph, with positive set having layer weight = 1 and negative set having layer weight = -1.

After connecting the voiceprints with the additional information about voiceprint co-occurrence and layouting the graph, speakers and even communities of speakers starts emerging and it is possible to explore the graph visually.

Part of the graph before and after applying community detection on voiceprints. Four inner virtual accounts were detected as belonging to one real-world entity with high confidence.

A community from the final network, displaying connected speakers

Further steps

Louvain algorithm is a popular greedy algorithm optimizing for minimal modularity, a graph property measuring the density of edges inside communities to edges outside communities. It represents a global approach, with a disadvantage that a change in one part of the graph could affect results in a completely separate part of the graph. Local algorithms could also be considered, such as label propagation.

Recently a flaw in Louvain algorithm was found, unnoticed for a long time. It turns out it may form badly connected, or even disconnected communities. A new Leiden algorithm was introduced, providing explicit guarantees about community connectedness. In several benchmarks and real-world networks, it was found to be faster and uncovering better communities, as compared to Louvain. [3] The original implementation with support for multi-layered graphs is available as leidenalg.

Conclusion

Our experiment with applying community detection to a set of voiceprint similaries found new unique speakers, undetected by conventional methods. Presented approach can be generalized to any set of data, across different domains and semantics. It can be text, image, audio, video, as long as there is a comparator function, returning a similarity measure of a pair of items.

Combined with other available information about how items are connected to each other, graph analysis is a powerful tool to gain insights and actionable knowledge from any connected data, across different domains and semantics. Get in touch with GraphAware to see what we can do with your data!

References

[1] Traag, V. A., and Jeroen Bruggeman. “Community Detection in Networks with Positive and Negative Links.” Physical Review E 80, no. 3 (September 21, 2009). https://doi.org/10.1103/PhysRevE.80.036115.

[2] Mucha, P. J., T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela. “Community Structure in Time-Dependent, Multiscale, and Multiplex Networks.” Science 328, no. 5980 (May 14, 2010): 876–78. https://doi.org/10.1126/science.1184819.

[3] Traag, Vincent, Ludo Waltman, and Nees Jan van Eck. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” ArXiv:1810.08473 [Physics], October 19, 2018. https://arxiv.org/abs/1810.08473.

Jan Zak