Extracting and rendering representative sequences

Title: Extracting and rendering representative sequences
Publication Type: Book Chapter
Year of Publication: 2011
Authors: Gabadinho, A, Ritschard, G, Studer, M, Müller, NS
Editor: Fred, A, Dietz, JLG, Liu, K, Filipe, J
Book Title: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Series Title: Communications in Computer and Information Science
Number: Vol. 128
Pagination: 94-106
Publisher: Springer
Place Published: Berlin
ISBN Number: 978-3-642-19031-5
Keywords: categorical sequences, pairwise dissimilarities, discrepancy of sequences, representatives, summarizing sets of sequences, visualization
Abstract

This paper is concerned with the summarization of a set of categorical sequences. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of sequences in their neighbourhood. The proposed heuristic for extracting the representative subset requires as main arguments a pairwise distance matrix, a representativeness criterion and a distance threshold under which two sequences are considered redundant or, equivalently, in the neighbourhood of each other. It first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in our TraMineR R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless by no means limited to social science data and should prove useful in many other domains.
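
The heuristic summarized in the abstract can be illustrated with a minimal Python sketch. This is not the TraMineR implementation. The function name `extract_representatives`, the use of neighbourhood density as the representativeness score, the inclusive `<=` comparison against the threshold, and the toy data are all assumptions made for illustration only.

```python
def extract_representatives(dist, radius, coverage):
    """Greedy sketch: rank candidates by a score, then drop redundant ones.

    dist     -- square pairwise distance matrix (list of lists)
    radius   -- distance threshold defining the neighbourhood (assumed inclusive)
    coverage -- target share of sequences lying within `radius` of a representative
    """
    n = len(dist)
    # Assumed representativeness score: number of sequences in the neighbourhood.
    density = [sum(1 for j in range(n) if dist[i][j] <= radius) for i in range(n)]
    # Step 1: candidate list sorted by decreasing score (ties broken by index).
    candidates = sorted(range(n), key=lambda i: (-density[i], i))

    representatives = []
    covered = set()
    for c in candidates:
        if len(covered) / n >= coverage:
            break  # requested coverage reached
        # Step 2: skip candidates redundant with an already selected representative.
        if any(dist[c][r] <= radius for r in representatives):
            continue
        representatives.append(c)
        covered.update(j for j in range(n) if dist[c][j] <= radius)
    return representatives, len(covered) / n


# Toy data (hypothetical): positions on a line stand in for sequences,
# and absolute differences stand in for pairwise dissimilarities.
points = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in points] for a in points]
reps, achieved = extract_representatives(dist, radius=1.5, coverage=0.8)
print(reps, round(achieved, 3))
```

On this toy input the sketch selects the items at indices 1 and 3, which together have five of the six items in their neighbourhood. A real analysis would pass a dissimilarity matrix computed on the sequences themselves and could substitute another representativeness criterion.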