Datasets with reference analysis

Algorithms in computational music analysis need to be evaluated. This task is not easy: the same piece often admits many different analyses that are musically relevant. However, some analyses are definitely more correct than others, and reference annotations can be used to evaluate some parts of analysis algorithms.

As computer scientists, we would like to have computer-readable reference datasets that may be used as ground truth to evaluate music information retrieval (MIR) and computational music analysis (CMA) algorithms. But as music theorists, we know that a given piece does not have a single correct analysis: listeners, players, and analysts often disagree, or at least propose several points of view.

Nevertheless, many music theorists, players, and listeners do agree on some analytical elements. The fact that consensus may be difficult to reach on some points should not prevent us from trying to formalize some elements.

Dataset for fugue analysis

Mathieu Giraud, Richard Groult, Florence Levé

This dataset gives a reference analysis for the 24 fugues of the first book of Bach's Well-Tempered Clavier (WTC I, BWV 846–869) and the first 12 fugues from Shostakovich's 24 Preludes and Fugues (op. 87, 1952). These annotations are based on several musicological sources as well as on our own analysis. The file gives the symbolic position (measure number and position in measure) of subjects (S) and counter-subjects (CS), as well as cadences and pedals. We also report slight modifications of S/CS (actual start with respect to the time signature, delayed resolutions...).

As in any analytical work, musicologists may not reach consensus on some analytic elements. This is true even for fundamental elements such as the exact definition of the subject: in 8 of the 24 Bach fugues, at least two sources disagree on the end of the subject. We indicate these alternative subject definitions in the file (but do not report alternative CS).
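One way an evaluation could account for such alternative subject definitions is to accept a predicted subject end if it matches any of the ends proposed by the sources. The sketch below is only an illustration under stated assumptions: the function name `end_is_accepted`, the representation of a position as a `(measure, position in measure)` pair, and the toy values are all hypothetical and do not reflect the actual syntax or content of the dataset file.

```python
from fractions import Fraction

# Hypothetical representation: a symbolic position is a pair
# (measure number, position in measure), the position being an exact fraction.
Position = tuple[int, Fraction]


def end_is_accepted(predicted_end: Position, accepted_ends: set[Position]) -> bool:
    """Return True if the predicted subject end matches any reference alternative."""
    return predicted_end in accepted_ends


# Toy data (not taken from the dataset): two sources disagree on the subject end.
accepted_ends = {(2, Fraction(1)), (2, Fraction(3))}

print(end_is_accepted((2, Fraction(3)), accepted_ends))
print(end_is_accepted((3, Fraction(1)), accepted_ends))
```

With this convention, an algorithm is not penalized for choosing one musicologically defensible subject end over another.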

We collected these data primarily to evaluate our own algorithms for fugue analysis, but they may also be useful in other situations, for instance in evaluating algorithms for pattern extraction or structure analysis.
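As a rough illustration of how reference annotations could serve as ground truth, the following sketch computes precision and recall of predicted occurrences against reference occurrences. It assumes that the annotations have already been parsed into records; the `Occurrence` type, its fields, the exact-match criterion, and the toy values are assumptions made for this example and are not the format of the dataset file.

```python
from fractions import Fraction
from typing import NamedTuple


class Occurrence(NamedTuple):
    """Hypothetical record for one annotated occurrence."""
    label: str          # "S" (subject) or "CS" (counter-subject)
    measure: int        # measure number
    position: Fraction  # position in measure


def evaluate(predicted, reference):
    """Return (precision, recall), counting only exact matches as correct."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Toy data (not taken from the dataset).
reference = [
    Occurrence("S", 1, Fraction(1)),
    Occurrence("S", 3, Fraction(1)),
    Occurrence("CS", 3, Fraction(1)),
]
predicted = [
    Occurrence("S", 1, Fraction(1)),
    Occurrence("S", 3, Fraction(1)),
    Occurrence("S", 5, Fraction(2)),  # spurious detection
]

precision, recall = evaluate(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the simplest possible criterion; a real evaluation might tolerate small offsets in the start position, since the annotations themselves report slightly modified starts.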


  • Dataset: fugues.truth (release 2013.12)
  • Description of the syntax of the file
  • Changelog
    • 2013.12: First release on 12 Shostakovich fugues + minor updates on Bach fugues (960 annotations)
    • 2013.05: First release on 24 Bach fugues (610 annotations)

The annotations include all complete subjects and counter-subjects, as well as pedals and, for Bach, cadences. Further releases will also include incomplete occurrences of S/CS. We welcome any feedback or suggestions.

The dataset is made available under the Open Database License, and any rights in individual contents of the database are licensed under the Database Contents License.

Relevant data


If you use this dataset, please cite the following reference:

The sources used to compile this dataset were the following ones: