Big speech data analytics for contact centers


European Horizon 2020 project No. 645323


Description and terms and conditions of use of BISON simulated contact center data


     1.            Introduction

The BISON project has been funded by the EC Horizon 2020 Framework Programme and aims to bring significant innovations in speech data mining for contact centers (CC). Please see the project web page for more information about BISON.

Data is one of the crucial resources for the innovations planned in the BISON project. It is also one of the main assets for performing big data speech analytics, which is one of the main goals of this project. In BISON, we have to take into account that we are operating in a commercial environment, with customer data that holds significant commercial value and is subject to severe legal limitations and usage restrictions.

As CC data is very sensitive and the BISON consortium needed data for public demonstration and dissemination activities, we have collected a limited amount of simulated CC data that includes no true personal information. This data is commonly called “fake data”. This document accompanies the public release of this data and contains its technical description and terms of use.


     2.            Languages and content

The four most relevant languages for CC partners - Czech, English, French and Spanish - were chosen, and CC partners EBOS, ComData and Telefonica Móviles prepared fake campaigns and recruited speakers (among their employees) who performed the calls. The collection followed several prepared scripts that resemble real calls as closely as possible and reflect the business use cases in BISON.

     3.            Law abiding data collection

Like the whole BISON project, the recording of BISON simulated CC data strictly complied with applicable law.

                        3.1.            Speakers and informed consent

The CC clients were played by known people who signed an informed consent form allowing the intended use of the speech data: “Recorded data will be used for the purposes of the BISON project and will contribute to the provision of better services to CC customers. The recordings may be used for both academic and commercial research and development, and may be made publicly available on the Internet in order to support international R&D community in the area of speech data mining.”

                        3.2.            Personal data

Special care was taken to ensure that no personal data was collected. At the same time, we required the data to be useful for BISON purposes, e.g., by keeping the real format of IDs, telephone numbers and addresses, and by covering several topics around CC operation and the use cases relevant for BISON. Fake personal data was generated with procedures chosen to ensure ethical and legal compliance; the most important points are summarized below:

        the speakers' personal data are never mentioned: fake data are used and combined with the most common data for the population of the given country (see details below);

        brand names are replaced by fake ones to avoid problems with real producers.

Non-trivial procedures were selected to generate fake personal data. In particular, we took care that:

        phone numbers, even if randomly generated, do not correspond even potentially to real customers – that is, they are not only unused today, but also unassignable tomorrow to a new customer, based on the current numbering schemes in the specific countries;

        names are general enough to prevent singling out an individual, even indirectly;

        addresses do not correspond to real ones, yet are realistic enough for the purpose; therefore, a suitable mix of street addresses, non-existing street numbers, and street/city coupling has been used.

The following table shows the procedures for the generation of fake personal information for the public release data.


|  | (Czech Republic) | (United Kingdom, France) | (Spain) |
|---|---|---|---|
| Fake identities | Most common names in the Czech Republic, owned by a high number of persons, e.g. Novak, Prochazka, Novotny | Most common first and last names in UK and France, e.g. James Johnson, Louise Dubois | Web application used to generate real structure of Spanish identities, but with numbers that do not exist nowadays and may not exist in the foreseeable future. |
| Fake numbers | Randomly generated 9-digit numbers coupled with the (non-existing) prefix +579 | Altered area code, replacing the leading 0 with 1, obtaining a non-existing prefix | Altered area code, two-digit codes swapped (e.g. 93 → 39) so as to obtain a non-existing prefix |
| Fake addresses | Real street/square names, fake (non-existing) numbers, different city | Real street/square names, fake (non-existing) numbers, different city | Real street/square names, fake (non-existing) numbers, different city |
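The three number-altering procedures in the table can be illustrated with a minimal Python sketch. The function names and the sample inputs are hypothetical and are not taken from the BISON tooling; the code only performs the string manipulation and does not verify that the resulting prefix is really unassigned in a given national numbering plan, which would have to be checked separately.

```python
import random


def fake_number_random_prefix(rng, prefix="+579", digits=9):
    """Random 9-digit number behind a prefix assumed to be non-existing."""
    return prefix + "".join(str(rng.randrange(10)) for _ in range(digits))


def fake_number_leading_zero(number):
    """Replace the leading 0 of the area code with 1."""
    if not number.startswith("0"):
        raise ValueError("expected a number starting with 0")
    return "1" + number[1:]


def fake_number_swapped_code(number):
    """Swap the two digits of a two-digit area code, e.g. 93 -> 39."""
    if len(number) < 2 or number[0] == number[1]:
        raise ValueError("swap needs two different leading digits")
    return number[1] + number[0] + number[2:]


# Illustrative inputs only (made-up digit strings, not real numbers):
# fake_number_leading_zero("0123456789")  -> "1123456789"
# fake_number_swapped_code("931234567")   -> "391234567"
```

Passing an explicit `random.Random` instance as `rng` keeps the random procedure reproducible, which can be convenient when the same fake data has to be regenerated.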


     4.            Recorded data

                        4.1.            Statistics

The following table summarizes the statistics of the collected data:






|  |  |  |  |  |
|---|---|---|---|---|
| Produced by |  |  |  |  |
|    of which males |  |  |  |  |
|    of which females |  |  |  |  |
| Total duration | 30 min | 54.21 min | 26.78 min | 38.17 min |
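A total duration such as the ones reported above could be recomputed from the WAV files with a short Python sketch. It assumes uncompressed PCM WAV files, which is what the standard `wave` module reads; the function names are hypothetical and this is not necessarily how the reported figures were obtained.

```python
import wave


def wav_duration_minutes(path):
    """Return the duration of one PCM WAV file in minutes."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate() / 60.0


def total_duration_minutes(paths):
    """Sum the durations of several WAV files, in minutes."""
    return sum(wav_duration_minutes(p) for p in paths)
```

For one language, the function could be called with the list of files found in that language's `wav` directory of the unpacked ZIP file.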

                        4.2.            Structure and format

The ZIP file contains 4 directories, each with one language.



    wav/Script_01.wav … Script_10.wav - stereo WAV files with calls
    doc/README.txt - info file
    doc/01 Script Domiciliacion.docx … doc/010 Script Cambio de titular.docx - call scripts

    wav/251796172stereo.wav … 251801300stereo.wav - stereo WAV files with calls
    txt/data for fake calls - information about addresses, numbers and names

    wav/252796179stereo.wav … 252796194.wav - stereo WAV files with calls

    wav/252796184stereo.wav … 252796199.wav - stereo WAV files with calls
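Since the calls are stored as stereo WAV files, a speech analytics pipeline will often want each channel as a separate mono file. The following Python sketch shows one way to do this with the standard `wave` module. It assumes uncompressed PCM data; the function name is hypothetical, and this document does not state which channel carries which speaker, so the two outputs are simply called left and right.

```python
import wave


def split_stereo(src_path, left_path, right_path):
    """Write the two channels of a stereo PCM WAV file to two mono files."""
    with wave.open(str(src_path), "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    step = 2 * width                 # bytes per interleaved stereo frame
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), step):
        left += frames[i:i + width]
        right += frames[i + width:i + step]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(str(path), "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

The sample width and sampling rate are copied from the source file, so the sketch makes no assumption about the recording format beyond PCM stereo.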


     5.            Terms and conditions of use

                        5.1.            Purposes

The simulated CC data is publicly available through the BISON public web-site. It can be used for all legitimate purposes, including (but not limited to) academic research and development, industrial research and development, education, CC agent training, demonstration, testing of one's own speech analytics software, testing of third-party speech analytics software, and serving as an example for similar data collections.

                        5.2.            Collection of information

The BISON consortium does, however, collect information on who downloads the data and for which purposes, and the data is made available only after the required information has been filled in. Individuals, laboratories and companies interested in this data might be contacted with questionnaires, and eventually with business offers, after their lawful informed consent has been obtained.

                        5.3.            Acknowledgements

In case of publication of results on BISON simulated CC data, you are kindly requested to acknowledge the EC funding and the BISON project by stating:

“Collection of BISON simulated CC data was funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645323. The data is available at”.