Big speech data analytics for contact centers
BISON
European Horizon 2020 project No. 645323
Description and terms and conditions of use of BISON simulated contact center data
The BISON project has been funded by the EC Horizon 2020 Framework Programme and aims at bringing significant innovations in speech data mining for contact centers (CC). Please see the project web page http://bison-project.eu for more information about BISON.
Data is one of the crucial resources for the innovations planned in the BISON project: it is one of the main assets needed to perform big data speech analytics, which is itself one of the main goals of the project. In BISON, we have to take into account that we operate in a commercial environment, with customer data that has significant commercial value and is subject to severe legal limitations and usage restrictions.
As CC data is very sensitive and the BISON consortium needed data for public demonstration and dissemination activities, we collected a limited amount of simulated CC data that includes no true personal information. This data is commonly called “fake data”. This document accompanies the public release of this data and contains its technical description and terms of use.
The four languages most relevant to the CC partners (Czech, English, French and Spanish) were chosen. The CC partners EBOS, ComData and Telefonica Móviles (TME) prepared fake campaigns and recruited speakers among their employees to perform the calls. The collection followed several prepared scripts that resemble real calls and stay as close as possible to the business use cases in BISON.
Like the whole BISON project, the recording of the BISON simulated CC data strictly complied with applicable law.
The CC clients are played by known people who signed an informed consent form allowing the intended use of the speech data: “Recorded data will be used for the purposes of the BISON project and will contribute to the provision of better services to CC customers. The recordings may be used for both academic and commercial research and development, and may be made publicly available on the Internet in order to support international R&D community in the area of speech data mining.”
Special attention was paid to personal data, which could not be collected. At the same time, we required the data to be useful for BISON purposes, e.g. by keeping realistic formats for IDs, telephone numbers and addresses, and by covering several topics around CC operation and the use cases relevant to BISON. Procedures for generating fake personal data were therefore followed to ensure ethical and legal compliance; the most important points are summarized below:
● the speakers' personal data are never mentioned: fake data are used and combined with the most common data for the population of the given country (see details below);
● brand names are replaced by fake ones to avoid problems with real producers.
Non-trivial procedures were selected to generate fake personal data. In particular, we took care that:
● phone numbers, even if randomly generated, do not correspond even potentially to real customers – that is, they are not only unused today, but also unassignable tomorrow to a new customer, based on the current numbering schemes in the specific countries;
● names are general enough to prevent singling out an individual, even indirectly;
● addresses do not correspond to real ones, yet are realistic enough for the purpose; therefore, a suitable mix of street addresses, non-existing street numbers, and street/city coupling has been used for the purpose.
The following table shows the procedures for the generation of fake personal information for the public release data.
| | ComData | EBOS | TME |
| --- | --- | --- | --- |
| Fake identities | Most common names in the Czech Republic, owned by a high number of persons, e.g. Novak, Prochazka, Novotny | Most common first and last names in UK and France, e.g. James Johnson, Louise Dubois | Web application used to generate real structure of Spanish identities, but with numbers that do not exist nowadays and may not exist in the foreseeable future. http://www.aplicacionesinformaticas.com/programas/gratis/cif.php |
| Fake numbers | Randomly generated 9-digit numbers coupled with the (non-existing) prefix +579 | Altered area code, replacing the leading 0 with 1, obtaining a non-existing prefix | Altered area code, two-digits codes swapped (e.g. 93 → 39) so as to obtain a non-existing prefix |
| Fake addresses | Real street/square names, fake (non-existing) numbers, different city | Real street/square names, fake (non-existing) numbers, different city | Real street/square names, fake (non-existing) numbers, different city |
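The three phone-number alterations listed in the table can be illustrated with a minimal Python sketch. This is not the tooling used by the CC partners: the function names, the example input numbers and the choice of a seeded random generator are assumptions made for illustration only. The sketch also does not verify that the resulting prefix is really unassigned; that has to be checked against the numbering plan of the country concerned.

```python
import random


def fake_number_with_prefix(rng, prefix="+579", digits=9):
    """Random subscriber number behind a prefix assumed to be non-existing."""
    subscriber = "".join(str(rng.randrange(10)) for _ in range(digits))
    return prefix + subscriber


def replace_leading_zero(number):
    """Replace the leading 0 of the area code with 1."""
    if not number.startswith("0"):
        raise ValueError("expected a number starting with 0")
    return "1" + number[1:]


def swap_area_code(number):
    """Swap the two digits of a two-digit area code (e.g. 93 -> 39)."""
    if len(number) < 2:
        raise ValueError("expected at least a two-digit area code")
    return number[1] + number[0] + number[2:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the example repeatable
    print(fake_number_with_prefix(rng))
    print(replace_leading_zero("0123456789"))  # hypothetical input number
    print(swap_area_code("931234567"))         # hypothetical input number
```

Note that swapping has no effect on area codes with two identical digits, so such codes would need a different alteration.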
The following table summarizes the statistics of the collected data:
| Language | Czech | French | English | Spanish |
| --- | --- | --- | --- | --- |
| Produced by | ComData | EBOS | EBOS | TME |
| Speakers | 8 | 12 | 8 | 10 |
| of which males | 4 | 7 | 7 | 5 |
| of which females | 4 | 5 | 1 | 5 |
| Calls | 10 | 14 | 7 | 10 |
| Total duration | 30 min | 54.21 min | 26.78 min | 38.17 min |
The ZIP file contains 4 directories, each with one language.
bison_simulated_CC_data/
    spanish/
        wav/Script_01.wav … Script_10.wav - stereo WAV files with calls
        doc/README.txt - info file
        doc/01 Script Domiciliacion.docx …
        doc/010 Script Cambio de titular.docx - call scripts
    czech/
        wav/251796172stereo.wav - 251801300stereo.wav - stereo WAV files with calls
        txt/data for fake calls - information about addresses, numbers and names
    french/
        wav/252796179stereo.wav - 252796194.wav - stereo WAV files with calls
    english/
        wav/252796184stereo.wav - 252796199.wav - stereo WAV files with calls
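After unpacking, the number of calls and the total duration per language can be checked against the statistics table with a short script. The following is a minimal sketch under two assumptions: the archive is unpacked into a directory laid out as above, and the recordings are uncompressed PCM WAV files that Python's standard `wave` module can read. The function name `summarize_calls` is hypothetical.

```python
import wave
from pathlib import Path


def summarize_calls(root):
    """Return {language: (number of WAV files, total minutes)}.

    Assumes the layout <root>/<language>/wav/*.wav and PCM-encoded files.
    """
    summary = {}
    for lang_dir in sorted(Path(root).iterdir()):
        wav_dir = lang_dir / "wav"
        if not wav_dir.is_dir():
            continue  # skip anything that is not a language directory
        count, seconds = 0, 0.0
        for wav_path in sorted(wav_dir.glob("*.wav")):
            with wave.open(str(wav_path), "rb") as wav:
                # duration = number of frames / sampling rate
                seconds += wav.getnframes() / wav.getframerate()
            count += 1
        summary[lang_dir.name] = (count, seconds / 60.0)
    return summary


if __name__ == "__main__":
    for language, (calls, minutes) in summarize_calls("bison_simulated_CC_data").items():
        print(f"{language}: {calls} calls, {minutes:.2f} min")
```

The sampling rate is read from each file header rather than assumed, since this document does not state it.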
The simulated CC data is publicly available through the BISON public website. It can be used for all legitimate purposes including (but not limited to) academic research and development, industrial research and development, education, CC agent training, demonstration, testing of one's own or third-party speech analytics software, and serving as an example for similar data collections.
The BISON consortium, however, collects information on who downloads the data and for which purposes, and the data will be made available only after the required information has been filled in. Individuals, laboratories and companies interested in this data might be contacted with questionnaires, and eventually with business offers, after their lawful informed consent has been obtained.
In case of publication of results on BISON simulated CC data, you are kindly requested to acknowledge the EC funding and the BISON project by stating:
“Collection of BISON simulated CC data was funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645323. The data is available at http://bison-project.eu/data”.