Punctuation Restoration

Dataset can be used to train models to restore punctuation in the ASR recognition of texts read out loud.

Last updated on Mar 12, 2022

Dataset – WikiPunct

WikiPunct is a crowdsourced text and audio data set of Polish Wikipedia pages read out loud by Polish lectors. The dataset is divided into two parts:conversational(WikiTalks)and information (WikiNews). Over a hundred people were involved in the production of the audio component. The total length of audio data reaches almost thirty-six hours, including the test set. Steps were taken to balance the male-to-female ratio.

WikiPuncthas over thirty-two thousand texts and 1200 audio files, one thousand in the training set and two hundred in the test set. There is a transcript of automatically recognized speech and force-aligned text for each text. The details behind the data format and evaluation metrics are presented below in the respective sections.

You can learn more and download dataset here

Part 1. WikiTalks

Data scraped from Polish Wikipedia Talk pages. Talk pages, also known as discussion pages, are administration pages with editorial details and discussions for Wikipedia articles.. Talk pages were scrapped from the web using a list of article titles shared alongside Wikipedia dump archives.

Wikipedia Talk pages serve as conversational data. Here, users communicate with each other by writing comments. Vocabulary and punctuation errors are expected. This data set covers 20% of the spoken data.

Example:

wikitalks001948: Cóż za bzdury tu powypisywane! Fra Diavolo starał się nie dopuścić do upadku Republiki Partenopejskiej? Kto to wymyślił?! Człowiek ten był jednym z najżarliwszych wrogów francuskiej okupacji, a za zasługi w wypędzeniu Francuzów został mianowany pułkownikiem w królewskiej armii z prawdziwie królewską pensją. Bez niego wyzwolenie, nazywać to tak czy też nie, północnej części królestwa byłoby dużo trudniejsze, bo dysponował siłą kilku tysięcy sprawnych w boju i umiejętnie wziętych w karby rzezimieszków. Toteż armia Burbonów nie pokonywała go, jak to się twierdzi w artykule, lecz ściśle współpracowała. Redaktorów zachęcam do jak najszybszej korekty artykułu, bo aktualnie jest obrazą dla ambicji Wikipedii. 91.199.250.17 wikitalks008902: Stare wątki w dyskusji przeniosłem do archiwum. Od prawie roku dyskusja w nich nie była kontynuowana. Sławek Borewicz

Part 2. WikiNews

Wikinews is a free-content news wiki and a project of the Wikimedia Foundation. The site works through collaborative journalism. The data was scraped directly from wikinews dump archive. The overall text quality is high, but vocabulary and punctuation errors may occur. This data set covers 80% of the spoken data.

Example:

wikinews222361: Misja STS-127 promu kosmicznego Endeavour do Międzynarodowej Stacji Kosmicznej została przełożona ze względu na wyciek wodoru. Podczas procesu napełniania zewnętrznego zbiornika paliwem, część ciekłego wodoru przemieniła się w gaz i przedostała się do systemu odpowietrzania. System ten jest używany do bezpiecznego odprowadzania nadmiaru wodoru z platformy startowej 39A do Centrum Lotów Kosmicznych imienia Johna F. Kennedy'ego. Początek misji miał mieć miejsce dzisiaj, o godzinie 13:17. Ze względu jednak na awarię, najbliższa możliwa data startu wahadłowca to środa 17 czerwca, jednak na ten dzień NASA na Przylądku Canaveral zaplanowana wystrzelenie sondy kosmicznej Lunar Reconnaissance Orbiter. Misja może być zatem opóźniona do 20 czerwca, który jest ostatnią możliwą datą startu w tym miesiącu. W niedzielę odbędzie się spotkanie specjalistów NASA, na którym zostanie ustalona nowa data startu i dalszy plan misji STS-127.

Motivation

Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do not contain any punctuation or capitalization. In longer stretches of automatically recognized speech, the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of punctuation (PR) and capitalization restoration (CR) as a distinct natural language processing (NLP) task is to improve the legibility of ASR-generated text, and possibly other types of texts without punctuation. Aside from their intrinsic value, PR and CR may improve the performance of other NLP aspects such as Named Entity Recognition (NER), part-of-speech (POS) and semantic parsing or spoken dialog segmentation [2, 3]. As useful as it seems, It is hard to systematically evaluate PR on transcripts of conversational language; mainly because punctuation rules can be ambiguous even for originally written texts, and the very nature of naturally-occurring spoken language makes it difficult to identify clear phrase and sentence boundaries [4,5]. Given these requirements and limitations, a PR task based on a redistributable corpus of read speech was suggested. 1200 texts included in this collection (totaling over 240,000 words) were selected from two distinct sources: WikiNews and WikiTalks. Punctuation found in these sources should be approached with some reservation when used for evaluation: these are original texts and may contain some user-induced errors and bias. The texts were read out by over a hundred different speakers. Original texts with punctuation were forced-aligned with recordings and used as the ideal ASR output. The goal of the task is to provide a solution for restoring punctuation in the test set collated for this task. The test set consists of time-aligned ASR transcriptions of read texts from the two sources. Participants are encouraged to use both text-based and speech-derived features to identify punctuation symbols (e.g. multimodal framework [6]). In addition, the train set is accompanied by reference text corpora of WikiNews and WikiTalks data that can be used in training and fine-tuning punctuation models.

Deep Learning datasets

Punctuation Restoration

Dataset – WikiPunct

Part 1. WikiTalks

Part 2. WikiNews

Motivation

Related