
cs.SD 方向16篇,eess.AS 方向19篇


今日论文合集:cs.SD语音16篇,eess.AS音频处理19篇。本文经arXiv每日学术速递授权转载,微信公众号:arXiv_Daily。

cs.SD语音

【1】 Psychophysiology-aided Perceptually Fluent Speech Analysis of Children Who Stutter

标题:心理生理学辅助下的口吃儿知觉流利言语分析链接:https://arxiv.org/abs/2211.09089作者:Yi Xiao,Harshit Sharma,Victoria Tumanova,Asif Salekin

机构:Syracuse University备注:20 pages, 5 figures摘要:本文首次提出了一种新的方法,称为PASAD,检测变化的知觉流畅的语音声学的幼儿特别地,对感知流畅语音的分析使得能够识别被认为是口吃不流畅的根本原因的语音运动控制因素。

最近的研究表明,幼儿的言语产出,特别是那些口吃的儿童,可能会受到不利的影响,环境的生理唤醒本文的主要贡献是利用说话人的情境生理反应来实时有效地分析语音信号提出的PASAD方法采用超网络结构,利用生理参数提取时间语音重要性信息。

此外,一种新颖的非局部声谱图特征提取网络识别有意义的声学属性最后,序列网络利用声学属性和提取的时间语音重要性进行有效分类我们收集了73名学龄前口吃儿童(CWS)和不口吃儿童(CWNS)在不同条件下的言语和生理感知数据。

PASAD的独特体系结构使得能够可视化与CWS的流畅语音不同的语音属性,语音发音器)所提取的知识可以增强对儿童流利言语、言语运动控制(SMC)和口吃发展的理解综合评价结果表明,PASAD在不同条件下均优于现有的多模态基线方法,具有较强的表达能力和对说话人语音和生理的自适应性,泛化能力强,鲁棒性好,可在移动和可扩展设备上实时执行。

摘要:This first-of-its-kind paper presents a novel approach named PASAD that detects changes in perceptually fluent speech acoustics of young children. Particularly, analysis of perceptually fluent speech enables identifying the speech-motor-control factors that are considered as the underlying cause of stuttering disfluencies. Recent studies indicate that the speech production of young children, especially those who stutter, may get adversely affected by situational physiological arousal. A major contribution of this paper is leveraging the speaker's situational physiological responses in real-time to analyze the speech signal effectively. The presented PASAD approach adapts a Hyper-Network structure to extract temporal speech importance information leveraging physiological parameters. In addition, a novel non-local acoustic spectrogram feature extraction network identifies meaningful acoustic attributes. Finally, a sequential network utilizes the acoustic attributes and the extracted temporal speech importance for effective classification. We collected speech and physiological sensing data from 73 preschool-age children who stutter (CWS) and who don't stutter (CWNS) in different conditions. PASAD's unique architecture enables visualizing speech attributes distinct to a CWS's fluent speech and mapping them to the speaker's respective speech-motor-control factors (i.e., speech articulators). Extracted knowledge can enhance understanding of children's fluent speech, speech-motor-control (SMC), and stuttering development. Our comprehensive evaluation shows that PASAD outperforms state-of-the-art multi-modal baseline approaches in different conditions, is expressive and adaptive to the speaker's speech and physiology, generalizable, robust, and is real-time executable on mobile and scalable devices.
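As a rough illustration of the conditioning idea described above (a minimal sketch, not the authors' code: module names, dimensions, and the binary CWS/CWNS head are assumptions, and the hyper-network is simplified to a small network that emits per-frame importance weights from physiological signals):

```python
# Minimal sketch: a conditioning network driven by physiological parameters
# produces per-frame importance weights that gate acoustic features before a
# sequential classifier. All names/shapes are illustrative assumptions.
import torch
import torch.nn as nn

class PhysioGatedClassifier(nn.Module):
    def __init__(self, physio_dim=4, acoustic_dim=64, hidden=32):
        super().__init__()
        # Maps physiological signals to frame-wise importance weights.
        self.importance = nn.Sequential(
            nn.Linear(physio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # Sequential network over re-weighted acoustic frames.
        self.rnn = nn.GRU(acoustic_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # e.g. fluent CWS vs. CWNS

    def forward(self, acoustic, physio):
        # acoustic: (B, T, acoustic_dim), physio: (B, T, physio_dim)
        w = self.importance(physio)          # (B, T, 1) temporal importance
        weighted = acoustic * w              # gate acoustic frames
        _, h = self.rnn(weighted)
        return self.head(h[-1]), w

model = PhysioGatedClassifier()
logits, w = model(torch.randn(2, 100, 64), torch.randn(2, 100, 4))
print(logits.shape, w.shape)  # torch.Size([2, 2]) torch.Size([2, 100, 1])
```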

【2】 Avoid Overthinking in Self-Supervised Models for Speech Recognition标题:避免在用于语音识别的自监督模型中过度思考链接:https://arxiv.org/abs/2211.08989

作者:Dan Berrebbi,Brian Yan,Shinji Watanabe机构:Carnegie Mellon University摘要:自我监督学习(SSL)模型重塑了我们的语音、语言和视觉方法。

然而,它们的庞大规模以及它们的层与任务之间的不透明关系导致了缓慢的推理和网络过度思考,其中从大型模型的最后一层做出的预测比从中间层做出的预测更差早期退出(EE)策略可以通过动态减少某些样本的推理时间计算量来解决这两个问题。

虽然EE在视觉和语言分类任务中很流行,但在序列到序列语音识别(ASR)任务中,EE的使用较少,因为早期层的输出经常退化当语音SSL模型应用于分布外(OOD)数据时,这一挑战进一步加剧本文首先指出了SSL模型在ASR中的过度考虑。

然后,我们通过计算性能与速度权衡的最佳界限来推动EE的进一步研究为了接近这一界限,我们提出了两种新的ASR策略:(1)我们将最近提出的耐心策略应用于ASR;(2)设计了一种新的针对ASR的EE策略,其性能优于之前介绍的所有策略。

摘要:Self-supervised learning (SSL) models reshaped our approach to speech, language and vision. However their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically reducing computations at inference time for certain samples. Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate. This challenge is further compounded when speech SSL models are applied on out-of-distribution (OOD) data. This paper first shows that SSL models do overthinking in ASR. We then motivate further research in EE by computing an optimal bound for performance versus speed trade-offs. To approach this bound we propose two new strategies for ASR: (1) we adapt the recently proposed patience strategy to ASR; and (2) we design a new EE strategy specific to ASR that performs better than all strategies previously introduced.
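A minimal sketch of the patience-style early exit mentioned above, assuming a generic stack of encoder layers and a per-layer decode function; all names are hypothetical, and this is an illustration of the strategy rather than the paper's implementation:

```python
# Patience-based early exit (sketch): stop running deeper layers once the
# hypothesis decoded from successive intermediate layers has not changed for
# `patience` layers in a row.
def early_exit_decode(layers, decode, x, patience=2):
    """layers: list of callables layer(x) -> x; decode: callable hidden -> hypothesis."""
    prev_hyp, streak = None, 0
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        hyp = decode(x)
        streak = streak + 1 if hyp == prev_hyp else 1
        prev_hyp = hyp
        if streak >= patience:          # predictions stable -> exit early
            return hyp, depth
    return prev_hyp, len(layers)        # fell through: use the last layer

# Toy usage with identity "layers" and a trivial decoder.
hyp, depth = early_exit_decode([lambda v: v] * 12, lambda v: tuple(v), [1, 2, 3])
print(hyp, depth)  # (1, 2, 3) 2
```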

【3】 Is my automatic audio captioning system so bad? spider-max: a metric to  consider several caption candidates

标题:我的自动音频字幕系统有那么糟糕吗?Spider-max:考虑多个候选字幕的指标链接:https://arxiv.org/abs/2211.08983作者:Etienne Labbé,Thomas Pellegrini,Julien Pinquier

机构:IRIT, Université Paul Sabatier, CNRS, Toulouse, France备注:Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022), Nov 2022, Nancy, France

摘要:自动音频字幕(AAC)是一项旨在使用自然语言描述音频信号的任务AAC系统将音频信号作为输入,并输出自由形式的文本语句,称为字幕评估这样的系统并不是一件小事,因为有许多方式可以表达同一个想法出于这个原因,使用诸如BLEU、CIDEr、SPICE和SPIDEr之类的几个互补度量来将单个自动字幕与由人类注释器产生的一个或几个参考字幕进行比较。

然而,自动系统可产生若干字幕候选项,例如,在句子产生过程中使用一些随机性,或通过在用波束搜索解码期间考虑各种竞争的假设字幕如果我们考虑AAC系统的终端用户,呈现几个标题而不是单个标题似乎与提供某种多样性相关,类似于信息检索系统。

在这项工作中,我们探讨了在评估过程中考虑几个而不是一个预测字幕的可能性。为此,我们提出SPIDEr-max,这是一种在多个字幕候选项的得分中取最大SPIDEr值的度量。为了支持我们的指标,我们报告了在Clotho v2.1和AudioCaps上使用基于Transformer的系统进行的实验。

以AudioCaps为例,该系统达到的SPIDEr-max值(有5个候选项)接近SPIDEr人类参考评分。摘要:Automatic Audio Captioning (AAC) is the task that aims to describe an audio signal using natural language. AAC systems take as input an audio signal and output a free-form text sentence, called a caption. Evaluating such systems is not trivial, since there are many ways to express the same idea. For this reason, several complementary metrics, such as BLEU, CIDEr, SPICE and SPIDEr, are used to compare a single automatic caption to one or several captions of reference, produced by a human annotator. Nevertheless, an automatic system can produce several caption candidates, either using some randomness in the sentence generation process, or by considering the various competing hypothesized captions during decoding with beam-search, for instance. If we consider an end-user of an AAC system, presenting several captions instead of a single one seems relevant to provide some diversity, similarly to information retrieval systems. In this work, we explore the possibility to consider several predicted captions in the evaluation process instead of one. For this purpose, we propose SPIDEr-max, a metric that takes the maximum SPIDEr value among the scores of several caption candidates. To advocate for our metric, we report experiments on Clotho v2.1 and AudioCaps, with a transformer-based system. On AudioCaps for example, this system reached a SPIDEr-max value (with 5 candidates) close to the SPIDEr human score of reference.
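The SPIDEr-max computation itself is simple; the sketch below illustrates it, with `spider_score` standing in for a real SPIDEr implementation (not provided here, an assumption):

```python
# SPIDEr-max (sketch): score every caption candidate and keep the maximum.
def spider_max(candidates, references, spider_score):
    """candidates: captions for one audio clip; references: human captions;
    spider_score(candidate, references) -> float (a real SPIDEr scorer)."""
    return max(spider_score(c, references) for c in candidates)

# Toy usage with a dummy scorer that rewards word overlap with the references.
def dummy_score(c, refs):
    cw = set(c.split())
    return max(len(cw & set(r.split())) / max(len(cw | set(r.split())), 1) for r in refs)

caps = ["a dog barks loudly", "rain falls on a roof"]
refs = ["a dog is barking", "dog barking in the distance"]
print(spider_max(caps, refs, dummy_score))
```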

【4】 Rapid Connectionist Speaker Adaptation标题:快速连接主义说话人自适应链接:https://arxiv.org/abs/2211.08978作者:Michael Witbrock,Patrick Haffner

机构:School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Centre National des Etudes de Télécommunications, Route de Trégastel, Lannion, France

备注:None摘要:我们介绍了一个用于建模说话人可变性的系统SVCnet。专用于每种语音的编码器神经网络产生声学变化的低维模型,并且这些模型进一步组合成话音变化的总体模型。

描述了一种训练过程,其使该模型对发出的声音的依赖性最小化。使用训练的模型(SVCnet)和新说话人声音的简短的、不受约束的样本,系统产生说话人声音代码,该说话人声音代码可用于使识别系统适应新说话人而无需再训练。

介绍了一种将SVCnet与MS-TDNN识别器相结合的系统。摘要:We present SVCnet, a system for modelling speaker variability. Encoder Neural Networks specialized for each speech sound produce low dimensionality models of acoustical variation, and these models are further combined into an overall model of voice variability. A training procedure is described which minimizes the dependence of this model on which sounds have been uttered. Using the trained model (SVCnet) and a brief, unconstrained sample of a new speaker's voice, the system produces a Speaker Voice Code that can be used to adapt a recognition system to the new speaker without retraining. A system which combines SVCnet with an MS-TDNN recognizer is described.

【5】 Data Augmentation with Unsupervised Speaking Style Transfer for Speech  Emotion Recognition标题:语音情感识别中无监督语气转移的数据增强

链接:https://arxiv.org/abs/2211.08843作者:Leyuan Qu,Wei Wang,Taihao Li,Cornelius Weber,Stefan Wermter,Fuji Ren

机构:Department of Informatics, University of Hamburg摘要:目前,语音情感识别系统的性能主要受到大规模标注语料库的限制。

数据增强被认为是一种很有前途的方法,它借鉴了自动语音识别(ASR)的方法,例如,对语音的速度和音调进行扰动,或者利用生成式对抗网络生成情感语音。

本文提出了一种新的风格迁移模型EmoAug,该模型通过语义编码器和副语言编码器分别表示语言和非语言信息,以增强情感表达。另外,解码器通过以无监督方式调节上述两个信息流来重构语音信号。

一旦训练完成,EmoAug通过向副语言编码器输入不同的风格,丰富了不同韵律属性(如重音、节奏和强度)的情感语音表达。此外,我们还可以为每个类生成相似数量的样本,以解决数据不平衡问题。

在IEMOCAP数据集上的实验结果表明,EmoAug能够成功地转换不同的说话风格,同时保留说话人身份和语义内容。此外,利用EmoAug增强的数据训练SER模型,结果表明,该模型不仅优于现有的监督和自监督方法,而且克服了数据不平衡导致的过拟合问题。一些音频样本可以在我们的演示网站上找到。

摘要:Currently, the performance of Speech Emotion Recognition (SER) systems is mainly constrained by the absence of large-scale labelled corpora. Data augmentation is regarded as a promising approach, which borrows methods from Automatic Speech Recognition (ASR), for instance, perturbation on speed and pitch, or generating emotional speech utilizing generative adversarial networks. In this paper, we propose EmoAug, a novel style transfer model to augment emotion expressions, in which a semantic encoder and a paralinguistic encoder represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech in different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. In addition, we can also generate similar numbers of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train a SER model with data augmented by EmoAug and show that it not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.

【6】 Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy  Environments标题:噪声环境下端到端语音识别系统的说话人自适应

链接:https://arxiv.org/abs/2211.08774作者:Dominik Wagner,Ilja Baumann,Sebastian P. Bayerl,Korbinian Riedhammer,Tobias Bocklet

机构:Technische Hochschule Nürnberg Georg Simon Ohm, Germany, Intel Labs备注:Submitted to ICASSP 2023

摘要:我们分析了基于Transformers和wav2vec 2.0的端到端架构在不同噪声条件下对说话人自适应的影响。我们证明,将说话人矢量连接到声学特征并将其作为辅助模型输入提供的已证明方法仍然是提高端到端架构鲁棒性的可行选择。

通过包括从x矢量和ECAPA-TDNN模型获得的说话人嵌入,我们在LibriSpeech上实现了高达9.6%的相对单词错误率改善,在Switchboard上实现了高达14.5%的相对单词错误率改善。对基于Transformer的架构的影响与信噪比(SNR)近似成反比,在强噪声环境($SNR=0$)中最强。

在基于wav2vec 2.0的系统中,说话人自适应的最大优势可以在中等噪声条件下实现($SNR\geq18$)。我们还发现,x向量往往比ECAPA-TDNN嵌入产生更大的改善。摘要:We analyze the impact of speaker adaptation in end-to-end architectures based on transformers and wav2vec 2.0 under different noise conditions. We demonstrate that the proven method of concatenating speaker vectors to the acoustic features and supplying them as an auxiliary model input remains a viable option to increase the robustness of end-to-end architectures. By including speaker embeddings obtained from x-vector and ECAPA-TDNN models, we achieve relative word error rate improvements of up to 9.6% on LibriSpeech and up to 14.5% on Switchboard. The effect on transformer-based architectures is approximately inversely proportional to the signal-to-noise ratio (SNR) and is strongest in heavily noised environments ($SNR=0$). The most substantial benefit of speaker adaption in systems based on wav2vec 2.0 can be achieved under moderate noise conditions ($SNR\geq18$). We also find that x-vectors tend to yield larger improvements than ECAPA-TDNN embeddings.
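A minimal sketch of the adaptation scheme described above: a fixed speaker embedding (e.g. an x-vector or ECAPA-TDNN embedding, extracted upstream) is broadcast over time and concatenated to the frame-level acoustic features; all shapes are illustrative assumptions:

```python
# Concatenate a per-utterance speaker embedding to every acoustic frame.
import torch

def add_speaker_embedding(features, spk_emb):
    """features: (B, T, F) acoustic features; spk_emb: (B, E) speaker embedding."""
    B, T, _ = features.shape
    spk = spk_emb.unsqueeze(1).expand(B, T, spk_emb.shape[-1])  # repeat per frame
    return torch.cat([features, spk], dim=-1)                   # (B, T, F + E)

x = add_speaker_embedding(torch.randn(4, 200, 80), torch.randn(4, 192))
print(x.shape)  # torch.Size([4, 200, 272])
```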

【7】 Streaming Joint Speech Recognition and Disfluency Detection标题:流媒体联合语音识别与不流畅检测链接:https://arxiv.org/abs/2211.08726

作者:Hayato Futami,Emiru Tsunoo,Kentaro Shibata,Yosuke Kashiwagi,Takao Okuda,Siddhant Arora,Shinji Watanabe

机构:Sony Group Corporation, Japan ,Carnegie Mellon University, USA摘要:作为语音识别的后处理,不流畅检测主要以流水线方式解决在本研究中,我们提出了基于Transformer的编码器-解码器模型,该模型联合解决了语音识别和不流畅检测,以流的方式工作。

与管道方法相比,联合模型可以利用声学信息,使得不流利检测对识别错误具有鲁棒性,并提供非语言线索此外,联合建模导致低延迟和轻量级推理我们研究用于流式不流畅检测的两个联合模型变体:转录富集模型和多任务模型在具有指示不流畅部分的起点和终点的特殊标签的文本上训练转录富集模型。

然而,它具有延迟和标准语言模型适应的问题,这是由附加的不流利标签引起的我们提出了一个多任务模型来解决这些问题,该模型在Transformer解码器上有两个输出层;一个用于语音识别,另一个用于不流畅检测它被建模为以具有附加令牌依赖机制的当前识别的令牌为条件。

在Switchboard和自发日语语料库上的实验结果表明,所提出的联合模型在准确率和延迟上都优于基于BERT的流水线方法摘要:Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors and provide non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional disfluency tags. We propose a multi-task model to solve such problems, which has two output layers at the Transformer decoder; one for speech recognition and the other for disfluency detection. It is modeled to be conditioned on the currently recognized token with an additional token-dependency mechanism. We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency, on both the Switchboard and the corpus of spontaneous Japanese.
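A minimal sketch of the multi-task variant's output side (not the authors' implementation; dimensions and vocabulary size are assumptions): one Transformer decoder state feeds two output layers, one for ASR tokens and one for a per-token disfluency tag, so both are produced in a single streaming pass:

```python
# Two output heads over a shared decoder state: ASR tokens + disfluency tags.
import torch
import torch.nn as nn

class JointASRDisfluencyHead(nn.Module):
    def __init__(self, d_model=256, vocab=5000, n_tags=2):
        super().__init__()
        self.asr_out = nn.Linear(d_model, vocab)    # speech recognition head
        self.disf_out = nn.Linear(d_model, n_tags)  # disfluency detection head

    def forward(self, dec_state):
        # dec_state: (B, U, d_model) decoder output for U emitted tokens
        return self.asr_out(dec_state), self.disf_out(dec_state)

head = JointASRDisfluencyHead()
asr_logits, disf_logits = head(torch.randn(2, 10, 256))
print(asr_logits.shape, disf_logits.shape)
```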

【8】 Conditional variational autoencoder to improve neural audio synthesis  for polyphonic music sound

标题:改进复调音乐声音神经音频合成的条件变分自动编码器链接:https://arxiv.org/abs/2211.08715作者:Seokjin Lee,Minhan Kim,Seunghyeon Shin,Daeho Lee,Inseon Jang,Wootaek Lim

机构:School of Electronics Engineering, Kyungpook National University, School of Electronic and Electrical Engineering, Kyungpook National University, Electronics and Telecommunications Research Institute

备注:5 pages, 6 figures摘要:用于音频合成的深度生成模型最近已经得到显著改进然而,对原始波形建模的任务仍然是一个困难的问题,尤其是对于音频波形和音乐信号近年来,实时音频变分自动编码器(RAVE)方法被开发用于高质量音频波形合成。

RAVE方法基于变分自动编码器,采用两阶段训练策略。不幸的是,RAVE模型在再现宽音高复音音乐声音方面受到限制。因此,为了提高重建性能,我们采用基音激活数据作为RAVE模型的辅助信息。为了处理辅助信息,我们提出了一种改进的RAVE模型,它具有条件变分自动编码器结构和附加的全连通层。

为了评估该结构,我们使用MAESTRO进行了一个基于多刺激测试的听力实验,测试中使用了隐藏参照和锚点(MUSHRA)实验结果表明,与传统的RAVE模型相比,该模型在性能和稳定性方面都有明显的改善摘要:Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw-waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and utilizes the two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt the pitch activation data as an auxiliary information to the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimulus tests with hidden references and an anchor (MUSHRA) with the MAESTRO. The obtained results indicate that the proposed model exhibits a more significant performance and stability improvement than the conventional RAVE model.

【9】 Exploring Detection-based Method For Speaker Diarization @ Ego4D  Audio-only Diarization Challenge 2022

标题:探索基于检测的说话人对象化方法@Ego4D纯音频对象化挑战2022链接:https://arxiv.org/abs/2211.08708作者:Jiahao Wang,Guo Chen,Yin-Dong Zheng,Tong Lu

机构:State Key Lab for Novel Software Technology, Nanjing University备注:2 pages摘要:我们在ECCV 2022中提供了Ego4D纯音频说话人日志挑战的技术报告。

说话人日志(diarization)以音频流为输入,根据说话人身份输出同质片段。它旨在解决"谁在何时发言"的问题。在本文中,我们探索了一种基于检测的方法来处理纯音频的说话人日志任务。该方法首先利用音频主干网络提取音频特征,然后将特征送入检测-生成网络,得到说话人建议(proposals)。

最后,经过后处理得到说话人日志结果。验证数据集验证了该方法的有效性,在测试数据集上我们的方法达到了53.85 DER。这些结果在2022年Ego4D纯音频说话人日志挑战排行榜上排名第三。摘要:We provide the technical report for Ego4D audio-only diarization challenge in ECCV 2022. Speaker diarization takes the audio streams as input and outputs the homogeneous segments according to the speaker's identity. It aims to solve the problem of "Who spoke when." In this paper, we explore a Detection-based method to tackle the audio-only speaker diarization task. Our method first extracts audio features by audio backbone and then feeds the feature to a detection-generate network to get the speaker proposals. Finally, after postprocessing, we can get the diarization results. The validation dataset validates this method, and our method achieves 53.85 DER on the test dataset. These results rank 3rd on the leaderboard of Ego4D audio-only diarization challenge 2022.

【10】 PBSM: Backdoor attack against Keyword spotting based on pitch boosting  and sound masking标题:PBSM:基于基音提升和声音掩蔽的关键词定位后门攻击

链接:https://arxiv.org/abs/2211.08697作者:Hanbo Cai,Pengcheng Zhang,Hai Dong,Yan Xiao,Shunhui Ji机构:College of Computer and Information, Hohai University, Nanjing, China,  School of Computing Technologies, RMIT University, Melbourne, Australia,  School of Computing, National University of Singapore, Singapore

备注:5 pages, 4 figures摘要:关键词识别(Keyword Spotting,KWS)已被广泛应用于各种语音控制场景中KWS的训练通常基于深度神经网络,需要大量的数据制造商经常使用第三方数据来训练KWS。

然而,深度神经网络对于制造商来说解释性不够,攻击者可以在模型训练期间操纵第三方训练数据来植入后门有效的后门攻击可以迫使模型在某些条件下做出指定的判断,即,触发器本文针对KWS系统设计了一种基于Pitch Boosting和Sound Masking的后门攻击方案PBSM。

实验结果表明,当中毒量小于1%的训练数据时,PBSM在3个受害者模型中的平均攻击成功率接近90%摘要:Keyword spotting (KWS) has been widely used in various speech control scenarios. The training of KWS is usually based on deep neural networks and requires a large amount of data. Manufacturers often use third-party data to train KWS. However, deep neural networks are not sufficiently interpretable to manufacturers, and attackers can manipulate third-party training data to plant backdoors during the model training. An effective backdoor attack can force the model to make specified judgments under certain conditions, i.e., triggers. In this paper, we design a backdoor attack scheme based on Pitch Boosting and Sound Masking for KWS, called PBSM. Experimental results demonstrated that PBSM is feasible to achieve an average attack success rate close to 90% in three victim models when poisoning less than 1% of the training data.
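For illustration only, a hedged sketch of what a pitch-boosting plus sound-masking trigger could look like when poisoning keyword clips; the parameter values (`n_steps`, `mask_gain`, `mask_len`) and the exact composition are assumptions, not the authors' recipe:

```python
# Sketch of a PBSM-style trigger: shift the pitch of a clip upward and overlay
# a short low-level masking noise. Poisoned copies would be relabeled to the
# attacker's target keyword before training.
import numpy as np
import librosa

def pbsm_trigger(y, sr, n_steps=4, mask_gain=0.02, mask_len=0.1, seed=0):
    """y: mono waveform, sr: sample rate; returns the triggered waveform."""
    boosted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # pitch boosting
    rng = np.random.default_rng(seed)
    n = min(int(mask_len * sr), len(boosted))
    mask = np.zeros_like(boosted)
    mask[:n] = mask_gain * rng.standard_normal(n)                     # sound-masking burst
    return np.clip(boosted + mask, -1.0, 1.0)

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.1 * np.sin(2 * np.pi * 220 * t)          # stand-in "keyword" audio
print(pbsm_trigger(y, sr).shape)
```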

【11】 Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral  Mapping for Single-channel Speech Enhancement

标题:利用异方差不确定性学习复谱映射进行单通道语音增强链接:https://arxiv.org/abs/2211.08624作者:Kuan-Lin Chen,Daniel D. E. Wong,Ke Tan,Buye Xu,Anurag Kumar,Vamsi Krishna Ithapu

机构:Meta Reality Labs Research, Department of Electrical and Computer Engineering, University of California, San Diego

备注:5 pages. Submitted to ICASSP 2023摘要:大多数语音增强(SE)模型学习点估计,并且在学习过程中不利用不确定性估计在这篇文章中,我们证明了通过最小化多元高斯负对数似然(NLL)来建模异方差不确定性可以在不增加额外成本的情况下提高SE性能。

在训练过程中,我们的方法用一个临时子模型来增强一个模型学习复谱映射,以预测每个时频点处的增强误差的协方差由于不受限制的异方差不确定性,协方差引入了采样不足效应,对SE性能不利为了减轻欠采样,我们的方法扩大了不确定性下限,并将每个损失分量与其不确定性进行加权,从而有效地补偿了严重欠采样的分量。

我们的多元设置揭示了常见的协方差假设,如标量和对角矩阵通过弱化这些假设,我们证明了NLL与包括均方误差(MSE)、平均绝对误差(MAE)和尺度不变信号失真比(SI-SDR)在内的常见损失相比,实现了更好的性能。

摘要:Most speech enhancement (SE) models learn a point estimate, and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular losses including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
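A minimal sketch of training with a heteroscedastic Gaussian NLL, under a simplifying diagonal-covariance assumption (the paper works with multivariate covariances per time-frequency bin; here the clamp loosely plays the role of the inflated uncertainty lower bound, and all shapes are assumptions):

```python
# Heteroscedastic Gaussian NLL (sketch): the model predicts the enhanced
# complex spectrogram plus a log-variance per T-F bin, trained with NLL
# instead of a plain MSE.
import torch

def heteroscedastic_nll(pred, target, log_var, min_log_var=-6.0):
    """pred/target: (B, F, T, 2) real+imag spectrograms; log_var: (B, F, T, 1)."""
    log_var = log_var.clamp(min=min_log_var)        # lower-bound the uncertainty
    err2 = (pred - target).pow(2).sum(dim=-1, keepdim=True)
    return 0.5 * (err2 / log_var.exp() + log_var).mean()

pred, target = torch.randn(2, 257, 100, 2), torch.randn(2, 257, 100, 2)
log_var = torch.zeros(2, 257, 100, 1)
print(heteroscedastic_nll(pred, target, log_var))  # reduces to a 0.5*MSE-like value here
```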

【12】 McNet: Fuse Multiple Cues for Multichannel Speech Enhancement标题:McNet:融合多线索进行多通道语音增强链接:https://arxiv.org/abs/2211.08872

作者:Yujie Yang,Changsheng Quan,Xiaofei Li机构:Zhejiang University, Hangzhou, China,  Westlake University & Westlake Institute for Advanced Study, Hangzhou, China

备注:submitted to icassp 2023摘要:在多通道语音增强中,频谱和空间信息对于区分语音和噪声是至关重要的如何充分利用这两类信息及其时间动态性仍然是一个有趣的研究问题针对这一问题,提出了一种多线索融合网络McNet,该网络由4个模块级联而成,分别利用全波段空间信息、窄带空间信息、子波段光谱信息和全波段光谱信息。

实验结果表明,该网络中的每个模块都有其独特的贡献,整体上明显优于其他现有方法摘要:In multichannel speech enhancement, both spectral and spatial information are vital for discriminating between speech and noise. How to fully exploit these two types of information and their temporal dynamics remains an interesting research problem. As a solution to this problem, this paper proposes a multi-cue fusion network named McNet, which cascades four modules to respectively exploit the full-band spatial, narrow-band spatial, sub-band spectral, and full-band spectral information. Experiments show that each module in the proposed network has its unique contribution and, as a whole, notably outperforms other state-of-the-art methods.

【13】 Delivering Speaking Style in Low-resource Voice Conversion with  Multi-factor Constraints标题:多因素约束下低资源语音转换中的语体传递

链接:https://arxiv.org/abs/2211.08857作者:Zhichao Wang,Xinsheng Wang,Lei Xie,Yuanzhe Chen,Qiao Tian,Yuping Wang

机构:School of Computer Science, Northwestern Polytechnical University, Xi’an, China, Speech, Audio & Music Intelligence (SAMI), ByteDance

备注:Submitted to ICASSP 2023摘要:在语音转换中,既要表达语言内容,又要保持源语音的语调、情感等说话风格。然而,在低资源的情况下,只有有限的来自目标说话人的话语是可访问的,现有的VC方法很难满足这一要求并捕获目标说话人的音色。

针对低资源VC任务,提出了一种新的VC模型MFC-StyleVC提出了一种基于聚类的说话人音色约束,用于指导目标说话人音色学习的不同阶段同时,为了防止对目标说话人的有限数据的过拟合,感知正则化约束显式地保持模型在特定方面的性能,包括说话风格、语言内容和语音质量。

此外,引入仿真模式模拟推理过程,以缓解训练与推理之间的不匹配。在高表达语音上的大量实验证明了该方法在低资源VC环境下的优越性。摘要:Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource situation, where only limited utterances from the target speaker are accessible, existing VC methods are hard to meet this requirement and capture the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, speaker timbre constraint generated by clustering method is newly proposed to guide target speaker timbre learning in different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. Besides, a simulation mode is introduced to simulate the inference process to alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.

【14】 Annotation of Soft Onsets in String Ensemble Recordings标题:弦乐录音中软开头的注解链接:https://arxiv.org/abs/2211.08848

作者:Maciej Tomczak,Min Susan Li,Adrian Bradbury,Mark Elliott,Ryan Stables,Maria Witek,Tom Goodman,Diar Abdlkarim,Massimiliano Di Luca,Alan Wing,Jason Hockman

机构:Birmingham City University, UK; University of Birmingham; University of Warwick

摘要:开始检测是识别音频记录内的音符事件的开始点的过程虽然打击乐起始点的检测通常被认为是一个已解决的问题,但弦乐器录音中发现的柔和起始点仍然对最先进的算法提出了重大挑战包含专家注释和与用于管理弦乐器的软开始注释的最佳实践相关的研究的数据的缺乏进一步加剧了该问题。

为此,我们研究了24名参与者之间的注释者之间的一致性,扩展了一种确定最一致注释者的算法,并比较了人类注释者和最先进的开始检测算法的性能实验结果表明,与自动系统相比,音乐体验与注释者间的一致性和性能之间存在积极的趋势。

此外,发现由指法的变化以及来自大提琴的变化所产生的首奏对于人类注释者和自动方法都是特别具有挑战性的为了促进软发作注释最佳实践的研究,我们公开了与本研究相关的所有实验数据此外,我们还发布了ARME Virtuoso Strings数据集,其中包含超过144张海顿弦乐四重奏Op. 74 No. 1 Finale的专业演奏录音,每张录音都有相应的乐器开始注释。

摘要:Onset detection is the process of identifying the start points of musical note events within an audio recording. While the detection of percussive onsets is often considered a solved problem, soft onsets-as found in string instrument recordings-still pose a significant challenge for state-of-the-art algorithms. The problem is further exacerbated by a paucity of data containing expert annotations and research related to best practices for curating soft onset annotations for string instruments. To this end, we investigate inter-annotator agreement between 24 participants, extend an algorithm for determining the most consistent annotator, and compare the performance of human annotators and state-of-the-art onset detection algorithms. Experimental results reveal a positive trend between musical experience and both inter-annotator agreement and performance in comparison with automated systems. Additionally, onsets produced by changes in fingering as well as those from the cello were found to be particularly challenging for both human annotators and automatic approaches. To promote research in best practices for annotation of soft onsets, we have made all experimental data associated with this study publicly available. In addition, we publish the ARME Virtuoso Strings dataset, consisting of over 144 recordings of professional performances of an excerpt from Haydn's string quartet Op. 74 No. 1 Finale, each with corresponding individual instrumental onset annotations.

【15】 Array Configuration-Agnostic Personalized Speech Enhancement using  Long-Short-Term Spatial Coherence

标题:基于长短期空间相干的阵列结构不可知个性化语音增强链接:https://arxiv.org/abs/2211.08748作者:Yicheng Hsu,Yonghan Lee,Mingsian R. Bai

摘要:个性化语音增强一直是抑制诸如竞争说话者或电视对话之类的类语音干扰的活跃研究领域与单通道方法相比,通过利用麦克风信号中的空间信息,多通道PSE系统可以在不利的声学条件下更有效然而,实施多通道PSE以适应家庭应用中的各种阵列拓扑结构可能具有挑战性。

为了开发一个阵列结构不可知的PSE系统,我们定义了一个称为长短期空间相干性的空间特征作为卷积递归网络的输入特征,以监测目标说话人的语音活动作为另一改进,等效矩形带宽缩放LSTSC特征可用于减少计算成本通过实验比较了所提出的PSE系统,包括完整版本和简化版本,其中两个基线使用不可见的房间响应和阵列配置,在存在电视噪声和竞争扬声器的情况下。

实验结果表明,采用LSTSC特征训练的多通道PSE网络在不需要精确了解阵列配置和房间响应的情况下,能够实现较好的增强性能摘要:Personalized speech enhancement has been a field of active research for suppression of speechlike interferers such as competing speakers or TV dialogues. Compared with single channel approaches, multichannel PSE systems can be more effective in adverse acoustic conditions by leveraging the spatial information in microphone signals. However, the implementation of multichannel PSEs to accommodate a wide range of array topology in household applications can be challenging. To develop an array configuration agnostic PSE system, we define a spatial feature termed the long short term spatial coherence as the input feature to a convolutional recurrent network to monitor the voice activity of the target speaker. As another refinement, an equivalent rectangular bandwidth scaled LSTSC feature can be used to reduce the computational cost. Experiments were conducted to compare the proposed PSE systems, including the complete and the simplified versions with two baselines using unseen room responses and array configurations in the presence of TV noise and competing speakers. The results demonstrated that the proposed multichannel PSE network trained with the LSTSC feature achieved superior enhancement performance without precise knowledge of the array configurations and room responses.
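A hedged sketch of a long/short-term inter-microphone coherence feature in the spirit of the LSTSC described above; the smoothing constants and the stacking of two time scales are assumptions, not the paper's exact definition:

```python
# Smoothed inter-microphone coherence at two time scales (sketch).
import numpy as np

def smoothed_coherence(X1, X2, alpha):
    """X1, X2: (F, T) complex STFTs of two microphones; alpha: smoothing factor."""
    F, T = X1.shape
    S11 = np.zeros((F, T)); S22 = np.zeros((F, T)); S12 = np.zeros((F, T), complex)
    s11 = np.full(F, 1e-8); s22 = np.full(F, 1e-8); s12 = np.zeros(F, complex)
    for t in range(T):
        s11 = alpha * s11 + (1 - alpha) * np.abs(X1[:, t]) ** 2
        s22 = alpha * s22 + (1 - alpha) * np.abs(X2[:, t]) ** 2
        s12 = alpha * s12 + (1 - alpha) * X1[:, t] * np.conj(X2[:, t])
        S11[:, t], S22[:, t], S12[:, t] = s11, s22, s12
    return np.abs(S12) / np.sqrt(S11 * S22 + 1e-12)

X1 = np.random.randn(257, 50) + 1j * np.random.randn(257, 50)
X2 = np.random.randn(257, 50) + 1j * np.random.randn(257, 50)
lstsc = np.stack([smoothed_coherence(X1, X2, 0.7),    # short-term smoothing
                  smoothed_coherence(X1, X2, 0.98)])  # long-term smoothing
print(lstsc.shape)  # (2, 257, 50)
```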

【16】 Hybrid Transformers for Music Source Separation标题:用于音乐源分离的混合变形器链接:https://arxiv.org/abs/2211.08553

作者:Simon Rouard,Francisco Massa,Alexandre Défossez机构:Meta AI摘要:在音乐源分离(MSS)中出现的自然问题是长距离上下文信息是否有用,或者局部声学特征是否足够。

在其他领域,基于注意力的Transformer已经显示出它们在长序列上整合信息的能力。本文提出了一种混合变换器Demucs(Hybrid Transformer Demucs,HT Demucs),它是一种基于Hybrid Demucs的混合时间/谱双U网,其中最内层由一个跨域的Transformer Encoder代替,在一个域内使用自注意,在跨域使用交叉注意。

虽然当仅在MUSDB上训练时,它的表现很差,但我们表明,当使用800首额外训练歌曲时,它比混合Demucs(在相同数据上训练)的SDR高出0.45 dB。

利用稀疏注意核扩展其感受野,并对每个源进行微调,我们在额外训练数据的情况下在MUSDB上获得了最先进的结果,SDR为9.20 dB。

摘要:A natural question arising in Music Source Separation (MSS) is whether long range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention based Transformers have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), an hybrid temporal/spectral bi-U-Net based on Hybrid Demucs, where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains. While it performs poorly when trained only on MUSDB, we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with 9.20 dB of SDR.
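A minimal sketch of the cross-domain attention idea (assumed shapes and layer layout, not the actual HT Demucs code): each branch applies self-attention over its own tokens and cross-attention that queries the other branch:

```python
# Cross-domain Transformer block (sketch): self-attention within each domain,
# cross-attention across the temporal and spectral branches.
import torch
import torch.nn as nn

class CrossDomainBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.self_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, temporal, spectral):
        # temporal: (B, Tt, D) waveform-branch tokens; spectral: (B, Ts, D) STFT-branch tokens
        t = temporal + self.self_t(temporal, temporal, temporal)[0]   # self-attention (time)
        s = spectral + self.self_s(spectral, spectral, spectral)[0]   # self-attention (spectral)
        t = t + self.cross_t(t, s, s)[0]                              # time queries spectral
        s = s + self.cross_s(s, t, t)[0]                              # spectral queries time
        return t, s

blk = CrossDomainBlock()
t, s = blk(torch.randn(1, 200, 128), torch.randn(1, 300, 128))
print(t.shape, s.shape)
```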

eess.AS音频处理【1】 McNet: Fuse Multiple Cues for Multichannel Speech Enhancement标题:McNet:融合多线索进行多通道语音增强链接

:https://arxiv.org/abs/2211.08872* 与cs.SD语音【12】为同一篇作者:Yujie Yang,Changsheng Quan,Xiaofei Li机构:Zhejiang University, Hangzhou, China,  Westlake University & Westlake Institute for Advanced Study, Hangzhou, China

备注:submitted to icassp 2023摘要:在多通道语音增强中,频谱和空间信息对于区分语音和噪声是至关重要的如何充分利用这两类信息及其时间动态性仍然是一个有趣的研究问题针对这一问题,提出了一种多线索融合网络McNet,该网络由4个模块级联而成,分别利用全波段空间信息、窄带空间信息、子波段光谱信息和全波段光谱信息。

实验结果表明,该网络中的每个模块都有其独特的贡献,整体上明显优于其他现有方法摘要:In multichannel speech enhancement, both spectral and spatial information are vital for discriminating between speech and noise. How to fully exploit these two types of information and their temporal dynamics remains an interesting research problem. As a solution to this problem, this paper proposes a multi-cue fusion network named McNet, which cascades four modules to respectively exploit the full-band spatial, narrow-band spatial, sub-band spectral, and full-band spectral information. Experiments show that each module in the proposed network has its unique contribution and, as a whole, notably outperforms other state-of-the-art methods.

【2】 Delivering Speaking Style in Low-resource Voice Conversion with  Multi-factor Constraints标题:多因素约束下低资源语音转换中的语体传递

链接:https://arxiv.org/abs/2211.08857* 与cs.SD语音【13】为同一篇作者:Zhichao Wang,Xinsheng Wang,Lei Xie,Yuanzhe Chen,Qiao Tian,Yuping Wang

机构:School of Computer Science, Northwestern Polytechnical University, Xi’an, China, Speech, Audio & Music Intelligence (SAMI), ByteDance

备注:Submitted to ICASSP 2023摘要:在语音转换中,既要表达语言内容,又要保持源语音的语调、情感等说话风格。然而,在低资源的情况下,只有有限的来自目标说话人的话语是可访问的,现有的VC方法很难满足这一要求并捕获目标说话人的音色。

针对低资源VC任务,提出了一种新的VC模型MFC-StyleVC提出了一种基于聚类的说话人音色约束,用于指导目标说话人音色学习的不同阶段同时,为了防止对目标说话人的有限数据的过拟合,感知正则化约束显式地保持模型在特定方面的性能,包括说话风格、语言内容和语音质量。

此外,引入仿真模式模拟推理过程,以缓解训练与推理之间的不匹配。在高表达语音上的大量实验证明了该方法在低资源VC环境下的优越性。摘要:Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource situation, where only limited utterances from the target speaker are accessible, existing VC methods are hard to meet this requirement and capture the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, speaker timbre constraint generated by clustering method is newly proposed to guide target speaker timbre learning in different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. Besides, a simulation mode is introduced to simulate the inference process to alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.

【3】 L2 proficiency assessment using self-supervised speech representations标题:使用自我监督语音表征的二语水平评估链接:https://arxiv.org/abs/2211.08849

作者:Stefano Bannò,Kate M. Knill,Marco Matassoni,Vyas Raina,Mark J. F. Gales机构:Fondazione Bruno Kessler, Trento, Italy,University of Trento, Trento, Italy, ALTA Institute, Cambridge University, UK,Enhanced Speech Technology Ltd., UK∗

摘要:近年来,对自动口语评估系统的需求不断增长。这一过程的标准流程是从语音识别系统开始,并衍生出利用转录和音频的特征,无论是手工制作的还是基于深度学习的。

虽然这些方法可以产生高性能系统,但是它们需要能够用于L2说话者的语音识别系统,并且优选地调谐到所部署的特定测试形式。最近提出了一种基于自监督语音表示的方案,不需要语音识别。

本研究将该方法的初步分析扩展到一个大规模的语言技能测试(Linguaskill),该测试包括多个部分,每个部分都被设计用于评估候选人口语水平的不同属性。

将自监督的wav2vec 2.0系统的性能与高性能手工制作的评估系统和基于BERT的文本系统进行比较,这两个系统都使用语音转录。虽然基于wav2vec 2.0的系统被发现对响应的性质敏感,但是它可以被配置为产生与需要语音转录的系统相当的性能,并且当与标准方法适当地组合时产生增益。

摘要:There has been a growing demand for automated spoken language assessment systems in recent years. A standard pipeline for this process is to start with a speech recognition system and derive features, either hand-crafted or based on deep-learning, that exploit the transcription and audio. Though these approaches can yield high performance systems, they require speech recognition systems that can be used for L2 speakers, and preferably tuned to the specific form of test being deployed. Recently a self-supervised speech representation based scheme, requiring no speech recognition, was proposed. This work extends the initial analysis conducted on this approach to a large scale proficiency test, Linguaskill, that comprises multiple parts, each designed to assess different attributes of a candidate's speaking proficiency. The performance of the self-supervised, wav2vec 2.0, system is compared to a high performance hand-crafted assessment system and a BERT-based text system both of which use speech transcriptions. Though the wav2vec 2.0 based system is found to be sensitive to the nature of the response, it can be configured to yield comparable performance to systems requiring a speech transcription, and yields gains when appropriately combined with standard approaches.
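A minimal sketch of a wav2vec 2.0-based grader under simplifying assumptions: frame-level self-supervised features (extracted upstream by a pretrained checkpoint, not shown here) are mean-pooled to an utterance representation and regressed to a proficiency score; all dimensions and layer sizes are assumptions:

```python
# Pooled SSL features -> proficiency score (sketch).
import torch
import torch.nn as nn

class ProficiencyGrader(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        self.regressor = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, ssl_features):
        # ssl_features: (B, T, feat_dim) frame-level self-supervised representations
        pooled = ssl_features.mean(dim=1)      # utterance-level summary
        return self.regressor(pooled).squeeze(-1)

scores = ProficiencyGrader()(torch.randn(8, 300, 768))
print(scores.shape)  # torch.Size([8])
```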

【4】 Annotation of Soft Onsets in String Ensemble Recordings标题:弦乐录音中软开头的注解链接:https://arxiv.org/abs/2211.08848

* 与cs.SD语音【14】为同一篇作者:Maciej Tomczak,Min Susan Li,Adrian Bradbury,Mark Elliott,Ryan Stables,Maria Witek,Tom Goodman,Diar Abdlkarim,Massimiliano Di Luca,Alan Wing,Jason Hockman

机构:Birmingham City University, UK; University of Birmingham; University of Warwick

摘要:开始检测是识别音频记录内的音符事件的开始点的过程虽然打击乐起始点的检测通常被认为是一个已解决的问题,但弦乐器录音中发现的柔和起始点仍然对最先进的算法提出了重大挑战包含专家注释和与用于管理弦乐器的软开始注释的最佳实践相关的研究的数据的缺乏进一步加剧了该问题。

为此,我们研究了24名参与者之间的注释者之间的一致性,扩展了一种确定最一致注释者的算法,并比较了人类注释者和最先进的开始检测算法的性能实验结果表明,与自动系统相比,音乐体验与注释者间的一致性和性能之间存在积极的趋势。

此外,发现由指法的变化以及来自大提琴的变化所产生的首奏对于人类注释者和自动方法都是特别具有挑战性的为了促进软发作注释最佳实践的研究,我们公开了与本研究相关的所有实验数据此外,我们还发布了ARME Virtuoso Strings数据集,其中包含超过144张海顿弦乐四重奏Op. 74 No. 1 Finale的专业演奏录音,每张录音都有相应的乐器开始注释。

摘要:Onset detection is the process of identifying the start points of musical note events within an audio recording. While the detection of percussive onsets is often considered a solved problem, soft onsets-as found in string instrument recordings-still pose a significant challenge for state-of-the-art algorithms. The problem is further exacerbated by a paucity of data containing expert annotations and research related to best practices for curating soft onset annotations for string instruments. To this end, we investigate inter-annotator agreement between 24 participants, extend an algorithm for determining the most consistent annotator, and compare the performance of human annotators and state-of-the-art onset detection algorithms. Experimental results reveal a positive trend between musical experience and both inter-annotator agreement and performance in comparison with automated systems. Additionally, onsets produced by changes in fingering as well as those from the cello were found to be particularly challenging for both human annotators and automatic approaches. To promote research in best practices for annotation of soft onsets, we have made all experimental data associated with this study publicly available. In addition, we publish the ARME Virtuoso Strings dataset, consisting of over 144 recordings of professional performances of an excerpt from Haydn's string quartet Op. 74 No. 1 Finale, each with corresponding individual instrumental onset annotations.

【5】 On using the UA-Speech and TORGO databases to validate automatic  dysarthric speech classification approaches

标题:利用UA-Speech和Torgo数据库验证自动构音障碍语音分类方法链接:https://arxiv.org/abs/2211.08833作者:Guilherme Schu,Parvaneh Janbakhshi,Ina Kodrasi

机构:Idiap Research Institute, Martigny, Switzerland; École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Bayer AG, Berlin, Germany

备注:Submitted to ICASSP 2023摘要:虽然UA-Speech和TORGO控制和构音障碍语音数据库是研究团体开发鲁棒自动语音识别系统的宝贵资源,但它们也被用于验证相当数量的构音障碍语音自动分类方法。

这样的方法通常依赖于潜在的假设,即来自控制说话者和构音障碍说话者的记录是使用相同的记录设置在相同的无噪声环境中收集的在本文中,我们证明了这一假设在UA-Speech和TORGO数据库中是违反的使用声音活动检测来提取语音和非语音段,我们表明,大多数最先进的构音障碍分类方法在使用这些数据库的非语音段时比使用语音段时实现了相同或明显更好的性能。

这些结果表明,在UA-Speech和TORGO数据库上训练和验证的这些方法是潜在的记录环境或设置的学习特征,而不是构音障碍语音特征我们希望这些结果能提高研究界对发展和评估构音障碍自动分类方法时记录质量重要性的认识。

摘要:Although the UA-Speech and TORGO databases of control and dysarthric speech are invaluable resources made available to the research community with the objective of developing robust automatic speech recognition systems, they have also been used to validate a considerable number of automatic dysarthric speech classification approaches. Such approaches typically rely on the underlying assumption that recordings from control and dysarthric speakers are collected in the same noiseless environment using the same recording setup. In this paper, we show that this assumption is violated for the UA-Speech and TORGO databases. Using voice activity detection to extract speech and non-speech segments, we show that the majority of state-of-the-art dysarthria classification approaches achieve the same or a considerably better performance when using the non-speech segments of these databases than when using the speech segments. These results demonstrate that such approaches trained and validated on the UA-Speech and TORGO databases are potentially learning characteristics of the recording environment or setup rather than dysarthric speech characteristics. We hope that these results raise awareness in the research community about the importance of the quality of recordings when developing and evaluating automatic dysarthria classification approaches.

【6】 Structural Segmentation and Labeling of Tabla Solo Performances标题:Tabla独奏表演的结构切分与标注链接:https://arxiv.org/abs/2211.08790

作者:Gowriprasad R,R Aravind,Hema A Murthy机构:Department of Electrical Engineering, IIT Madras, Chennai, Department of Computer Science and Engineering, IIT Madras, Chennai, India

备注:35 pages, 11 figures摘要:塔布拉(Tabla)是北印度的一种打击乐器,用作伴奏和独奏的专用乐器塔布拉独奏曲错综复杂,通过一系列具有共同节奏特征的同质部分来展示节奏的演变印度次大陆的塔布拉学习和演奏是基于被称为gharana-s的风格流派。

不同的作曲家的作品在每个部分都有演奏本文旨在探讨如何将塔布拉独奏音乐会分割成具有音乐意义的片段然后,我们分配合适的部分标签,并从部分中识别gharana-s我们提出了一个多样化的收集超过38小时的独奏塔布拉录音的任务。

我们激发问题,提出任务的不同挑战和方面受塔布拉独奏独特音乐特性的启发,我们计算了几个节奏和音色特征用于分割任务本工作探索了一种通过分析局部自相似性来自动定位节奏结构中的显著变化的方法我们还探索了监督随机森林和卷积神经网络训练的手工制作的特点。

监督和非监督方法也在一组保持的记录上进行测试将音频片段分割成其结构组件并进行标记对于许多音乐信息检索应用(例如重复结构查找、音频摘要和快速音乐导航)是至关重要的这部作品帮助我们获得了对塔布拉独奏音乐会的全面音乐描述。

摘要:Tabla is a North Indian percussion instrument used as an accompaniment and an exclusive instrument for solo performances. Tabla solo is intricate and elaborate, exhibiting rhythmic evolution through a sequence of homogeneous sections marked by shared rhythmic characteristics. Each section has a specific structure and name associated with it. Tabla learning and performance in the Indian subcontinent is based on stylistic schools called gharana-s. Several compositions by various composers from different gharana-s are played in each section. This paper addresses the task of segmenting the tabla solo concert into musically meaningful sections. We then assign suitable section labels and recognize gharana-s from the sections. We present a diverse collection of over 38 hours of solo tabla recordings for the task. We motivate the problem and present different challenges and facets of the tasks. Inspired by the distinct musical properties of tabla solo, we compute several rhythmic and timbral features for the segmentation task. This work explores the approach of automatically locating the significant changes in the rhythmic structure by analyzing local self-similarity in an unsupervised manner. We also explore supervised random forest and a convolutional neural network trained on hand-crafted features. Both supervised and unsupervised approaches are also tested on a set of held-out recordings. Segmentation of an audio piece into its structural components and labeling is crucial to many music information retrieval applications like repetitive structure finding, audio summarization, and fast music navigation. This work helps us obtain a comprehensive musical description of the tabla solo concert.
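A minimal sketch of the unsupervised boundary-location step described above (Foote-style novelty on a self-similarity matrix); the feature choice and kernel size are assumptions, not the paper's exact setup:

```python
# Self-similarity novelty curve (sketch): peaks indicate section boundaries.
import numpy as np

def novelty_curve(features, kernel=16):
    """features: (T, D) per-frame rhythmic/timbral features; returns length-T novelty."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    ssm = f @ f.T                                   # cosine self-similarity matrix
    half = kernel // 2
    sign = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))  # checkerboard kernel
    nov = np.zeros(len(f))
    for t in range(half, len(f) - half):
        nov[t] = np.sum(ssm[t - half:t + half, t - half:t + half] * sign)
    return nov

# Toy example: two homogeneous "sections" with a change around frame 100.
feats = np.concatenate([np.random.randn(100, 12) + 3, np.random.randn(100, 12) - 3])
print(int(np.argmax(novelty_curve(feats))))  # peaks near the section change
```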

【7】 Array Configuration-Agnostic Personalized Speech Enhancement using  Long-Short-Term Spatial Coherence

标题:基于长短期空间相干的阵列结构不可知个性化语音增强链接:https://arxiv.org/abs/2211.08748* 与cs.SD语音【15】为同一篇作者:Yicheng Hsu,Yonghan Lee,Mingsian R. Bai

摘要:个性化语音增强一直是抑制诸如竞争说话者或电视对话之类的类语音干扰的活跃研究领域与单通道方法相比,通过利用麦克风信号中的空间信息,多通道PSE系统可以在不利的声学条件下更有效然而,实施多通道PSE以适应家庭应用中的各种阵列拓扑结构可能具有挑战性。

为了开发一个阵列结构不可知的PSE系统,我们定义了一个称为长短期空间相干性的空间特征作为卷积递归网络的输入特征,以监测目标说话人的语音活动作为另一改进,等效矩形带宽缩放LSTSC特征可用于减少计算成本通过实验比较了所提出的PSE系统,包括完整版本和简化版本,其中两个基线使用不可见的房间响应和阵列配置,在存在电视噪声和竞争扬声器的情况下。

实验结果表明,采用LSTSC特征训练的多通道PSE网络在不需要精确了解阵列配置和房间响应的情况下,能够实现较好的增强性能摘要:Personalized speech enhancement has been a field of active research for suppression of speechlike interferers such as competing speakers or TV dialogues. Compared with single channel approaches, multichannel PSE systems can be more effective in adverse acoustic conditions by leveraging the spatial information in microphone signals. However, the implementation of multichannel PSEs to accommodate a wide range of array topology in household applications can be challenging. To develop an array configuration agnostic PSE system, we define a spatial feature termed the long short term spatial coherence as the input feature to a convolutional recurrent network to monitor the voice activity of the target speaker. As another refinement, an equivalent rectangular bandwidth scaled LSTSC feature can be used to reduce the computational cost. Experiments were conducted to compare the proposed PSE systems, including the complete and the simplified versions with two baselines using unseen room responses and array configurations in the presence of TV noise and competing speakers. The results demonstrated that the proposed multichannel PSE network trained with the LSTSC feature achieved superior enhancement performance without precise knowledge of the array configurations and room responses.

【8】 Hybrid Transformers for Music Source Separation标题:用于音乐源分离的混合变形器链接:https://arxiv.org/abs/2211.08553

* 与cs.SD语音【16】为同一篇作者:Simon Rouard,Francisco Massa,Alexandre Défossez机构:Meta AI摘要:在音乐源分离(MSS)中出现的自然问题是长距离上下文信息是否有用,或者局部声学特征是否足够。

在其他领域,基于注意力的Transformers已经显示出它们在长序列上整合信息的能力本文提出了一种混合变换器Demucs(Hybrid Transformer Demucs,HT Demucs),它是一种基于Hybrid Demucs的混合时间/谱双U网,其中最内层由一个跨域的Transformer Encoder代替,在一个域内使用自注意,在跨域使用交叉注意。

虽然当仅在MUSDB上训练时,它的表现很差,但我们表明,当使用800首额外训练歌曲时,它比混合Demucs(在相同数据上训练)的SDR高出0.45 dB利用稀疏注意核扩展其感受野,并对每个源进行微调,我们在额外训练数据的情况下在MUSDB上获得了最先进的结果,SDR为9.20 dB。

摘要:A natural question arising in Music Source Separation (MSS) is whether long range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention based Transformers have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), an hybrid temporal/spectral bi-U-Net based on Hybrid Demucs, where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains. While it performs poorly when trained only on MUSDB, we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with 9.20 dB of SDR.

【9】 Psychophysiology-aided Perceptually Fluent Speech Analysis of Children  Who Stutter标题:心理生理学辅助下的口吃儿知觉流利言语分析

链接:https://arxiv.org/abs/2211.09089* 与cs.SD语音【1】为同一篇作者:Yi Xiao,Harshit Sharma,Victoria Tumanova,Asif Salekin

机构:Syracuse University备注:20 pages, 5 figures摘要:本文首次提出了一种新的方法,称为PASAD,检测变化的知觉流畅的语音声学的幼儿特别地,对感知流畅语音的分析使得能够识别被认为是口吃不流畅的根本原因的语音运动控制因素。

最近的研究表明,幼儿的言语产出,特别是那些口吃的儿童,可能会受到不利的影响,环境的生理唤醒本文的主要贡献是利用说话人的情境生理反应来实时有效地分析语音信号提出的PASAD方法采用超网络结构,利用生理参数提取时间语音重要性信息。

此外,一种新颖的非局部声谱图特征提取网络识别有意义的声学属性最后,序列网络利用声学属性和提取的时间语音重要性进行有效分类我们收集了73名学龄前口吃儿童(CWS)和不口吃儿童(CWNS)在不同条件下的言语和生理感知数据。

PASAD的独特体系结构使得能够可视化与CWS的流畅语音不同的语音属性,语音发音器)所提取的知识可以增强对儿童流利言语、言语运动控制(SMC)和口吃发展的理解综合评价结果表明,PASAD在不同条件下均优于现有的多模态基线方法,具有较强的表达能力和对说话人语音和生理的自适应性,泛化能力强,鲁棒性好,可在移动和可扩展设备上实时执行。

摘要:This first-of-its-kind paper presents a novel approach named PASAD that detects changes in perceptually fluent speech acoustics of young children. Particularly, analysis of perceptually fluent speech enables identifying the speech-motor-control factors that are considered as the underlying cause of stuttering disfluencies. Recent studies indicate that the speech production of young children, especially those who stutter, may get adversely affected by situational physiological arousal. A major contribution of this paper is leveraging the speaker's situational physiological responses in real-time to analyze the speech signal effectively. The presented PASAD approach adapts a Hyper-Network structure to extract temporal speech importance information leveraging physiological parameters. In addition, a novel non-local acoustic spectrogram feature extraction network identifies meaningful acoustic attributes. Finally, a sequential network utilizes the acoustic attributes and the extracted temporal speech importance for effective classification. We collected speech and physiological sensing data from 73 preschool-age children who stutter (CWS) and who don't stutter (CWNS) in different conditions. PASAD's unique architecture enables visualizing speech attributes distinct to a CWS's fluent speech and mapping them to the speaker's respective speech-motor-control factors (i.e., speech articulators). Extracted knowledge can enhance understanding of children's fluent speech, speech-motor-control (SMC), and stuttering development. Our comprehensive evaluation shows that PASAD outperforms state-of-the-art multi-modal baseline approaches in different conditions, is expressive and adaptive to the speaker's speech and physiology, generalizable, robust, and is real-time executable on mobile and scalable devices.

【10】 Avoid Overthinking in Self-Supervised Models for Speech Recognition标题:避免在用于语音识别的自监督模型中过度思考链接:https://arxiv.org/abs/2211.08989

* 与cs.SD语音【2】为同一篇作者:Dan Berrebbi,Brian Yan,Shinji Watanabe机构:Carnegie Mellon University摘要:自我监督学习(SSL)模型重塑了我们的语音、语言和视觉方法。

然而,它们的庞大规模以及它们的层与任务之间的不透明关系导致了缓慢的推理和网络过度思考,其中从大型模型的最后一层做出的预测比从中间层做出的预测更差早期退出(EE)策略可以通过动态减少某些样本的推理时间计算量来解决这两个问题。

虽然EE在视觉和语言分类任务中很流行,但在序列到序列语音识别(ASR)任务中,EE的使用较少,因为早期层的输出经常退化当语音SSL模型应用于分布外(OOD)数据时,这一挑战进一步加剧本文首先指出了SSL模型在ASR中的过度考虑。

然后,我们通过计算性能与速度权衡的最佳界限来推动EE的进一步研究为了接近这一界限,我们提出了两种新的ASR策略:(1)我们将最近提出的耐心策略应用于ASR;(2)设计了一种新的针对ASR的EE策略,其性能优于之前介绍的所有策略。

摘要:Self-supervised learning (SSL) models reshaped our approach to speech, language and vision. However their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically reducing computations at inference time for certain samples. Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate. This challenge is further compounded when speech SSL models are applied on out-of-distribution (OOD) data. This paper first shows that SSL models do overthinking in ASR. We then motivate further research in EE by computing an optimal bound for performance versus speed trade-offs. To approach this bound we propose two new strategies for ASR: (1) we adapt the recently proposed patience strategy to ASR; and (2) we design a new EE strategy specific to ASR that performs better than all strategies previously introduced.

【11】 Is my automatic audio captioning system so bad? spider-max: a metric to  consider several caption candidates

标题:我的自动音频字幕系统有那么糟糕吗?Spider-max:考虑多个候选字幕的指标链接:https://arxiv.org/abs/2211.08983* 与cs.SD语音【3】为同一篇作者:Etienne Labbé,Thomas Pellegrini,Julien Pinquier

机构:IRIT, Université Paul Sabatier, CNRS, Toulouse, France备注:Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022), Nov 2022, Nancy, France

摘要:自动音频字幕(AAC)是一项旨在使用自然语言描述音频信号的任务AAC系统将音频信号作为输入,并输出自由形式的文本语句,称为字幕评估这样的系统并不是一件小事,因为有许多方式可以表达同一个想法出于这个原因,使用诸如BLEU、CIDEr、SPICE和SPIDEr之类的几个互补度量来将单个自动字幕与由人类注释器产生的一个或几个参考字幕进行比较。

然而,自动系统可产生若干字幕候选项,例如,在句子产生过程中使用一些随机性,或通过在用波束搜索解码期间考虑各种竞争的假设字幕如果我们考虑AAC系统的终端用户,呈现几个标题而不是单个标题似乎与提供某种多样性相关,类似于信息检索系统。

在这项工作中,我们探讨了在评估过程中考虑几个而不是一个预测字幕的可能性。为此,我们提出SPIDEr-max,这是一种在多个字幕候选项的得分中取最大SPIDEr值的度量。为了支持我们的指标,我们报告了在Clotho v2.1和AudioCaps上使用基于Transformer的系统进行的实验。

以AudioCaps为例,该系统达到的SPIDEr-max值(有5个候选项)接近SPIDEr人类参考评分。摘要:Automatic Audio Captioning (AAC) is the task that aims to describe an audio signal using natural language. AAC systems take as input an audio signal and output a free-form text sentence, called a caption. Evaluating such systems is not trivial, since there are many ways to express the same idea. For this reason, several complementary metrics, such as BLEU, CIDEr, SPICE and SPIDEr, are used to compare a single automatic caption to one or several captions of reference, produced by a human annotator. Nevertheless, an automatic system can produce several caption candidates, either using some randomness in the sentence generation process, or by considering the various competing hypothesized captions during decoding with beam-search, for instance. If we consider an end-user of an AAC system, presenting several captions instead of a single one seems relevant to provide some diversity, similarly to information retrieval systems. In this work, we explore the possibility to consider several predicted captions in the evaluation process instead of one. For this purpose, we propose SPIDEr-max, a metric that takes the maximum SPIDEr value among the scores of several caption candidates. To advocate for our metric, we report experiments on Clotho v2.1 and AudioCaps, with a transformer-based system. On AudioCaps for example, this system reached a SPIDEr-max value (with 5 candidates) close to the SPIDEr human score of reference.

【12】 Rapid Connectionist Speaker Adaptation标题:快速连接主义说话人自适应链接:https://arxiv.org/abs/2211.08978* 与cs.SD语音【4

】为同一篇作者:Michael Witbrock,Patrick Haffner机构:School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Centre National des Etudes de Télécommunications, Route de Trégastel, Lannion, France

备注:None摘要:我们介绍了一个用于建模说话人可变性的系统SVCnet。专用于每种语音的编码器神经网络产生声学变化的低维模型,并且这些模型进一步组合成话音变化的总体模型。描述了一种训练过程,其使该模型对发出的声音的依赖性最小化。

使用训练的模型(SVCnet)和新说话人声音的简短的、不受约束的样本,系统产生说话人声音代码,该说话人声音代码可用于使识别系统适应新说话人而无需再训练。介绍了一种将SVCnet与MS-TDNN识别器相结合的系统。

摘要:We present SVCnet, a system for modelling speaker variability. Encoder Neural Networks specialized for each speech sound produce low dimensionality models of acoustical variation, and these models are further combined into an overall model of voice variability. A training procedure is described which minimizes the dependence of this model on which sounds have been uttered. Using the trained model (SVCnet) and a brief, unconstrained sample of a new speaker's voice, the system produces a Speaker Voice Code that can be used to adapt a recognition system to the new speaker without retraining. A system which combines SVCnet with an MS-TDNN recognizer is described.

【13】 Data Augmentation with Unsupervised Speaking Style Transfer for Speech  Emotion Recognition标题:语音情感识别中无监督语气转移的数据增强

链接:https://arxiv.org/abs/2211.08843* 与cs.SD语音【5】为同一篇作者:Leyuan Qu,Wei Wang,Taihao Li,Cornelius Weber,Stefan Wermter,Fuji Ren

机构:Department of Informatics,  University of Hamburg摘要:目前,语音情感识别系统的性能主要受到大规模标注语料库的限制数据增强被认为是一种很有前途的方法,它借鉴了自动语音识别(ASR)的方法,例如,对语音的速度和音调进行扰动,或者利用生成式对抗网络生成情感语音。

本文提出了一种新的风格迁移模型EmoAug,该模型通过语义编码器和副语言编码器分别表示语言和非语言信息,以增强情感表达另外,解码器通过以无监督方式调节上述两个信息流来重构语音信号一旦训练完成,EmoAug通过向副语言编码器输入不同的风格,丰富了不同韵律属性(如重音、节奏和强度)的情感语音表达。

此外,我们还可以为每个类生成相似数量的样本,以解决数据不平衡问题在IEMOCAP数据集上的实验结果表明,EmoAug能够成功地转换不同的说话风格,同时保留说话人身份和语义内容此外,利用EmoAug增强的数据训练SER模型,结果表明,该模型不仅优于现有的监督和自监督方法,而且克服了数据不平衡导致的过拟合问题.一些音频样本可以在我们的演示网站上找到。

摘要:Currently, the performance of Speech Emotion Recognition (SER) systems is mainly constrained by the absence of large-scale labelled corpora. Data augmentation is regarded as a promising approach, which borrows methods from Automatic Speech Recognition (ASR), for instance, perturbation on speed and pitch, or generating emotional speech utilizing generative adversarial networks. In this paper, we propose EmoAug, a novel style transfer model to augment emotion expressions, in which a semantic encoder and a paralinguistic encoder represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech in different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. In addition, we can also generate similar numbers of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train a SER model with data augmented by EmoAug and show that it not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.

【14】 Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy  Environments标题:噪声环境下端到端语音识别系统的说话人自适应

链接:https://arxiv.org/abs/2211.08774* 与cs.SD语音【6】为同一篇作者:Dominik Wagner,Ilja Baumann,Sebastian P. Bayerl,Korbinian Riedhammer,Tobias Bocklet

机构:Technische Hochschule Nürnberg Georg Simon Ohm, Germany, Intel Labs备注:Submitted to ICASSP 2023

摘要:我们分析了基于Transformers和wav2vec 2.0的端到端架构在不同噪声条件下对说话人自适应的影响。我们证明,将说话人矢量连接到声学特征并将其作为辅助模型输入提供的已证明方法仍然是提高端到端架构鲁棒性的可行选择。

通过包括从x矢量和ECAPA-TDNN模型获得的说话人嵌入,我们在LibriSpeech上实现了高达9.6%的相对单词错误率改善,在Switchboard上实现了高达14.5%的相对单词错误率改善。对基于Transformer的架构的影响与信噪比(SNR)近似成反比,在强噪声环境($SNR=0$)中最强。

在基于wav2vec 2.0的系统中,说话人自适应的最大优势可以在中等噪声条件下实现($SNR\geq18$)。我们还发现,x向量往往比ECAPA-TDNN嵌入产生更大的改善。摘要:We analyze the impact of speaker adaptation in end-to-end architectures based on transformers and wav2vec 2.0 under different noise conditions. We demonstrate that the proven method of concatenating speaker vectors to the acoustic features and supplying them as an auxiliary model input remains a viable option to increase the robustness of end-to-end architectures. By including speaker embeddings obtained from x-vector and ECAPA-TDNN models, we achieve relative word error rate improvements of up to 9.6% on LibriSpeech and up to 14.5% on Switchboard. The effect on transformer-based architectures is approximately inversely proportional to the signal-to-noise ratio (SNR) and is strongest in heavily noised environments ($SNR=0$). The most substantial benefit of speaker adaption in systems based on wav2vec 2.0 can be achieved under moderate noise conditions ($SNR\geq18$). We also find that x-vectors tend to yield larger improvements than ECAPA-TDNN embeddings.

【15】 Streaming Joint Speech Recognition and Disfluency Detection标题:流式联合语音识别与不流畅检测链接:https://arxiv.org/abs/2211.08726

* 与cs.SD语音【7】为同一篇作者:Hayato Futami,Emiru Tsunoo,Kentaro Shibata,Yosuke Kashiwagi,Takao Okuda,Siddhant Arora,Shinji Watanabe

机构:Sony Group Corporation, Japan, Carnegie Mellon University, USA摘要:不流畅检测此前主要作为语音识别的后处理、以流水线方式解决。在本研究中，我们提出了基于Transformer的编码器-解码器模型，以流式方式联合完成语音识别和不流畅检测。

与流水线方法相比，联合模型可以利用声学信息，使不流畅检测对识别错误更加鲁棒，并能提供非语言线索。此外，联合建模带来了低延迟和轻量化的推理。我们研究了用于流式不流畅检测的两种联合模型变体：转录富集模型和多任务模型。转录富集模型在带有特殊标签的文本上训练，这些标签标注了不流畅部分的起点和终点。

然而，附加的不流畅标签带来了延迟和标准语言模型自适应方面的问题。为了解决这些问题，我们提出了一个多任务模型，它在Transformer解码器上设置两个输出层：一个用于语音识别，另一个用于不流畅检测。该模型通过附加的令牌依赖机制，以当前已识别出的令牌为条件进行建模。

在Switchboard和日语自发语音语料库上的实验结果表明，所提出的联合模型在准确率和延迟上都优于基于BERT的流水线方法。

摘要:Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors and provide non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional disfluency tags. We propose a multi-task model to solve such problems, which has two output layers at the Transformer decoder; one for speech recognition and the other for disfluency detection. It is modeled to be conditioned on the currently recognized token with an additional token-dependency mechanism. We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency, on both the Switchboard and the corpus of spontaneous Japanese.
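
To make the multi-task variant concrete, the sketch below puts two output heads on top of a decoder state and conditions the disfluency head on the currently recognized token through a simple embedding. This token-dependency mechanism is a simplified stand-in for the one described in the abstract, and all layer sizes and the greedy token selection are hypothetical.

```python
import torch
import torch.nn as nn

class JointASRDisfluencyHead(nn.Module):
    """Sketch of a multi-task output: one head predicts the ASR token,
    the other classifies whether the currently recognized token is disfluent.
    """

    def __init__(self, d_model=256, vocab_size=5000, emb_dim=64):
        super().__init__()
        self.asr_head = nn.Linear(d_model, vocab_size)
        # Simple token-dependency mechanism: condition the disfluency head on an
        # embedding of the token just emitted by the ASR head.
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.disfluency_head = nn.Linear(d_model + emb_dim, 2)  # fluent / disfluent

    def forward(self, decoder_state: torch.Tensor):
        # decoder_state: (batch, time, d_model) from a streaming Transformer decoder
        asr_logits = self.asr_head(decoder_state)
        recognized = asr_logits.argmax(dim=-1)              # greedy token choice
        cond = self.token_emb(recognized)
        disf_logits = self.disfluency_head(torch.cat([decoder_state, cond], dim=-1))
        return asr_logits, disf_logits

heads = JointASRDisfluencyHead()
state = torch.randn(2, 20, 256)
asr_logits, disf_logits = heads(state)  # (2, 20, 5000), (2, 20, 2)
```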

【16】 Conditional variational autoencoder to improve neural audio synthesis  for polyphonic music sound

标题:改进复调音乐声音神经音频合成的条件变分自动编码器链接:https://arxiv.org/abs/2211.08715 * 与cs.SD语音【8】为同一篇作者:Seokjin Lee,Minhan Kim,Seunghyeon Shin,Daeho Lee,Inseon Jang,Wootaek Lim

机构:School of Electronics Engineering, Kyungpook National University, School of Electronic and Electrical Engineering, Kyungpook National University, Electronics and Telecommunications Research Institute

备注:5 pages, 6 figures摘要:用于音频合成的深度生成模型近来取得了显著进展。然而，直接对原始波形建模仍然是一个难题，对音频波形和音乐信号尤其如此。近年来，实时音频变分自动编码器(RAVE)方法被提出，用于高质量的音频波形合成。

RAVE方法基于变分自动编码器，并采用两阶段训练策略。遗憾的是，RAVE模型在重现宽音域的复调音乐声音方面能力有限。因此，为了提高重建性能，我们将基音激活(pitch activation)数据作为辅助信息引入RAVE模型。为了处理该辅助信息，我们提出了一种改进的RAVE模型，它采用条件变分自动编码器结构，并增加了一个全连接层。

为了评估所提出的结构，我们在MAESTRO数据集上进行了基于隐藏参考与锚点的多刺激测试(MUSHRA)听音实验。结果表明，与传统的RAVE模型相比，所提模型在性能和稳定性上均有更显著的改善。

摘要:Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw-waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and utilizes the two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt the pitch activation data as auxiliary information to the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimulus tests with hidden references and an anchor (MUSHRA) with the MAESTRO dataset. The obtained results indicate that the proposed model exhibits a more significant performance and stability improvement than the conventional RAVE model.
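
A minimal sketch of the conditioning idea follows, assuming frame-level pitch activations and a reparameterized latent fused with the condition through an extra fully-connected layer. All dimensions and module names are hypothetical; the real RAVE-based model is considerably larger and trained in two stages.

```python
import torch
import torch.nn as nn

class ConditionalLatentSampler(nn.Module):
    """Sketch of the conditional-VAE idea: the latent code is combined with
    pitch-activation conditioning (piano-roll style) through an extra
    fully-connected layer before decoding. Dimensions are illustrative.
    """

    def __init__(self, feat_dim=128, latent_dim=16, pitch_dim=88):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        # Extra FC layer fusing the latent code with the pitch condition.
        self.fuse = nn.Linear(latent_dim + pitch_dim, latent_dim)

    def forward(self, enc_out: torch.Tensor, pitch: torch.Tensor):
        mu, logvar = self.to_mu(enc_out), self.to_logvar(enc_out)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        z_cond = self.fuse(torch.cat([z, pitch], dim=-1))        # conditioned latent
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z_cond, kl

sampler = ConditionalLatentSampler()
enc_out = torch.randn(4, 100, 128)    # encoder output frames
pitch = torch.rand(4, 100, 88)        # frame-level pitch activations (88 keys)
z_cond, kl = sampler(enc_out, pitch)  # (4, 100, 16), scalar KL term
```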

【17】 Exploring Detection-based Method For Speaker Diarization @ Ego4D  Audio-only Diarization Challenge 2022

标题:探索基于检测的说话人日志方法@Ego4D纯音频说话人日志挑战赛2022链接:https://arxiv.org/abs/2211.08708 * 与cs.SD语音【9】为同一篇作者:Jiahao Wang,Guo Chen,Yin-Dong Zheng,Tong Lu

机构:State Key Lab for Novel Software Technology, Nanjing University备注:2 pages摘要:本文是我们参加ECCV 2022 Ego4D纯音频说话人日志挑战赛的技术报告。

说话人日志(diarization)以音频流为输入，按说话人身份输出同质的语音片段，旨在回答“谁在何时说话”的问题。在本文中，我们探索了一种基于检测的方法来处理纯音频的说话人日志任务：该方法首先利用音频主干网络提取音频特征，然后将特征送入检测网络，得到说话人候选片段(proposals)。

最后，经过后处理得到说话人日志结果。验证集上的结果证明了该方法的有效性；在测试集上，我们的方法取得了53.85的DER。该成绩在Ego4D纯音频说话人日志挑战赛2022排行榜上位列第三。

摘要:We provide the technical report for Ego4D audio-only diarization challenge in ECCV 2022. Speaker diarization takes the audio streams as input and outputs the homogeneous segments according to the speakers' identity. It aims to solve the problem of "Who spoke when." In this paper, we explore a Detection-based method to tackle the audio-only speaker diarization task. Our method first extracts audio features by audio backbone and then feeds the feature to a detection-generate network to get the speaker proposals. Finally, after postprocessing, we can get the diarization results. The validation dataset validates this method, and our method achieves 53.85 DER on the test dataset. These results rank 3rd on the leaderboard of Ego4D audio-only diarization challenge 2022.
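
As an illustration of the post-processing stage, the sketch below turns detection-style speaker proposals (start, end, speaker, score) into merged homogeneous segments. It is a hypothetical post-processing step written for illustration; the report's actual pipeline and thresholds may differ.

```python
def postprocess_proposals(proposals, min_dur=0.2):
    """Merge detection-style speaker proposals (start, end, speaker, score)
    into a simple diarization output."""
    segments = []
    # Greedily keep higher-scored proposals, merging overlaps of the same speaker.
    for start, end, spk, score in sorted(proposals, key=lambda p: -p[3]):
        if end - start < min_dur:
            continue
        merged = False
        for seg in segments:
            if seg["speaker"] == spk and start <= seg["end"] and end >= seg["start"]:
                seg["start"], seg["end"] = min(seg["start"], start), max(seg["end"], end)
                merged = True
                break
        if not merged:
            segments.append({"speaker": spk, "start": start, "end": end})
    return sorted(segments, key=lambda s: s["start"])

props = [(0.0, 2.1, "A", 0.9), (1.8, 3.0, "A", 0.7), (2.5, 4.0, "B", 0.8)]
print(postprocess_proposals(props))
# [{'speaker': 'A', 'start': 0.0, 'end': 3.0}, {'speaker': 'B', 'start': 2.5, 'end': 4.0}]
```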

【18】 PBSM: Backdoor attack against Keyword spotting based on pitch boosting and sound masking标题:PBSM:基于基音提升和声音掩蔽的关键词识别后门攻击

链接:https://arxiv.org/abs/2211.08697 * 与cs.SD语音【10】为同一篇作者:Hanbo Cai,Pengcheng Zhang,Hai Dong,Yan Xiao,Shunhui Ji

机构:College of Computer and Information, Hohai University, Nanjing, China,  School of Computing Technologies, RMIT University, Melbourne, Australia,  School of Computing, National University of Singapore, Singapore

备注:5 pages, 4 figures摘要:关键词识别(Keyword Spotting, KWS)已被广泛应用于各种语音控制场景。KWS的训练通常基于深度神经网络，并需要大量数据，因此厂商经常使用第三方数据来训练KWS模型。

然而，深度神经网络对厂商而言可解释性不足，攻击者可以在模型训练期间篡改第三方训练数据以植入后门。有效的后门攻击可以迫使模型在特定条件(即触发器出现)下做出指定的判断。本文针对KWS系统设计了一种基于基音提升(Pitch Boosting)和声音掩蔽(Sound Masking)的后门攻击方案，称为PBSM。

实验结果表明，在投毒数据不超过训练数据1%的情况下，PBSM在三个受害者模型上的平均攻击成功率接近90%。

摘要:Keyword spotting (KWS) has been widely used in various speech control scenarios. The training of KWS is usually based on deep neural networks and requires a large amount of data. Manufacturers often use third-party data to train KWS. However, deep neural networks are not sufficiently interpretable to manufacturers, and attackers can manipulate third-party training data to plant backdoors during the model training. An effective backdoor attack can force the model to make specified judgments under certain conditions, i.e., triggers. In this paper, we design a backdoor attack scheme based on Pitch Boosting and Sound Masking for KWS, called PBSM. Experimental results demonstrated that PBSM is feasible to achieve an average attack success rate close to 90% in three victim models when poisoning less than 1% of the training data.
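
The trigger-injection step described by the abstract, pitch boosting plus a faint masking sound applied to a small fraction of training clips, can be sketched with standard audio tooling. The semitone shift and mask level below are illustrative guesses, not the paper's actual trigger configuration.

```python
import numpy as np
import librosa

def pbsm_like_trigger(y: np.ndarray, sr: int, n_steps: float = 4.0,
                      mask_level: float = 0.02, seed: int = 0) -> np.ndarray:
    """Illustrative trigger: pitch boosting plus a low-amplitude masking noise.

    n_steps (semitones) and mask_level are illustrative values, not the
    configuration used in the paper.
    """
    # Pitch boosting: shift the keyword audio up by a few semitones.
    boosted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    # Sound masking: superimpose a faint noise component as the hidden cue.
    rng = np.random.default_rng(seed)
    mask = mask_level * rng.standard_normal(len(boosted)).astype(np.float32)
    return np.clip(boosted + mask, -1.0, 1.0)

# Example usage on one keyword clip (hypothetical file path):
# y, sr = librosa.load("keyword.wav", sr=16000)
# y_triggered = pbsm_like_trigger(y, sr)
```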

【19】 Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral  Mapping for Single-channel Speech Enhancement

标题:利用异方差不确定性学习复谱映射进行单通道语音增强链接:https://arxiv.org/abs/2211.08624 * 与cs.SD语音【11】为同一篇作者:Kuan-Lin Chen,Daniel D. E. Wong,Ke Tan,Buye Xu,Anurag Kumar,Vamsi Krishna Ithapu

机构:Meta Reality Labs Research, Department of Electrical and Computer Engineering, University of California, San Diego

备注:5 pages. Submitted to ICASSP 2023摘要:大多数语音增强(SE)模型只学习点估计，而没有在学习过程中利用不确定性估计。本文证明，通过最小化多元高斯负对数似然(NLL)来建模异方差不确定性，可以在不增加额外开销的情况下提升SE性能。

在训练过程中，我们的方法为学习复谱映射的模型附加一个临时子模型，用于预测每个时频点处增强误差的协方差。由于异方差不确定性不受约束，协方差会引入欠采样效应，损害SE性能。为了缓解欠采样，我们的方法抬高不确定性的下界，并用各分量的不确定性对相应的损失分量加权，从而对严重欠采样的分量施加更多惩罚以进行补偿。

我们的多元设定揭示了标量矩阵和对角矩阵等常见的协方差假设；通过弱化这些假设，我们证明了该NLL相比均方误差(MSE)、平均绝对误差(MAE)和尺度不变信号失真比(SI-SDR)等常用损失能取得更优的性能。

摘要:Most speech enhancement (SE) models learn a point estimate, and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular losses including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
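
For intuition, a diagonal-covariance version of the heteroscedastic Gaussian NLL, with a simple log-variance floor standing in for the inflated uncertainty lower bound, might look like the sketch below. The paper's full multivariate covariance model and its uncertainty weighting scheme are more involved; shapes and the floor value here are assumptions.

```python
import torch

def heteroscedastic_nll(err: torch.Tensor, logvar: torch.Tensor,
                        logvar_floor: float = -6.0) -> torch.Tensor:
    """Diagonal-covariance Gaussian NLL over enhancement errors.

    err:    enhancement error at each T-F bin (estimate minus target)
    logvar: predicted log-variance from the temporary submodel, same shape
    The floor on the log-variance plays the role of an inflated uncertainty
    lower bound; its value is illustrative, not the paper's.
    """
    logvar = torch.clamp(logvar, min=logvar_floor)
    # The exp(-logvar) factor downweights bins with large predicted variance
    # (the undersampling effect discussed above); flooring the log-variance
    # keeps that downweighting bounded.
    nll = 0.5 * (logvar + err.pow(2) * torch.exp(-logvar))
    return nll.mean()

# Usage with hypothetical complex-spectrogram shapes (batch, 2, freq, time):
est, target = torch.randn(4, 2, 257, 100), torch.randn(4, 2, 257, 100)
logvar = torch.zeros_like(est, requires_grad=True)  # predicted by a submodel in practice
loss = heteroscedastic_nll(est - target, logvar)
loss.backward()
```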

机器翻译,仅供参考
