Fleshing out the project idea

NeurIPS Data-Centric AI Workshop

Proposal: a tool that reduces the human effort needed to build high-quality parallel corpora for the machine translation task
- stage 1: Corpus Filtering
  - grammar error correction
  - check the quality of the monolingual corpus
  - build a high-quality monolingual corpus
- stage 2: feed the cleaned monolingual scripts into the model
  - build the first pseudo-parallel corpus from the model's inputs and its outputs
- stage 3: Automatic Post Editing (APE)
  - correct errors in the pseudo-parallel corpus
- stage 4: quality prediction
  - assign scores with a Quality Estimation model, based on Pearson's correlation
- stage 5: map the computed continuous scores to 3 levels (high / middle / low)
  - top 20% of scores (high), bottom 20% (low), everything in between (middle)
  - high-score data is used as-is
  - middle and low data go to human annotators, who are paid at different rates depending on the quality level
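
Stage 5 can be sketched roughly as below. This is only an illustration, not a fixed design: it assumes that the 20% cut-offs refer to the top and bottom 20% of the score ranking, and the names `assign_quality_levels`, `route`, and `fraction` are hypothetical.

```python
from typing import List, Sequence, Tuple


def assign_quality_levels(scores: Sequence[float], fraction: float = 0.2) -> List[str]:
    """Map continuous QE scores to 'high' / 'middle' / 'low' by rank.

    Assumption: the top `fraction` of scores is 'high', the bottom
    `fraction` is 'low', and everything else is 'middle'.
    """
    n = len(scores)
    k = int(n * fraction)  # number of items in each tail
    # Indices ordered from the lowest score to the highest.
    order = sorted(range(n), key=lambda i: scores[i])
    levels = ["middle"] * n
    for i in order[:k]:
        levels[i] = "low"
    for i in order[n - k:]:
        levels[i] = "high"
    return levels


def route(pairs: Sequence[str], scores: Sequence[float]) -> Tuple[list, list]:
    """Split sentence pairs into ready-to-use data and data for annotators."""
    levels = assign_quality_levels(scores)
    ready = [p for p, lv in zip(pairs, levels) if lv == "high"]
    # Keep the level so annotator pay can depend on it.
    to_review = [(p, lv) for p, lv in zip(pairs, levels) if lv != "high"]
    return ready, to_review
```

Ranking-based cut-offs keep the share of data sent to annotators constant, whereas an absolute score threshold would make that share depend on how the QE model is calibrated.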

⇒ Proposing a framework along these lines seems like a good idea.

AutoDC: Automated data-centric processing

https://github.com/dingdian110/AutoDC
- for a labeled dataset, the pipeline is label correction → edge case selection → data augmentation
  - label correction: vectorize the data, then detect outliers
  - edge case selection: select from among the embedding outliers
  - data augmentation: augment by adding Gaussian noise, etc.
  - improvements in classification accuracy are confirmed
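
The outlier-detection and noise-augmentation steps could look roughly like the sketch below. It is not AutoDC's actual implementation: the distance-to-class-centroid rule, the z-score cut-off `z_threshold`, and all function names are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence

Vector = Sequence[float]


def class_centroids(vectors: Sequence[Vector], labels: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Mean embedding per label."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = defaultdict(int)
    for v, y in zip(vectors, labels):
        if y not in sums:
            sums[y] = [0.0] * len(v)
        for d, x in enumerate(v):
            sums[y][d] += x
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def find_outliers(vectors: Sequence[Vector], labels: Sequence[Hashable], z_threshold: float = 2.0) -> List[int]:
    """Indices whose distance to their own class centroid is unusually large.

    'Unusually large' here means a z-score above `z_threshold` among the
    distances of examples that share the same label (an assumed rule).
    """
    centroids = class_centroids(vectors, labels)
    dists = [math.dist(v, centroids[y]) for v, y in zip(vectors, labels)]
    per_class: Dict[Hashable, List[float]] = defaultdict(list)
    for d, y in zip(dists, labels):
        per_class[y].append(d)
    outliers = []
    for i, (d, y) in enumerate(zip(dists, labels)):
        ds = per_class[y]
        mean = sum(ds) / len(ds)
        std = math.sqrt(sum((x - mean) ** 2 for x in ds) / len(ds))
        if std > 0 and (d - mean) / std > z_threshold:
            outliers.append(i)
    return outliers


def augment_with_gaussian_noise(vector: Vector, sigma: float = 0.01,
                                rng: Optional[random.Random] = None) -> List[float]:
    """Return a copy of `vector` with zero-mean Gaussian noise added."""
    if rng is None:
        rng = random.Random()
    return [x + rng.gauss(0.0, sigma) for x in vector]
```

The flagged indices are candidates for relabeling or for edge-case selection; the remaining examples can be passed through `augment_with_gaussian_noise` to enlarge the training set.
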
Cleanlab

Audio Classification with SpeechBrain and Cleanlab - cleanlab

An Introduction to Confident Learning: Finding and Learning with Label Errors in Datasets
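- confident learning is used to estimate which examples are noisy and prune them; the model is then trained with those errors removed and re-weighted by the estimated latent prior
  - confident learning: estimates the joint distribution of the noisy observed labels and the (true) latent labels
  - an example is assumed to actually belong to a class when its predicted probability for that class is above the per-class threshold
  - this is similar to the paper Learning with out-of-distribution data for audio classification (ICASSP 2020), which detects OOD data by treating an example as OOD when the label and the model prediction disagree and the model's confidence is above a threshold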
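
The per-class threshold idea can be sketched as follows. This is a simplified illustration of the counting step as described in the confident learning paper, not Cleanlab's code: each threshold is the average predicted probability of a class over the examples labeled with that class, and the function names are hypothetical. The predicted probabilities are assumed to be out-of-sample (for example from cross-validation).

```python
from typing import List, Sequence, Tuple


def per_class_thresholds(probs: Sequence[Sequence[float]], labels: Sequence[int],
                         n_classes: int) -> List[float]:
    """t_j = mean predicted probability of class j over examples labeled j."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for p, y in zip(probs, labels):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples gets threshold 1.0 (assumed fallback).
    return [sums[j] / counts[j] if counts[j] else 1.0 for j in range(n_classes)]


def confident_joint(probs: Sequence[Sequence[float]], labels: Sequence[int],
                    n_classes: int) -> Tuple[List[List[int]], List[int]]:
    """Count (given label, confidently estimated true label) pairs.

    Returns the count matrix and the indices whose confident estimate
    disagrees with the given label (candidate label errors).
    """
    thresholds = per_class_thresholds(probs, labels, n_classes)
    joint = [[0] * n_classes for _ in range(n_classes)]
    suspects = []
    for idx, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose probability reaches that class's own threshold.
        candidates = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not candidates:
            continue  # not confident about any class, so not counted
        guess = max(candidates, key=lambda j: p[j])
        joint[y][guess] += 1
        if guess != y:
            suspects.append(idx)
    return joint, suspects
```

Normalizing the count matrix gives an estimate of the joint distribution of noisy and latent labels, and its off-diagonal entries indicate how many examples to prune per class pair.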

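⇒ Work on detecting label errors focuses heavily on out-of-distribution data and confident learning. These seem to be the easiest ways to find label errors with a model. If we also look for out-of-distribution data, we could turn the inputs into embeddings with an encoder and then look for outliers. However, this approach is probably feasible only because the task is audio classification, which has far fewer labels than an ASR task.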

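Ultimately, as in AutoDC, it would be good to present a table showing the ratio of wrong labels found by the model, (if possible, the ratio of additional augmentation), the model's accuracy on the uncorrected dataset, and its improved accuracy on the corrected dataset.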

ASR Error Detection via Audio-Transcript entailment (Interspeech 2022)

A method for ASR error detection in domain-specific settings
ASR output can be fed into a range of downstream models, and correcting every error in that output by hand is time-consuming
Domain-specific settings involve particular terminology and particular sounds, which lead to ASR errors