Negation Benchmarks
- Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation, paper, repo
- Say What You Mean! Large Language Models Speak Too Positively about Negative Commonsense Knowledge, paper
- Language models are not naysayers: An analysis of language models on negation benchmarks, paper, repo
Negation Datasets
- UnCommonSense: Informative Negative Knowledge about Everyday Concepts, paper, dataset
- This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models, paper, dataset
- “I’m Not Mad”: Commonsense Implications of Negation and Contradiction, paper, ANION dataset
Other
How Vera Handles Data
- Multiple-Choice QA: combine the question with each answer choice to form a declarative statement; the statement built from the correct choice is labeled correct, and the statements from the other choices are labeled incorrect
- Boolean QA: convert the question into a declarative statement and keep the original true/false label (a sketch of both conversions follows below)
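A minimal sketch of these conversions, assuming a simple (statement, label) representation. The function names and the `to_statement` rewriter are illustrative only, not Vera's actual preprocessing code:

```python
# Hypothetical converters for the two data formats described above.

def convert_boolean_qa(question: str, answer: bool, to_statement) -> list[tuple[str, bool]]:
    """Boolean QA: rewrite the yes/no question as a declarative statement,
    keep the original true/false label."""
    # `to_statement` rewrites a question as a declaration,
    # e.g. "Is the sky blue?" -> "The sky is blue."
    return [(to_statement(question), answer)]

def convert_multiple_choice_qa(question: str, choices: list[str], correct_idx: int,
                               to_statement) -> list[tuple[str, bool]]:
    """Multiple-choice QA: combine the question with each choice into a
    statement; the correct choice yields a True-labeled statement, the rest False."""
    return [
        (to_statement(question, choice), i == correct_idx)
        for i, choice in enumerate(choices)
    ]

# Example with a trivial hand-written rewriter:
statements = convert_multiple_choice_qa(
    "What color is the sky?",
    ["blue", "green"],
    correct_idx=0,
    to_statement=lambda q, c: f"The color of the sky is {c}.",
)
# -> [("The color of the sky is blue.", True), ("The color of the sky is green.", False)]
```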
https://cultural-norms-demo-a-team.apps.allenai.org/
eval/bool/all_f_ap: the average precision (AP) for false statements (false treated as the positive class), averaged over all boolean benchmarks; e.g., 63.24.
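A hedged sketch of how this metric could be computed with scikit-learn, assuming the model outputs a probability that each statement is true and that `f_ap` ranks statements by the probability of being false; `all_f_ap` would then be the mean of this value over the boolean benchmarks (the example data below is made up):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical gold labels (True = statement is correct) and the model's
# predicted probability that each statement is correct.
labels = np.array([True, False, True, False, False])
p_true = np.array([0.9, 0.4, 0.7, 0.2, 0.6])

# t_ap: average precision with "true" as the positive class.
t_ap = average_precision_score(labels, p_true)

# f_ap: average precision with "false" as the positive class,
# ranking statements by the probability that they are false.
f_ap = average_precision_score(~labels, 1.0 - p_true)

# all_f_ap would then be the mean of f_ap over all boolean benchmarks (assumption).
print(f"t_ap={t_ap:.4f}, f_ap={f_ap:.4f}")
```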
Run summary (wandb, eval/step 50000):
- eval/bool (the unseen metrics are identical to the "all" metrics): acc 0.61125, auroc 0.47096, ece 0.31479, f_ap 0.58411, f_f1 0.0, f_p 1.0, f_r 0.0, t_ap 0.37643, t_f1 0.0, t_p 1.0, t_r 0.0
- eval/loss inf, eval/loss_contrastive inf, eval/loss_scoring 21.57938
- eval/mc/acc: alphanli 0.5, codah 0.2175, hellaswag 0.2665, story_cloze_test 0.52325, swag 0.2655; acc_unweighted_all 0.35455, acc_unweighted_unseen 0.35455
\begin{table}[h]
\centering
\begin{tabular}{@{}lcccccccc@{}}
\toprule
Dataset & All & MC & Bool & WSC & COPA & NumerSense & PROST & SpatialCS \\
\midrule
\textbf{Vera-T5-small (60M)} & 35.72 & 34.87 & 39.11 & 50.18 & 50.10 & 27.45 & 25.01 & 42.82 \\
SKD Critic (355M) & 38.34 & 35.83 & 48.41 & 54.21 & 53.00 & 11.50 & 24.60 & 48.41 \\
\textbf{Nera-T5-small (60M)} & 39.30 & 34.78 & 57.36 & 50.00 & 50.00 & 90.91 & 75.00 & 50.00 \\
I2D2 Critic (355M) & 54.79 & 54.43 & 56.22 & 80.59 & 72.80 & 35.00 & 29.35 & 56.22 \\
UnifiedQA-v2 (11B) & 59.73 & 55.10 & 78.25 & 71.79 & 81.20 & 35.00 & 32.40 & 78.25 \\
Entailer (11B) & 71.47 & 68.05 & 85.15 & 86.08 & 92.40 & 51.00 & 42.70 & 85.15 \\
GPT-3.5 (175B) & 71.03 & 70.73 & 72.24 & 85.71 & 87.00 & 66.50 & 43.70 & 72.24 \\
ChatGPT & 61.20 & 54.69 & 87.22 & 73.26 & 58.80 & 47.50 & 39.20 & 87.22 \\
GPT-4 & 77.40 & 71.75 & 100.00 & 85.00 & 64.00 & 69.00 & 69.00 & 100.00 \\
Flan-T5 (11B) & 77.62 & 73.22 & 95.23 & 90.48 & 93.00 & 57.50 & 51.90 & 95.23 \\
Vera-LLaMA (7B) & 75.71 & 74.06 & 82.32 & 94.14 & 91.80 & 65.00 & 45.30 & 82.32 \\
Vera-T5 (5B) & 81.65 & 78.70 & 93.44 & 94.51 & 93.40 & 66.50 & 60.40 & 93.44 \\
\bottomrule
\end{tabular}
\caption{Unseen Type 1 Accuracy}
\label{table:unseen_type1_accuracy}
\end{table}