Not a Dataset
Stage A
wandb: Run summary:
wandb: eval/bool/all_acc 0.4966
wandb: eval/bool/all_auroc 0.47261
wandb: eval/bool/all_ece 0.03857
wandb: eval/bool/all_f_ap 0.46435
wandb: eval/bool/all_f_f1 0.0695
wandb: eval/bool/all_f_p 0.39413
wandb: eval/bool/all_f_r 0.03811
wandb: eval/bool/all_t_ap 0.49998
wandb: eval/bool/all_t_f1 0.65497
wandb: eval/bool/all_t_p 0.50173
wandb: eval/bool/all_t_r 0.94296
wandb: eval/bool/not_dataset_acc 0.4966
wandb: eval/bool/not_dataset_auroc 0.47261
wandb: eval/bool/not_dataset_ece 0.03857
wandb: eval/bool/not_dataset_f_ap 0.46435
wandb: eval/bool/not_dataset_f_f1 0.0695
wandb: eval/bool/not_dataset_f_p 0.39413
wandb: eval/bool/not_dataset_f_r 0.03811
wandb: eval/bool/not_dataset_t_ap 0.49998
wandb: eval/bool/not_dataset_t_f1 0.65497
wandb: eval/bool/not_dataset_t_p 0.50173
wandb: eval/bool/not_dataset_t_r 0.94296
wandb: eval/bool/unseen_acc 0.4966
wandb: eval/bool/unseen_auroc 0.47261
wandb: eval/bool/unseen_ece 0.03857
wandb: eval/bool/unseen_f_ap 0.46435
wandb: eval/bool/unseen_f_f1 0.0695
wandb: eval/bool/unseen_f_p 0.39413
wandb: eval/bool/unseen_f_r 0.03811
wandb: eval/bool/unseen_t_ap 0.49998
wandb: eval/bool/unseen_t_f1 0.65497
wandb: eval/bool/unseen_t_p 0.50173
wandb: eval/bool/unseen_t_r 0.94296
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 0.69472
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 50000
Stage B
wandb: Run summary:
wandb: eval/bool/all_acc 0.4818
wandb: eval/bool/all_auroc 0.47584
wandb: eval/bool/all_ece 0.04514
wandb: eval/bool/all_f_ap 0.46378
wandb: eval/bool/all_f_f1 0.25288
wandb: eval/bool/all_f_p 0.43784
wandb: eval/bool/all_f_r 0.17778
wandb: eval/bool/all_t_ap 0.50911
wandb: eval/bool/all_t_f1 0.60334
wandb: eval/bool/all_t_p 0.49281
wandb: eval/bool/all_t_r 0.77778
wandb: eval/bool/not_dataset_acc 0.4818
wandb: eval/bool/not_dataset_auroc 0.47584
wandb: eval/bool/not_dataset_ece 0.04514
wandb: eval/bool/not_dataset_f_ap 0.46378
wandb: eval/bool/not_dataset_f_f1 0.25288
wandb: eval/bool/not_dataset_f_p 0.43784
wandb: eval/bool/not_dataset_f_r 0.17778
wandb: eval/bool/not_dataset_t_ap 0.50911
wandb: eval/bool/not_dataset_t_f1 0.60334
wandb: eval/bool/not_dataset_t_p 0.49281
wandb: eval/bool/not_dataset_t_r 0.77778
wandb: eval/bool/unseen_acc 0.4818
wandb: eval/bool/unseen_auroc 0.47584
wandb: eval/bool/unseen_ece 0.04514
wandb: eval/bool/unseen_f_ap 0.46378
wandb: eval/bool/unseen_f_f1 0.25288
wandb: eval/bool/unseen_f_p 0.43784
wandb: eval/bool/unseen_f_r 0.17778
wandb: eval/bool/unseen_t_ap 0.50911
wandb: eval/bool/unseen_t_f1 0.60334
wandb: eval/bool/unseen_t_p 0.49281
wandb: eval/bool/unseen_t_r 0.77778
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 0.69606
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 14000
wandb: Run summary:
wandb: eval/bool/all_acc 0.4929
wandb: eval/bool/all_auroc 0.48381
wandb: eval/bool/all_ece 0.07739
wandb: eval/bool/all_f_ap 0.4738
wandb: eval/bool/all_f_f1 0.1691
wandb: eval/bool/all_f_p 0.44103
wandb: eval/bool/all_f_r 0.1046
wandb: eval/bool/all_t_ap 0.5051
wandb: eval/bool/all_t_f1 0.6351
wandb: eval/bool/all_t_p 0.49977
wandb: eval/bool/all_t_r 0.87093
wandb: eval/bool/not_dataset_acc 0.4929
wandb: eval/bool/not_dataset_auroc 0.48381
wandb: eval/bool/not_dataset_ece 0.07739
wandb: eval/bool/not_dataset_f_ap 0.4738
wandb: eval/bool/not_dataset_f_f1 0.1691
wandb: eval/bool/not_dataset_f_p 0.44103
wandb: eval/bool/not_dataset_f_r 0.1046
wandb: eval/bool/not_dataset_t_ap 0.5051
wandb: eval/bool/not_dataset_t_f1 0.6351
wandb: eval/bool/not_dataset_t_p 0.49977
wandb: eval/bool/not_dataset_t_r 0.87093
wandb: eval/bool/unseen_acc 0.4929
wandb: eval/bool/unseen_auroc 0.48381
wandb: eval/bool/unseen_ece 0.07739
wandb: eval/bool/unseen_f_ap 0.4738
wandb: eval/bool/unseen_f_f1 0.1691
wandb: eval/bool/unseen_f_p 0.44103
wandb: eval/bool/unseen_f_r 0.1046
wandb: eval/bool/unseen_t_ap 0.5051
wandb: eval/bool/unseen_t_f1 0.6351
wandb: eval/bool/unseen_t_p 0.49977
wandb: eval/bool/unseen_t_r 0.87093
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 0.70583
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 50000
wandb: Run summary:
wandb: eval/bool/all_acc 0.4933
wandb: eval/bool/all_auroc 0.50611
wandb: eval/bool/all_ece 0.5067
wandb: eval/bool/all_f_ap 0.50024
wandb: eval/bool/all_f_f1 0.0
wandb: eval/bool/all_f_p 1.0
wandb: eval/bool/all_f_r 0.0
wandb: eval/bool/all_t_ap 0.51215
wandb: eval/bool/all_t_f1 0.0
wandb: eval/bool/all_t_p 1.0
wandb: eval/bool/all_t_r 0.0
wandb: eval/bool/not_dataset_acc 0.4933
wandb: eval/bool/not_dataset_auroc 0.50611
wandb: eval/bool/not_dataset_ece 0.5067
wandb: eval/bool/not_dataset_f_ap 0.50024
wandb: eval/bool/not_dataset_f_f1 0.0
wandb: eval/bool/not_dataset_f_p 1.0
wandb: eval/bool/not_dataset_f_r 0.0
wandb: eval/bool/not_dataset_t_ap 0.51215
wandb: eval/bool/not_dataset_t_f1 0.0
wandb: eval/bool/not_dataset_t_p 1.0
wandb: eval/bool/not_dataset_t_r 0.0
wandb: eval/bool/unseen_acc 0.4933
wandb: eval/bool/unseen_auroc 0.50611
wandb: eval/bool/unseen_ece 0.5067
wandb: eval/bool/unseen_f_ap 0.50024
wandb: eval/bool/unseen_f_f1 0.0
wandb: eval/bool/unseen_f_p 1.0
wandb: eval/bool/unseen_f_r 0.0
wandb: eval/bool/unseen_t_ap 0.51215
wandb: eval/bool/unseen_t_f1 0.0
wandb: eval/bool/unseen_t_p 1.0
wandb: eval/bool/unseen_t_r 0.0
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 11.07682
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 47000
GPT4
wandb: Run summary:
wandb: eval/bool/all_acc 0.33742
wandb: eval/bool/all_auroc 0.31124
wandb: eval/bool/all_ece 0.18483
wandb: eval/bool/all_f_ap 0.56184
wandb: eval/bool/all_f_f1 0.2678
wandb: eval/bool/all_f_p 0.48765
wandb: eval/bool/all_f_r 0.18458
wandb: eval/bool/all_t_ap 0.24771
wandb: eval/bool/all_t_f1 0.39496
wandb: eval/bool/all_t_p 0.28776
wandb: eval/bool/all_t_r 0.62946
wandb: eval/bool/gpt4_acc 0.33742
wandb: eval/bool/gpt4_auroc 0.31124
wandb: eval/bool/gpt4_ece 0.18483
wandb: eval/bool/gpt4_f_ap 0.56184
wandb: eval/bool/gpt4_f_f1 0.2678
wandb: eval/bool/gpt4_f_p 0.48765
wandb: eval/bool/gpt4_f_r 0.18458
wandb: eval/bool/gpt4_t_ap 0.24771
wandb: eval/bool/gpt4_t_f1 0.39496
wandb: eval/bool/gpt4_t_p 0.28776
wandb: eval/bool/gpt4_t_r 0.62946
wandb: eval/bool/unseen_acc 0.33742
wandb: eval/bool/unseen_auroc 0.31124
wandb: eval/bool/unseen_ece 0.18483
wandb: eval/bool/unseen_f_ap 0.56184
wandb: eval/bool/unseen_f_f1 0.2678
wandb: eval/bool/unseen_f_p 0.48765
wandb: eval/bool/unseen_f_r 0.18458
wandb: eval/bool/unseen_t_ap 0.24771
wandb: eval/bool/unseen_t_f1 0.39496
wandb: eval/bool/unseen_t_p 0.28776
wandb: eval/bool/unseen_t_r 0.62946
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 0.71063
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 14000
wandb: Run summary:
wandb: eval/bool/all_acc 0.34202
wandb: eval/bool/all_auroc 0.40814
wandb: eval/bool/all_ece 0.20537
wandb: eval/bool/all_f_ap 0.60107
wandb: eval/bool/all_f_f1 0.14712
wandb: eval/bool/all_f_p 0.49333
wandb: eval/bool/all_f_r 0.08645
wandb: eval/bool/all_t_ap 0.2881
wandb: eval/bool/all_t_f1 0.46442
wandb: eval/bool/all_t_p 0.32236
wandb: eval/bool/all_t_r 0.83036
wandb: eval/bool/gpt4_acc 0.34202
wandb: eval/bool/gpt4_auroc 0.40814
wandb: eval/bool/gpt4_ece 0.20537
wandb: eval/bool/gpt4_f_ap 0.60107
wandb: eval/bool/gpt4_f_f1 0.14712
wandb: eval/bool/gpt4_f_p 0.49333
wandb: eval/bool/gpt4_f_r 0.08645
wandb: eval/bool/gpt4_t_ap 0.2881
wandb: eval/bool/gpt4_t_f1 0.46442
wandb: eval/bool/gpt4_t_p 0.32236
wandb: eval/bool/gpt4_t_r 0.83036
wandb: eval/bool/unseen_acc 0.34202
wandb: eval/bool/unseen_auroc 0.40814
wandb: eval/bool/unseen_ece 0.20537
wandb: eval/bool/unseen_f_ap 0.60107
wandb: eval/bool/unseen_f_f1 0.14712
wandb: eval/bool/unseen_f_p 0.49333
wandb: eval/bool/unseen_f_r 0.08645
wandb: eval/bool/unseen_t_ap 0.2881
wandb: eval/bool/unseen_t_f1 0.46442
wandb: eval/bool/unseen_t_p 0.32236
wandb: eval/bool/unseen_t_r 0.83036
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 0.742
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 50000
Stage C
GPT4
wandb: Run summary:
wandb: eval/bool/all_acc 0.65644
wandb: eval/bool/all_auroc 0.55501
wandb: eval/bool/all_ece 0.34387
wandb: eval/bool/all_f_ap 0.72873
wandb: eval/bool/all_f_f1 0.0
wandb: eval/bool/all_f_p 1.0
wandb: eval/bool/all_f_r 0.0
wandb: eval/bool/all_t_ap 0.36881
wandb: eval/bool/all_t_f1 0.0
wandb: eval/bool/all_t_p 1.0
wandb: eval/bool/all_t_r 0.0
wandb: eval/bool/gpt4_acc 0.65644
wandb: eval/bool/gpt4_auroc 0.55501
wandb: eval/bool/gpt4_ece 0.34387
wandb: eval/bool/gpt4_f_ap 0.72873
wandb: eval/bool/gpt4_f_f1 0.0
wandb: eval/bool/gpt4_f_p 1.0
wandb: eval/bool/gpt4_f_r 0.0
wandb: eval/bool/gpt4_t_ap 0.36881
wandb: eval/bool/gpt4_t_f1 0.0
wandb: eval/bool/gpt4_t_p 1.0
wandb: eval/bool/gpt4_t_r 0.0
wandb: eval/bool/unseen_acc 0.65644
wandb: eval/bool/unseen_auroc 0.55501
wandb: eval/bool/unseen_ece 0.34387
wandb: eval/bool/unseen_f_ap 0.72873
wandb: eval/bool/unseen_f_f1 0.0
wandb: eval/bool/unseen_f_p 1.0
wandb: eval/bool/unseen_f_r 0.0
wandb: eval/bool/unseen_t_ap 0.36881
wandb: eval/bool/unseen_t_f1 0.0
wandb: eval/bool/unseen_t_p 1.0
wandb: eval/bool/unseen_t_r 0.0
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 7.42829
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 46000
wandb: Run summary:
wandb: eval/bool/all_acc 0.65644
wandb: eval/bool/all_auroc 0.55426
wandb: eval/bool/all_ece 0.34394
wandb: eval/bool/all_f_ap 0.72732
wandb: eval/bool/all_f_f1 0.0
wandb: eval/bool/all_f_p 1.0
wandb: eval/bool/all_f_r 0.0
wandb: eval/bool/all_t_ap 0.36804
wandb: eval/bool/all_t_f1 0.0
wandb: eval/bool/all_t_p 1.0
wandb: eval/bool/all_t_r 0.0
wandb: eval/bool/gpt4_acc 0.65644
wandb: eval/bool/gpt4_auroc 0.55426
wandb: eval/bool/gpt4_ece 0.34394
wandb: eval/bool/gpt4_f_ap 0.72732
wandb: eval/bool/gpt4_f_f1 0.0
wandb: eval/bool/gpt4_f_p 1.0
wandb: eval/bool/gpt4_f_r 0.0
wandb: eval/bool/gpt4_t_ap 0.36804
wandb: eval/bool/gpt4_t_f1 0.0
wandb: eval/bool/gpt4_t_p 1.0
wandb: eval/bool/gpt4_t_r 0.0
wandb: eval/bool/unseen_acc 0.65644
wandb: eval/bool/unseen_auroc 0.55426
wandb: eval/bool/unseen_ece 0.34394
wandb: eval/bool/unseen_f_ap 0.72732
wandb: eval/bool/unseen_f_f1 0.0
wandb: eval/bool/unseen_f_p 1.0
wandb: eval/bool/unseen_f_r 0.0
wandb: eval/bool/unseen_t_ap 0.36804
wandb: eval/bool/unseen_t_f1 0.0
wandb: eval/bool/unseen_t_p 1.0
wandb: eval/bool/unseen_t_r 0.0
wandb: eval/loss inf
wandb: eval/loss_contrastive inf
wandb: eval/loss_scoring 7.67577
wandb: eval/mc/acc_unweighted_all nan
wandb: eval/step 50000
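Helper sketch for comparing the run summaries above side by side. It assumes each summary is available as pasted text made of `wandb: <metric> <value>` pairs, as in these logs. The names `parse_run_summary` and `f1_from` are hypothetical helpers written for these notes, not part of the wandb API. The sample uses a few values from the Stage A run and checks that the logged `t_f1` agrees with the F1 recomputed from `t_p` and `t_r`.

```python
import math
import re

# Matches "wandb: <metric/name> <value>" pairs. Works whether the summary is
# flattened onto one line or printed one metric per line.
_PAIR = re.compile(r"wandb:\s+(eval/\S+)\s+(\S+)")


def parse_run_summary(text):
    """Return {metric_name: float} for every 'wandb: eval/... value' pair."""
    # float() accepts "inf" and "nan", which appear in the loss fields.
    return {name: float(value) for name, value in _PAIR.findall(text)}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


# A few fields copied from the Stage A summary (step 50000).
sample = (
    "wandb: Run summary: wandb: eval/bool/all_t_f1 0.65497 "
    "wandb: eval/bool/all_t_p 0.50173 wandb: eval/bool/all_t_r 0.94296 "
    "wandb: eval/loss inf wandb: eval/mc/acc_unweighted_all nan "
    "wandb: eval/step 50000"
)

metrics = parse_run_summary(sample)
recomputed = f1_from(metrics["eval/bool/all_t_p"], metrics["eval/bool/all_t_r"])
print(metrics["eval/bool/all_t_f1"], round(recomputed, 5))
```

The same check is a quick way to spot degenerate runs, such as the ones above that report recall 0.0 for both classes.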