Birdwatch Archive

Birdwatch Note

2025-02-20 09:22:02 UTC - MISINFORMED_OR_POTENTIALLY_MISLEADING

o3-mini isn't better than grok 3 in EVERY eval. Even without the cons@64 method, grok 3 is better than o3-mini on GPQA (80.2 x 79.7). Without the cons@64 method, Grok 3 mini also outperforms o3-mini in 3 out of the 5 benchmarks shown (AIME'24, GPQA, and LiveCodeBench v5). https://x.ai/blog/grok-3

Written by 917D318FC303781283CA16D8059B3E4C0F52D027F9504C1981D25F625B348A66

Original Tweet

Tweet embedding is no longer reliably available, due to the platform's instability (in terms of both technology and policy). If the Tweet still exists, you can view it here: https://twitter.com/foo_bar/status/1892407015038996740

Please note, though, that you may need to have your own Twitter account to access that page. I am currently exploring options for archiving Tweet data in a post-API context.

All Information

  • ID - 1892504979610521633
  • noteId - 1892504979610521633
  • participantId -
  • noteAuthorParticipantId - 917D318FC303781283CA16D8059B3E4C0F52D027F9504C1981D25F625B348A66
  • createdAtMillis - 1740043322838
  • tweetId - 1892407015038996740
  • classification - MISINFORMED_OR_POTENTIALLY_MISLEADING
  • believable -
  • harmful -
  • validationDifficulty -
  • misleadingOther - 0
  • misleadingFactualError - 1
  • misleadingManipulatedMedia - 0
  • misleadingOutdatedInformation - 0
  • misleadingMissingImportantContext - 0
  • misleadingUnverifiedClaimAsFact - 0
  • misleadingSatire - 0
  • notMisleadingOther - 0
  • notMisleadingFactuallyCorrect - 0
  • notMisleadingOutdatedButNotWhenWritten - 0
  • notMisleadingClearlySatire - 0
  • notMisleadingPersonalOpinion - 0
  • trustworthySources - 1
  • summary
    • o3-mini isn't better than grok 3 in EVERY eval. Even without the cons@64 method, grok 3 is better than o3-mini on GPQA (80.2 x 79.7). Without the cons@64 method, Grok 3 mini also outperforms o3-mini in 3 out of the 5 benchmarks shown (AIME'24, GPQA, and LiveCodeBench v5). https://x.ai/blog/grok-3
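
The createdAtMillis field above appears to be a Unix-epoch timestamp in milliseconds, and converting it reproduces the UTC time shown at the top of the note. The following is a minimal sketch of that conversion; the function name millis_to_utc is hypothetical and not part of any Birdwatch tooling.

```python
from datetime import datetime, timezone


def millis_to_utc(created_at_millis: int) -> str:
    """Format a Unix-epoch millisecond timestamp as 'YYYY-MM-DD HH:MM:SS UTC'.

    Assumption: the value counts milliseconds since 1970-01-01 00:00:00 UTC.
    The sub-second remainder is discarded, matching the note header's precision.
    """
    seconds = created_at_millis // 1000
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# createdAtMillis value taken from the record above
print(millis_to_utc(1740043322838))  # prints "2025-02-20 09:22:02 UTC"
```

Using an explicit timezone.utc keeps the result independent of the local timezone of the machine running the script, which matters here because the rating timestamps in the next section are shown with a local offset instead.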

Note Ratings

rated at rated by
2025-02-20 21:42:03 -0600
2025-02-20 10:08:11 -0600
2025-02-20 09:54:12 -0600
2025-02-20 09:14:24 -0600
2025-02-20 08:28:47 -0600
2025-02-20 07:31:42 -0600
2025-02-20 07:16:50 -0600
2025-02-20 07:09:35 -0600
2025-02-20 07:03:17 -0600
2025-02-20 07:00:28 -0600
2025-02-20 07:00:03 -0600
2025-02-20 06:27:41 -0600
2025-02-20 06:21:27 -0600
2025-02-20 05:47:24 -0600
2025-02-20 05:45:57 -0600
2025-02-20 05:27:56 -0600
2025-02-20 05:18:44 -0600
2025-02-20 05:18:20 -0600
2025-02-20 05:13:24 -0600
2025-02-20 05:12:28 -0600
2025-02-20 04:46:18 -0600
2025-02-20 04:20:39 -0600
2025-02-20 03:58:15 -0600
2025-02-20 03:58:04 -0600
2025-02-20 03:51:41 -0600
2025-02-20 03:35:46 -0600
2025-02-20 03:31:59 -0600
2025-02-21 02:20:22 -0600
2025-02-20 12:20:03 -0600
2025-02-20 10:30:53 -0600
2025-02-20 07:44:21 -0600
2025-02-20 07:25:26 -0600
2025-02-20 07:19:42 -0600
2025-02-20 06:40:22 -0600
2025-02-20 06:29:06 -0600
2025-02-20 06:06:46 -0600
2025-02-20 05:03:15 -0600
2025-02-20 04:59:17 -0600
2025-02-20 04:30:19 -0600
2025-02-20 04:29:18 -0600
2025-02-20 03:46:30 -0600
2025-02-20 03:28:23 -0600
2025-02-24 12:39:58 -0600
2025-02-23 07:59:51 -0600
2025-02-20 18:26:04 -0600
2025-02-20 15:12:51 -0600
2025-02-20 12:14:33 -0600
2025-02-20 12:08:46 -0600
2025-02-20 07:13:04 -0600
2025-02-20 06:46:07 -0600
2025-02-20 06:44:43 -0600
2025-02-20 06:30:46 -0600
2025-02-20 06:08:34 -0600
2025-02-20 06:00:05 -0600
2025-02-20 05:45:33 -0600
2025-02-20 05:45:17 -0600
2025-02-20 05:22:50 -0600
2025-02-20 05:06:30 -0600
2025-02-20 05:02:23 -0600
2025-02-20 04:32:46 -0600
2025-02-20 04:12:46 -0600
2025-02-20 04:01:00 -0600
2025-02-20 10:53:47 -0600
2025-02-20 10:40:10 -0600
2025-02-20 07:03:47 -0600
2025-02-20 07:03:07 -0600
2025-02-20 07:01:09 -0600
2025-02-20 06:58:39 -0600
2025-02-20 06:27:40 -0600
2025-02-20 05:47:16 -0600
2025-02-20 05:06:41 -0600
2025-02-20 04:44:27 -0600
2025-02-20 04:38:34 -0600
2025-02-20 03:46:42 -0600
2025-07-08 07:47:35 -0500