Thursday, December 11, 2025

Using general purpose LLMs to help with set theory questions

Are general purpose LLMs useful for figuring things out in set theory? Here is a story about two experiences I recently had. Don’t worry about the mathematical details.

Last week I wanted to know whether one can prove a certain strengthened version of Cantor’s Theorem without using the Axiom of Choice. I asked Gemini. The result was striking: it looked like a proof, but at crucial stages it degenerated into weirdness. Gemini started the proof as a reductio, correctly proved a bunch of things, and then claimed that this led to a contradiction. It then said a bunch of stuff that didn’t yield a contradiction, and then said the proof was complete. Then it said a bunch more stuff that sounded like it was kind of seeing that there was no contradiction.

The “proof” also had a step that needed more explanation, and Gemini offered to give one. When I accepted its offer, it said something that sounded right, but it implicitly used the Axiom of Choice, which I had expressly told it in the initial problem statement it was not supposed to use. When I called it on this, it admitted it, but defended itself by saying it was using a widely accepted weaker version of Choice (true, but irrelevant).

ChatGPT screwed up in a different way. Both LLMs produced something that at the local level looked like a proof, but wasn’t. I ended up asking MathOverflow and getting a correct answer.

Today I was thinking about Martin’s Axiom, which is something I am very unfamiliar with. Along the way, I wanted to know whether:

  1. There is an upper bound on the cardinality of a compact Hausdorff topological space that satisfies the countable chain condition (ccc).

Don’t worry about what the terms mean. Gemini told me this was a “classic” question and the answer was positive. It said that the answer depended on a “deep” result of Shapirovskii from 1974 that implied that:

  2. Every compact Hausdorff topological space satisfying the ccc is separable.

A warning bell that I failed to heed sufficiently was that Gemini’s exposition of Shapirovskii included the phrase “the cc(X) = cc(X) implies d(X) = cc(X)”, which is not only ungrammatical (“the”?!) but has a trivial antecedent.

I had trouble finding an e-text of the Shapirovskii paper (which, from the title, is on a relevant topic), so I asked ChatGPT whether (2) is true. Its short answer was: “Not provable in ZFC.” It then said that the existence of a counterexample is independent of the ZFC axioms. Well, I Googled a bit more, and found that the falsity of (2) follows from the ZFC axioms, given the highest-ranked answer here combined with the (very basic) Tychonoff theorem (I am not just relying on authority here: I can see that the example in the answer works). Thus, the “Not provable” claim was just false. I suspect that ChatGPT got its wrong answer by reading too much into a low-ranked answer on the same page: the low-ranked answer gave a counterexample whose existence is independent of the ZFC axioms, but did not claim that every counterexample is like that.
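
For the curious, here is a sketch of one standard ZFC counterexample of this kind, using only textbook facts; the linked answer may well use a different example.

  % Sketch of a standard ZFC counterexample to (2); it may or may not be
  % the example in the linked answer.
  % Let \kappa be any cardinal greater than the continuum and put
  \[ X = \{0,1\}^{\kappa}, \qquad \kappa > 2^{\aleph_0}. \]
  % X is compact Hausdorff by the Tychonoff theorem (each factor is a
  % finite discrete space), and any product of separable spaces is ccc,
  % so X is ccc.
  % But \{0,1\}^{\kappa} is separable only when \kappa \le 2^{\aleph_0}
  % (Hewitt--Marczewski--Pondiczery), so X is not separable.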

A tiny bit of thought about the counterexample to (2) made it clear to me that the answer to (1) was negative.
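
Spelled out, assuming a counterexample of the form sketched above:

  % The construction scales with \kappa: for every \kappa > 2^{\aleph_0},
  % \{0,1\}^{\kappa} is a compact Hausdorff ccc space of cardinality
  \[ \bigl|\{0,1\}^{\kappa}\bigr| = 2^{\kappa} \ge \kappa, \]
  % so no single cardinal bounds the size of all compact Hausdorff ccc
  % spaces, and the answer to (1) is negative.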

I then asked Gemini, in a new session, directly about (2). It gave essentially the same incorrect answer as ChatGPT, but with a bit more detail. Amusingly, this contradicts what Gemini said in answer to my initial question.

Finally, just as I was writing this up, I asked ChatGPT directly about (1). It correctly stated that the answer to (1) is negative. However, parts of its argument were incorrect: it gave an inequality (whose correctness I haven’t checked), but its argument then relied on the opposite inequality.

So, here’s the upshot. On my first set-theoretic question, the incorrect answers of both LLMs did not help me in the least. On my second question, Gemini was wrong, but it did point me to a connection between (1) and (2) (which I should have seen myself), and further investigation led me to negative answers to both (1) and (2). Both Gemini and ChatGPT got (2) wrong. ChatGPT got the answer to (1) right (which it had a 50% chance of, I suppose) but got the argument wrong.

Nonetheless, on my second question Gemini did actually help me: the connection it pointed me to, along with MathOverflow, led me to the right answer. If you know what you’re doing, you can get something useful out of these tools. But it’s dangerous: you need to be able to extract kernels of truth from a mix of truth and falsity. You can’t trust anything set-theoretic the LLM gives you, not even if it gives a source.
