The humor of Japanese jokes according to ChatGPT: A comparison with humans

Junzaburo Nakagawa and Hiroki Nomoto
Tokyo University of Foreign Studies
{nakagawa.junzaburo.v0, nomoto}@tufs.ac.jp

This is the English translation (by DeepL and ChatGPT) of the following paper:
Nakagawa, Junzaburo and Hiroki Nomoto. 2025. ChatGPT ga kangaeru nihongo jooku no omoshirosa: Ningen tono hikaku [The humor of Japanese jokes according to ChatGPT: A comparison with humans]. Proceedings of the Thirty-First Annual Meeting of the Association for Natural Language Processing, 553-558. [data]
BibTeX
@InProceedings{NakagawaNomoto25,
	author = {Nakagawa, Junzaburo and Nomoto, Hiroki},
	year = {2025},
	title = {{ChatGPT} ga kangaeru nihongo jooku no omoshirosa: Ningen tono hikaku},
	booktitle = {Proceedings of the Thirty-First Annual Meeting of the {A}ssociation for {N}atural {L}anguage {P}rocessing},
	pages = {553-558},
	note = {The humor of {J}apanese jokes according to {ChatGPT}: A comparison with humans},
	url = {https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/D2-5.pdf}
}
			
Please cite the original paper in Japanese, not this English translation.

Abstract

This paper compares how ChatGPT (GPT-4o) and humans rate Japanese jokes. We examined the ratings of 18 jokes (all but one are dialogues), nine created by humans and nine generated by ChatGPT, in terms of 'funniness', 'offensiveness' and 'intelligibility'. The results showed that ChatGPT was more lenient than humans on 'funniness' and 'intelligibility' but stricter on 'offensiveness', and that, contrary to a similar study on English jokes, ChatGPT-generated Japanese jokes were rated lower than those created by humans. We argue that this is due to a level of objectivity not found in humans and an insufficient ability to calculate meanings compositionally.

1. Introduction[1]

In recent years, various types of generative AI have emerged and continue to evolve rapidly. Although generative AIs have demonstrated capabilities equal or superior to humans in machine translation, text summarisation, etc., they are also known to have weaknesses in some areas. For example, a lack of emotional understanding and empathy, a lack of understanding of cultural backgrounds, and limited creativity have been pointed out for ChatGPT (Abujaber et al. 2023; Kalla et al. 2023). Understanding the humour of jokes (hereafter simply 'jokes') is a task likely to be affected by such weaknesses. When people hear a joke in a dialogue, they have reactions such as 'funny' or 'offensive', and the degree of these reactions depends on the content of the joke. To what extent do generative AIs have the reactions that humans have when they hear a joke? In this paper, ChatGPT (GPT-4o) and humans are asked to rate 18 Japanese jokes in terms of 'funniness', 'offensiveness' and 'intelligibility', and the results are compared. Although there are studies on the understanding of English jokes by generative AI, to our knowledge there are none on Japanese jokes.

2. Related work

Jentzsch and Kersting (2023) tested ChatGPT's competence with jokes by having it generate and explain English jokes. Regarding generation, they point out that more than 90% of all jokes produced were instances of the same 25 jokes, with the top four accounting for more than 50% of the total. The jokes take the form of self-answering questions: 'Why...? Because...'. Regarding explanation, valid explanations were provided for almost all the jokes.

In Gorenz and Schwarz (2024), the joke-generating abilities of ChatGPT and humans were compared by asking 200 crowdsourced Americans to rate ChatGPT-generated and human-made jokes in English. The results showed that ChatGPT-generated jokes were rated higher than human-made jokes, despite the fact that ChatGPT is emotionless.

As a language resource for jokes, there is an English dataset called UR-FUNNY (Hasan et al. 2019), a multimodal dataset consisting of text, audio and video that contains 8,257 humorous and non-humorous examples. To the best of our knowledge, no comparable dataset exists for Japanese.

3. Methodology

In this study, a total of 18 Japanese jokes (9 jokes made by humans and 9 jokes made by ChatGPT (GPT-4o)) were rated by ChatGPT itself and by humans.

3.1 Japanese jokes examined

Table 1 summarises the Japanese jokes studied, where (a), (b) and (c) in the IDs indicate the categories described below.

Table 1. Japanese jokes examined in this study (English translations)
ID Joke
H1 (a) 'They say "If you're happy, clap your hands," but what should an unhappy person clap?'
'...The ruling party...'
H2 (a) 'Hello, Kenta? Is your dad there?'
'I don't need him!'
(Tanaka 2017)
H3 (a) Doctor: 'Hmm, Mrs. Smith. It seems you're pregnant.'
Mrs. Smith: 'Oh my God, that's wonderful! I'm pregnant?'
Doctor: 'I said you *look* pregnant. You need to lose weight.'
(JAPAN JOURNALS 2020)
H4 (b) Senior: 'Hey, newbie, let's have an open conversation today.'
Newbie: 'Are you telling me to commit seppuku?'
H5 (b) 'Today is our senior's retirement match, so we should let him shine.'
'Huh?! Sorry, I'll go buy a bouquet right away!'
H6 (b) Wife: 'If I had known you were this poor, I would never have married you.'
Husband: 'But you kept saying I was your everything before we got married!'
(JAPAN JOURNALS 2020)
H7 (b) My wife told me, 'You're one in a million.'
One day, I peeked at her phone, and she was right.
(JAPAN JOURNALS 2020)
H8 (b) 'They say an apple a day keeps the doctor away, but is that really true?'
'Yeah, if you aim well and throw it at them.'
(JAPAN JOURNALS 2020)
H9 (c) Doctor: 'Your husband needs complete rest. Here, I'm prescribing sleeping pills.'
Wife: 'When should I give them to him?'
Doctor: 'No, you should take them.'
(Nakano 2002)
C1 (a) 'Hey, I heard my friend who worked at a café quit.'
'Why?'
'He said he couldn't take it anymore. He got tired of working so "bean"-tifully every day!'
C2 (a) 'Recently, my friend quit working at a bakery.'
'Why?'
'The job was so tough, he said he was completely "bread" out!'[2]
C3 (b) 'Recently, my fisherman friend quit his job.'
'Why?'
'His boss always told him, "Don't pull others down," but since he was catching crabs, that was impossible!'
C4 (b) 'Yesterday, my friend suddenly collapsed on the street!'
'Oh no, was he okay?'
'Yeah, he said "I got tripped up," but the real cause was a banana peel.'
C5 (b) 'Yesterday, my friend said, "I can't lift my head," so I asked him what happened.'
'So, what was it?'
'Turns out he just had a stiff neck!'
C6 (b) 'Yesterday, my friend suddenly started crying on the phone.'
'What happened?'
'He said, "My chest feels like it's about to burst." But when I asked, it turned out all the buttons on his shirt had popped off.'
C7 (b) 'My doctor friend said, "I'm so busy my head is spinning."'
'That sounds rough. What happened?'
'When I saw him next, he was testing his own inner ear balance.'
C8 (b) 'The other day, my friend said, "I'd take even a cat's help right now." So, I borrowed a cat from the pet shop.'
'So what happened?'
'My room just got covered in fur; it didn't help at all.'
C9 (b) 'The other day, my friend said, "Even a dog that walks will bump into a stick." So, I tried walking a dog.'
'So what happened?'
'Instead of a stick, I bumped into a telephone pole. Me, not the dog.'

All jokes except H7 are in dialogue form. H1–H9 were created by humans, whereas C1–C9 were generated by ChatGPT. H1 was obtained from a video posted on YouTube Shorts on 13 November 2024 by a YouTuber called Risoukyou Purojekuto [Utopia Project].[3] H3 and H6–H9 are English jokes translated into Japanese (H9 is the author's translation), H4 and H5 are the author's own creations, and C1–C9 were generated with the prompt shown in Figure 1.

Evaluation criteria
Funniness: '1 very boring', '2 somewhat boring', '3 neither', '4 somewhat funny', '5 very funny'
Offensiveness: '1 not at all offensive', '2 not very offensive', '3 neither', '4 somewhat offensive', '5 very offensive'
Intelligibility: '1 very easy to understand', '2 somewhat easy to understand', '3 neither', '4 somewhat difficult to understand', '5 very difficult to understand'

Create a joke that would get full marks on these scoring criteria. The joke should make use of word polysemy, idiomatic expressions and stereotypes.
Figure 1. Prompt used for joke generation
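
For concreteness, the generation step can be approximated with a short script. The following is a minimal sketch in Python, assuming the official OpenAI client library and the model name gpt-4o; the paper does not specify how ChatGPT was accessed, and the prompt constant below abbreviates the English rendering of Figure 1 (the study used the original Japanese wording).

```python
# Minimal sketch of the joke-generation step. Assumption: API access via the
# openai package (the paper does not say whether the web UI or API was used).
# Requires an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Abbreviated English rendering of the Figure 1 prompt.
GENERATION_PROMPT = (
    "Evaluation criteria\n"
    "Funniness: '1 very boring' ... '5 very funny'\n"
    "Offensiveness: '1 not at all offensive' ... '5 very offensive'\n"
    "Intelligibility: '1 very easy to understand' ... '5 very difficult to understand'\n\n"
    "Create a joke that would get full marks on these scoring criteria. "
    "The joke should make use of word polysemy, idiomatic expressions and stereotypes."
)

def generate_joke() -> str:
    """Request one joke from GPT-4o with the Figure 1 prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_joke())
```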

(a) Polysemy

The common belief that both speaker and listener are using a polysemous word with sense s1 is overturned by an utterance that unexpectedly uses it with a different sense s2.[4] For example, in H1, utsu 'hit' is initially used to mean 'beat', but the next occurrence of the word is used to mean 'criticise'.

(b) Idiomatic expressions

The common belief that both speaker and listener are using an idiomatic expression in its idiomatic meaning is overturned by an utterance that unexpectedly uses it in its literal meaning. For example, in H4, the senior uses the idiomatic expression hara wo watte hanasu (literally 'to cut one's belly open and talk') in its idiomatic meaning of 'to confide one's true feelings', while the newbie takes it in its literal meaning, hence the reference to seppuku.

(c) Stereotypes

The common belief that both speaker and listener are conversing based on stereotypes associated with certain roles and positions in society is overturned by an utterance that is not based on stereotypes. For example, in H9, the stereotype 'women are talkative', which the doctor assumes to be in the common ground, is not actually shared by the woman, leading to a misunderstanding. Without knowledge of the stereotype, the doctor's explanation that 'it is necessary to put the talkative wife to sleep in order to put the husband to sleep' cannot be understood.

3.2 How jokes are evaluated

For the evaluation of the jokes by ChatGPT, prompts of the form shown in Figure 2 were used. C1–C9 had been generated by ChatGPT itself under the instruction to aim for a perfect score, and these jokes were then evaluated again by ChatGPT.

[Joke]
Please rate the above joke in terms of its 'funniness', 'offensiveness' and 'intelligibility'. Please also explain which part of the joke constitutes the funny point. (Followed by the evaluation criteria shown in Figure 1.)
Figure 2. Prompt used for joke evaluation
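
The evaluation step can be sketched analogously. The function below is illustrative and reuses the client from the generation sketch; note that the paper does not state how the three scores were extracted from the model's free-text reply.

```python
# Minimal sketch of the joke-evaluation step, reusing `client` from the
# generation sketch above. English rendering of the Figure 2 prompt; the
# three 1-5 scores must still be read off the free-text reply.
EVALUATION_TEMPLATE = (
    "{joke}\n\n"
    "Please rate the above joke in terms of its 'funniness', 'offensiveness' "
    "and 'intelligibility', and explain which part of the joke constitutes "
    "the funny point. (Use the evaluation criteria of Figure 1.)"
)

def evaluate_joke(joke: str) -> str:
    """Ask GPT-4o to rate one joke on the three 5-point scales."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": EVALUATION_TEMPLATE.format(joke=joke)}],
    )
    return response.choices[0].message.content
```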

The humans scored the jokes using the criteria shown in Figure 1, administered via Google Forms. The respondents were 31 university students in their 20s.
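
The per-joke descriptive statistics reported in the Appendix (mean, median, mode and standard deviation) can be computed from the raw 1–5 responses with Python's standard library alone. A sketch follows; the ratings list is hypothetical example data, not the study's responses.

```python
# Sketch: per-joke descriptive statistics as in Tables 4 and 5.
# `ratings` is hypothetical example data (the real study had 31 respondents).
from statistics import mean, median, multimode, pstdev

ratings = [4, 4, 5, 3, 4, 2, 5, 4]

print(f"Mean:   {mean(ratings):.2f}")
print(f"Median: {median(ratings)}")
print("Mode:   " + " / ".join(str(m) for m in multimode(ratings)))  # '4 / 5' etc. for ties
print(f"SD:     {pstdev(ratings):.2f}")  # population SD; the paper does not say which variant was used
```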

4. Results and discussion

Tables 2 and 3 summarise the ratings of the human-created and the ChatGPT-generated Japanese jokes, respectively. The human ratings are shown as averages over the 31 respondents (see the Appendix for more detailed statistics). Cells where the ChatGPT and human ratings differ by more than one point are highlighted in bold.

Table 2. ChatGPT and human evaluation of human-created Japanese jokes
ID Funniness Offensiveness Intelligibility
ChatGPT Human ChatGPT Human ChatGPT Human
H1 (a) 4 3.84 3 1.81 5 3.45
H2 (a) 4 3.29 1 2.26 5 3.84
H3 (a) 4 3.45 2 2.13 5 4.13
H4 (b) 4 2.61 1 1.65 5 4.19
H5 (b) 4 2.71 3 1.58 5 4.10
H6 (b) 4 3.06 2 1.71 5 2.94
H7 (b) 4 2.52 3 1.68 5 2.16
H8 (b) 5 3.29 1 1.55 5 3.26
H9 (c) 4 3.19 2 1.97 5 2.74
Table 3. ChatGPT and human evaluation of ChatGPT-generated Japanese jokes
ID Funniness Offensiveness Intelligibility
ChatGPT Human ChatGPT Human ChatGPT Human
C1 (a) 5 3.10 1 1.35 5 3.97
C2 (a) 5 2.39 1 1.48 5 4.10
C3 (b) 5 2.51 1 1.45 5 2.84
C4 (b) 5 2.03 1 1.39 5 2.77
C5 (b) 5 2.19 1 1.48 5 3.42
C6 (b) 5 2.42 1 1.65 5 3.48
C7 (b) 5 2.61 1 1.35 5 3.61
C8 (b) 5 2.06 1 1.42 5 4.10
C9 (b) 5 2.77 1 1.39 5 3.23
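
The bolding rule for Tables 2 and 3 (ChatGPT and human ratings differing by more than one point) is mechanical and can be checked with a few lines of Python. The sketch below copies the 'funniness' column of Table 2 as input.

```python
# Sketch: flag cells where ChatGPT's rating and the human mean differ by more
# than one point, as in Tables 2 and 3. Data: 'funniness' column of Table 2,
# as pairs (ChatGPT rating, human mean).
funniness = {
    "H1": (4, 3.84), "H2": (4, 3.29), "H3": (4, 3.45),
    "H4": (4, 2.61), "H5": (4, 2.71), "H6": (4, 3.06),
    "H7": (4, 2.52), "H8": (5, 3.29), "H9": (4, 3.19),
}

for joke_id, (gpt, human) in funniness.items():
    diff = gpt - human
    marker = "  <-- bold (gap > 1 point)" if abs(diff) > 1 else ""
    print(f"{joke_id}: ChatGPT {gpt}, human {human:.2f}, diff {diff:+.2f}{marker}")
```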

Firstly, ChatGPT scored every joke higher than the humans did on 'funniness' and 'intelligibility'; for 'intelligibility', it gave all the jokes a perfect 5. On the other hand, for three of the human-created jokes, ChatGPT's 'offensiveness' score was more than one point higher than the human average. ChatGPT is thus more lenient than humans on 'funniness' and 'intelligibility' but stricter on 'offensiveness'. One reason for ChatGPT's high 'intelligibility' ratings is objectivity. Normally, once humans have associated one meaning (or structure) with a linguistic form, they cannot attend to the other meanings the form has; the same psychological effect occurs when looking at a trompe l'oeil. To understand a joke based on polysemy or an idiomatic expression, one must clear the hurdle of noticing the other meaning. ChatGPT probably faces no such hurdle, because it can attend to multiple meanings equally well.

Next, we turn to the jokes generated by ChatGPT itself (Table 3). While there was little difference for 'offensiveness', ChatGPT's ratings diverged considerably from the humans' for 'funniness' and 'intelligibility': humans found many of the ChatGPT-generated jokes neither funny nor easy to understand. In particular, for 'funniness' there is a difference of more than two points between ChatGPT's own rating and the human average for every joke except C1. This is the opposite of the result obtained for English: as we saw in section 2, Gorenz and Schwarz (2024) reported that humans rated ChatGPT's English jokes higher than human-made ones.

There are several possible reasons why humans do not find ChatGPT's Japanese jokes funny. Firstly, many jokes have a certain level of offensiveness, but ChatGPT's jokes do not. For example, the human-created H1 contains an element of social satire. This is consistent with ChatGPT's strict attitude towards 'offensiveness'. Secondly, ChatGPT may not properly grasp the literal meanings of Japanese idiomatic expressions. One is unlikely to say ashimoto wo sukuwareru 'my feet get scooped' (C4) of simply falling down, or atama ga agaranai 'I can't raise my head' (C5) of cricking one's neck while sleeping. The literal meaning is obtained by composing the meanings of the individual elements, but such compositional semantic calculation appears to be difficult for ChatGPT.

5. Conclusion

This paper investigated the differences between ChatGPT's and humans' ratings of Japanese jokes. The jokes examined included ones generated by ChatGPT itself, and we showed that, contrary to previous research on English jokes, the Japanese jokes generated by ChatGPT received lower ratings than those created by humans.

Finally, this study has at least three shortcomings. Firstly, only 18 jokes were studied. Secondly, the joke categories are not balanced: most of the jokes involve idiomatic expressions, only one involves stereotypes, and the polysemy jokes rest on lexical ambiguity only, not structural ambiguity. These problems should become easier to solve once a dataset of Japanese jokes comparable to UR-FUNNY (Hasan et al. 2019) is constructed.

The third problem is that the respondents in the human rating experiment were all university students, so their demographics are imbalanced. Using crowdsourcing to reach a larger number of respondents with diverse attributes, as Gorenz and Schwarz (2024) did, should lead to a more comprehensive picture of how humans evaluate Japanese jokes.

Notes

  1. The bulk of this study is based on the first author's graduation thesis (Nakagawa 2025).
  2. As pan 'bread' and panpan 'badly swollen' are two different words, ChatGPT here failed to generate a joke based on polysemy.
  3. https://youtube.com/shorts/MWTumxRu9tI?si=JCBhb3kdoaMrY8ml
  4. Polysemy can also be based on structural ambiguity, but Table 1 contains no such jokes.

References

  1. Nakagawa, Junzaburo. 2025. AI to Ningen no Kyoukaisen: ChatGPT no Tokuibunya to Nigatebunya wo Tooshite Miru "Ningen wo Ningentarashimeteiru Mono" towa [The Boundary between AI and Humans: What Makes Humans Human? Seen through ChatGPT's Strengths and Weaknesses]. Tokyo University of Foreign Studies, BA thesis.
  2. Abujaber, Ahmad A., Alaa Abd-alrazaq, Ahmad R. Al-Qudimat and Abdulqadir J. Nashwan. 2023. A strengths, weaknesses, opportunities, and threats (SWOT) analysis of ChatGPT integration in nursing education: A narrative review. Cureus 15(11): e48643. https://doi.org/10.7759/cureus.48643
  3. Kalla, Dinesh, Nathan Smith, Fnu Samaah and Sivaraju Kuraku. 2023. Study and analysis of Chat GPT and its impact on different fields of study. International Journal of Innovative Science and Research Technology 8(3): 827–833. https://ssrn.com/abstract=4402499
  4. Jentzsch, Sophie and Kristian Kersting. 2023. ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, 325–340. Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.wassa-1.29
  5. Gorenz, Drew and Norbert Schwarz. 2024. How funny is ChatGPT? A comparison of human- and A.I.-produced jokes. PLoS ONE 19(7): e0305364. https://doi.org/10.1371/journal.pone.0305364
  6. Hasan, Md Kamrul, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency and Mohammed (Ehsan) Hoque. 2019. UR-FUNNY: A multimodal language dataset for understanding humor. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2046–2056. Hong Kong, China. Association for Computational Linguistics. https://aclanthology.org/D19-1211/
  7. Tanaka, Masatoshi. 2017. Jooku no goyouronteki kousatu to ibunkarikai eno ouyou [A pragmatic analysis on jokes and intercultural understanding]. Toyohogaku 61(2): 321–334.
  8. JAPAN JOURNALS. 2020. Igirisu no yorinuki jookushuu [Best British jokes]. ONLINE Jaanii. https://www.japanjournals.com/life/english-jokes.html (last accessed: 9/1/2025)
  9. Nakano, Seiji. 2002. Jooku no naka no ningenmoyou [The way people are in English jokes]. Takaokatankidaigaku Kiyou 17: 139–148. https://doi.org/10.15099/00007400

Appendix

The following tables summarise the details of the human evaluations of the human-created jokes (H1–H9) and the ChatGPT-generated jokes (C1–C9), respectively.

Table 4. Human evaluation of human-created Japanese jokes
ID Funniness Offensiveness Intelligibility
Mean Median Mode SD Mean Median Mode SD Mean Median Mode SD
H1 3.84 4 4 / 5 0.74 1.81 1 1 1.06 3.45 4 3 / 4 / 5 1.27
H2 3.29 4 4 0.99 2.26 2 1 1.27 3.84 4 4 0.95
H3 3.45 4 4 0.87 2.13 2 1 / 2 1.10 4.13 4 4 0.86
H4 2.61 2 2 1.18 1.65 1 1 1.09 4.19 4 5 0.78
H5 2.71 3 4 1.04 1.58 1 1 1.01 4.10 4 4 0.89
H6 3.06 3 4 1.36 1.71 1 1 1.19 2.94 3 4 1.22
H7 2.52 2 1 1.23 1.68 1 1 1.09 2.16 2 1 1.19
H8 3.29 3 4 1.03 1.55 1 1 0.91 3.26 3 4 1.32
H9 3.19 3 4 1.03 1.97 1 1 1.31 2.74 4 4 1.19
Table 5. Human evaluation of ChatGPT-created Japanese jokes
ID Funniness Offensiveness Intelligibility
Mean Median Mode SD Mean Median Mode SD Mean Median Mode SD
C1 3.10 3 2 1.18 1.35 1 1 0.90 3.97 4 4 0.97
C2 2.39 2 2 1.13 1.48 1 1 1.01 4.10 4 5 0.86
C3 2.51 2 2 1.07 1.45 1 1 0.98 2.84 3 2 1.27
C4 2.03 2 1 0.98 1.39 1 1 0.90 2.77 3 1 1.43
C5 2.19 2 1 1.09 1.48 1 1 0.91 3.42 4 4 1.21
C6 2.42 2 2 1.09 1.65 1 1 1.06 3.48 4 4 1.16
C7 2.61 2 4 1.21 1.35 1 1 0.82 3.61 4 4 1.10
C8 2.06 2 1 1.17 1.42 1 1 0.94 4.10 4 4 0.86
C9 2.77 3 2 1.26 1.39 1 1 0.87 3.23 3 3 1.36
