Jerry Chi is the head of Japan for Stability AI, one of the companies leading the generative artificial intelligence boom. Since opening the Tokyo office a year ago, Chi’s team has released a string of new products, including a language model, an image-to-text generator, and a text-to-image generator tailored to Japanese language and culture.

This interview has been edited for length and clarity.

Why do you think it’s important for generative AI developers to build tools for specific languages?

This is part of the original vision of Stability AI. We set out to democratize AI globally, to cater to the needs of various languages, cultures, and countries all over the world. It would actually be dystopian if globally all the AI systems had the values of a 35-year-old male living in San Francisco. That doesn’t represent the values of the entire world.

In November, you released the text-to-image generator Japanese Stable Diffusion XL. What’s the difference between a user entering a prompt into the Japanese generator and using a translated prompt with the original tool?

One example is the prompt, “a high school boy.” Even if you let the user input translated Japanese into the original Stable Diffusion or a third-party American-built AI such as DALL-E 3, you get a white male schoolboy. Using Japanese Stable Diffusion XL, you get an Asian or Japanese schoolboy. Even for the same word, there are different nuances. Just using simple machine translation doesn’t capture the different implications or ways that words are typically used.

How do you customize your Japanese models to address AI bias?

Every model has bias. You cannot completely eliminate that bias. You can only try to control which kinds of bias it has. We wanted to give this model more of a Japanese bias so that it generates an image that Japanese people might typically think of when trying to picture a prompt. In the long term, Stability AI wants to have a model for each language and culture, and specific to each industry.