Stable Diffusion: Difference between revisions
Revision as of 18:23, 31 August 2022
Stable Diffusion is an open-source diffusion model for generating images from textual descriptions. Note: as of writing there is rapid development on both the software and the user side. Take everything you read here with a grain of salt.
How to Use
Usage instructions for both online and local use.
Getting started
- beta.dreamstudio.ai: official web service
- Official Github page
- ULTIMATE GUI RETARD GUIDE: step-by-step instructions for running Stable Diffusion on Windows with the newest features.
- basujindal fork: fork that uses less VRAM at the cost of speed
- waifu-diffusion fork: fork that ???
gradio
gradio is a graphical user interface for generating images locally with Stable Diffusion. A short explanation of what the options for txt2img do:
- Prompt: textual description of what you want to generate.
- Sampling Steps: diffusion algorithms work by making small steps from random noise towards an image that fits the prompt. This is how many such steps should be done. Diminishing returns.
- Sampler: which sampling algorithm to use, use k-diffusion if you're unsure.
- Skip sample save: when ticked, do not save individual images to disk.
- Skip grid save: when ticked, do not save a grid of all images at the end.
- Increment seed: when ticked, explicitly set the seed with each generation iteration. This makes it possible to recreate a specific image that you encounter in a larger run.
- DDIM ETA: amount of randomness when using DDIM.
- Sampling Iterations: how many times to generate a set of images.
- Samples Per Iteration: how many images to generate at the same time. Increasing this value can improve performance but you also need more VRAM. The total number of images is this value multiplied by Sampling Iterations.
- Classifier-free Guidance Scale: how strongly the images match your prompt. Increasing this value will result in images that resemble your prompt more closely (according to the model) but it also degrades image quality after a certain point.
- Seed: starting point for RNG. Keep this the same to generate the same (or almost the same) images multiple times.
- Width: width of individual images in pixels. To increase this value you need more VRAM. Image coherence on large scales becomes worse as the resolution increases.
- Height: same as Width but for individual image height. The aspect ratio influences the content of generated images; if height is higher than width you get for example more portraits, while you get more landscapes if width is higher than height.
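A few of the settings above can be illustrated with a small Python sketch. It is not code from the gradio UI: the function names are made up for this example, the "noise" is a short list of numbers rather than an image-sized tensor, and the guidance formula is the standard classifier-free guidance combination, which is assumed here to be what the scale controls.

```python
import random

def total_images(sampling_iterations, samples_per_iteration):
    """Total number of images produced by one run."""
    return sampling_iterations * samples_per_iteration

def initial_noise(seed, n_values=4):
    """Toy stand-in for the starting noise: the same seed gives the same values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_values)]

def guided_prediction(unconditional, conditional, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction towards the prompt-conditioned one, scaled by `scale`."""
    return [u + scale * (c - u) for u, c in zip(unconditional, conditional)]

print(total_images(sampling_iterations=3, samples_per_iteration=4))
print(initial_noise(42) == initial_noise(42))
print(guided_prediction([0.0, 1.0], [1.0, 1.0], scale=7.5))
```

With a scale of 1 the guided prediction equals the conditional one; larger values push further in the direction of the prompt, which is consistent with images matching the prompt more closely but degrading past a certain point.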
Example Prompts
Some example prompts that highlight different ways of using Stable Diffusion (intended to be educational, not to show the absolute best prompts). Straightforward prompt for a photorealistic drawing of the face of a conventionally attractive woman:
thick lips, black hair, fantasy background, cinematic lighting, highly detailed, sharp focus, digital painting, art by junji ito and WLOP, professional photoshoot, instagram
Cryptic prompt that inadvertently produces Albert Einstein cryptocurrency:
HODL\0HODL\0E=MC2\01299d18a-1b5b-4c20-8580-8534af5e4995, 4K, 8K, award-winning
"4K, 8K, award-winning" are just generic buzzwords to make the images look better. Everything necessary to understand what's happening can be found on this wiki page.
Prompt to generate non-hideous shortstacks:
photorealistic render of a plump halfling maiden cooking in the kitchen, pixar, disney, elf ears, pointed ears, big eyes, large cleavage, busty, cute, adorable, artists portrait, fantasy, highly detailed, digital painting, concept art, sharp focus, depth of field blur, illustration
Shortstacks are frequently associated with goblins, but the training data contains a lot of examples for "goblin girl" that are just plain ugly. Because diffusion models learn to create images that resemble their training data, any prompt that uses "goblin girl" will produce similarly ugly-looking goblin girls. The trick of this prompt is to avoid the association with goblins and to instead go with "halfling", which has much more aesthetically pleasing examples in the dataset. To make the resulting shortstacks look more like goblins, the prompt can be modified, for example by specifying green skin.
Prompt Design
Guidelines for creating better prompts.
What To Write
Write text that would be likely to accompany the image you want. Typically this means that the text should simply describe the image. But this is only half of the process because a description is determined not just by the image but also the person writing the description.
Imagine for a moment that you were Chinese and had to describe the image of a person. Your word of choice would likely no longer be "person" because your native language would be Chinese and that is not how you would describe a person in Chinese. You wouldn't even use Latin characters to describe the image because the Chinese writing system is completely different. At the same time, the images of people that you would be likely to see would be categorically different; if you were Chinese you would primarily see images of other Chinese people. In this way the language, the way something is said, is connected to the content of images. Two terms that theoretically describe the same thing can be associated with very different images and any model trained on these images will implicitly learn these associations. This is very typical of natural language where there are many synonymous terms with very different nuances; just consider that "feces" and "shit" are very different terms even though they technically describe the same thing.
TLDR: when choosing your prompt, think not just about what's in the image but also who would say something like this.
Prompt Length
Be descriptive. The model does better if you give it longer, more detailed descriptions of what you want. Use redundant descriptions for parts of the prompt that you care about.
Note, however, that there is a hard limit on the length of prompts. Everything after a certain point (75 or 76 CLIP tokens, depending on how you count) is simply cut off. As a consequence it is preferable to use keywords that describe what you want concisely and to avoid keywords that are unrelated to the image you want. Words that use unicode characters (for example Japanese characters) require more tokens than words that use ASCII characters.
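The cut-off can be sketched as follows. This is only an illustration, not the tokenizer Stable Diffusion uses: `toy_tokenize` is a hypothetical stand-in that splits on words and punctuation, whereas the real model uses CLIP's byte-pair encoding, so real token counts will differ. The limit of 75 is taken from the paragraph above.

```python
import re

TOKEN_LIMIT = 75  # from the text above; 76 depending on how you count

def toy_tokenize(prompt):
    """Hypothetical stand-in tokenizer: every word and every punctuation
    mark becomes one token. The real CLIP tokenizer works differently."""
    return re.findall(r"\w+|[^\w\s]", prompt.lower())

def split_at_limit(prompt, limit=TOKEN_LIMIT):
    """Return (tokens that are kept, tokens that are silently cut off)."""
    tokens = toy_tokenize(prompt)
    return tokens[:limit], tokens[limit:]

kept, dropped = split_at_limit("portrait, " * 40 + "red hair")
print(len(kept), "tokens kept;", len(dropped), "tokens cut off:", dropped)
```

In this example the keywords at the very end of the prompt fall past the limit and are lost, which is why it can help to put the parts you care about early in a long prompt.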
Punctuation
Use it. Separating keywords by commas, periods, or even null characters ("\0") improves image quality. It's not yet clear which type of punctuation or which combination works best - when in doubt just do it in a way that makes the prompt more readable to you.
Emphasis
Some people assert that putting a keyword in square brackets or appending an exclamation mark increases its effect while putting a keyword in round brackets decreases its effect; using more brackets or exclamation marks supposedly results in a stronger change. However, when this was tested with simple test prompts this effect could not be reproduced. Specifically, someone made short, simple test prompts that specify two different things and tested how the image changes if one of those things is strengthened with [] while the other thing is weakened with (). The test cases were flowers being red or blue and a woman being a doctor or a vampire. The specific prompts and samples are in the samples archive linked below. If you happen to know a test case where there is an effect, feel free to share it (for example on the discussion page of this article).
The repetition of a certain keyword did work to increase its effect in the test cases when brackets did not.
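Repeating a keyword is easy to automate. The helper below is a hypothetical sketch (the name `repeat_keyword` and the choice to place the repeats next to each other are assumptions, not something the tests above prescribe):

```python
def repeat_keyword(keywords, keyword, times=2):
    """Build a comma-separated prompt in which `keyword` appears `times` times."""
    if times < 1:
        raise ValueError("times must be at least 1")
    result = []
    for kw in keywords:
        # Only the chosen keyword is repeated; everything else appears once.
        result.extend([kw] * (times if kw == keyword else 1))
    return ", ".join(result)

print(repeat_keyword(["flowers", "red", "blue"], "red", times=3))
# prints "flowers, red, red, red, blue"
```

Keep in mind that every repetition also uses up tokens, so heavy repetition competes with the prompt length limit.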
Specificity
The model has essentially learned the distribution of images conditional on a certain prompt. For the training of neural networks the quality of features is important: the stronger the connection between the inputs and the outputs is, the easier it is for a neural network to learn the connection. In other words, if a keyword has a very specific meaning it is much easier to learn how it connects to images than if a keyword has a very broad meaning. In this way, even keywords that are used very rarely like "Zettai Ryouiki" can produce very good results because they are only ever used in very specific circumstances. On the other hand, "anime" does not produce very good results even though it's a relatively common word, presumably because it is used in many different circumstances even if no literal anime is present.
Choosing specific keywords is especially important if you want to control the content of your images. Also: the less abstract your wording is the better. If at all possible, avoid wording that leaves room for interpretation or that requires an "understanding" of something that is not part of the image. Even concepts like "big" or "small" are problematic because they are indistinguishable from objects being close or far from the camera. Ideally use wording that has a high likelihood to appear verbatim on a caption of the image you want.
Movement and Poses
If possible, choose prompts that are associated with only a small number of poses. A pose in this context means a physical configuration of something: the position and rotation of the image subject relative to the camera, the angles of the joints of humans/robots, the way a block of jello is being compressed, etc. The less variance there is in the thing that you're trying to specify the easier it is for the model to learn. Because movement by its very definition involves a dramatic change in the pose of the subject, prompts that are associated with movement frequently result in body horror like duplicate limbs.
TLDR: good image of human standing/sitting is easy, good image of human jumping/running is hard.
Miscellaneous
- Unicode characters (e.g. Japanese characters) work.
- Capitalization does not matter.
- At least some unicode characters that are alternative versions of latin characters get mapped to regular latin characters. Full-width latin characters as they're used in Japanese (e.g. ＡＢＣ) are confirmed to be converted.
- Extra spaces at the beginning and end of your prompt are simply discarded. Additional spaces between words are also discarded.
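The observations in this list can be imitated with a few lines of Python. This is a sketch of the observed behavior, not the actual preprocessing code: using Unicode NFKC normalization as the mapping step is an assumption, and `approximate_normalize` is a made-up name.

```python
import unicodedata

def approximate_normalize(prompt):
    """Rough imitation of how a prompt seems to be cleaned up before use."""
    # NFKC folds compatibility characters, e.g. full-width latin letters,
    # into their regular counterparts (assumed stand-in for the real step).
    text = unicodedata.normalize("NFKC", prompt)
    # Capitalization does not matter.
    text = text.lower()
    # Leading/trailing spaces are dropped, runs of spaces collapse to one.
    return " ".join(text.split())

print(approximate_normalize("  ＡＢＣ   Cat  "))
# prints "abc cat"
```

Two prompts that give the same result here would be expected to behave the same, but only the cases listed above have been confirmed.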
Keywords
The most reliable way to find good keywords is to look at the keywords that are used to generate images that are similar to what you want. Alternatively there are multiple websites that let you explore various art styles and other modifiers (see links below). Below are some (unconventional) known good keywords (as determined by using keywords as prompts without other keywords or in very short and simple prompts). The underlying assumption is that the keywords will also be good as part of large prompts; if they are not, please provide feedback. When the list tells you to avoid keywords the reason is that they simply produce bad outputs. Keywords that produce unexpected unsafe outputs have an explicit warning. An archive with the samples used to judge these keywords can be found here.
Weebshit
Anime and other Japanese things:
- "ahegao" somewhat produces the meme face. Can produce unsafe images.
- "anime": generic, mediocre anime-style images, looks somewhat like the 2000s. Since "anime" is associated with many low-quality/unrelated images a common strategy is to just specify a drawing and use Japanese words in your prompt to associate your prompt with what a Japanese person would be likely to draw (i.e. anime). For style variations try "アニメ" (Japanese way to write anime, looks more modern), "chibi", "ドルフィー" (Japanese doll brand), "Kyoto Animation", "light novel illustration", "shonen", "Studio Ghibli", "visual novel CG", or "Yusuke Murata" (artist of the One-Punch Man manga). Avoid "manga", "tankobon", and "waifu". Order of keywords is simply alphabetical.
- "cosplay": pictures of western people cosplaying. "コスプレ" is pictures of Japanese people cosplaying.
- "gyaru": Japanese women with tanned skin and dyed hair.
- "hentai": bad. "エロアニメ", "エロゲ", "エロ同人", and "エロ漫画" less bad. "エロゲ" and "エロ同人" also produce 3D.
- "ikemen": handsome Japanese men. Avoid "イケ面" (Japanese spelling).
- "Gothic Lolita": frilly black dresses.
- "manga", "tankobon": generic anime-style images, artifacts from text and paneling, also associated with pictures of physical copies. "漫画" and "マンガ" look better but also have artifacts. "漫画" seems to be more associated with manga for adults while "マンガ" is more associated with manga for children.
- "Nekopara": cat girls from the franchise. Avoid "ネコぱら".
- "nendoroid": brand of plastic figures for characters from anime, manga, and video games. Avoid "ねんどろいど".
- "oneshota": cute anime boys.
- "pantsu": vaguely Japanese-looking women in underwear. "panties" looks better. Avoid "パンツ". "shimapan" means striped panties but the result is Japanese porn.
- "Sweet Lolita": frilly pink dresses.
- "to-love-ru": characters from the franchise. Avoid "toraburu" and "To LOVEる".
- "Touhou", "Touhou Project": characters from the franchise. Avoid "東方".
- "waifu": modern Japanese women.
- "Zettai Ryouiki": short skirt in combination with stockings or socks, visible thighs. Avoid "絶対領域" (kanji spelling).
- "アイドル", "aidoru": Japanese idols. "アイドル" is mostly 3D, "aidoru" is mostly 2D.
- "フィギュア": general plastic figurines of anime/manga/video game characters.
- "ガンプラ", "gunpla": Gundam plastic models. Avoid "ganpura".
- "イラストレーション", "イラスト": illustrations in Japanese style (I think, definitely not "anime" style), "イラスト" looks more abstract than "イラストレーション". In limited testing "イラストレーション" looked slightly better than "イラスト".
- "悪魔": frequently translated as "demon", demonic imagery in anime/Japanese style. "akuma" produces the Street Fighter character.
- "美女", "美人": Japanese women, classical beauty standard.
- "男性": literal meaning is just man/male gender but the result is Japanese gay porn.
- "彼女", "kanojo": Japanese women, kanojo also contains 2D and pictures of couples.
- "可愛い", "かわいい": pronounced "kawaii", cute things. On its own "可愛い" produces pictures of birds, "かわいい" general cute things. However, good results were achieved with "可愛い" in long prompts so testing the single keywords may be inaccurate. Avoid "kawaii" and "カワイイ".
- "女": Chinese/Japanese women.
- "巨乳", "爆乳", "おっぱい": Japanese women with large breasts, either topless or wearing a bra.
Subreddits
Stable Diffusion has learned which kind of image gets posted to which subreddit. Unfortunately for most subreddits it has learned incomprehensible garbage, typically because the images contain a lot of text. Subreddits that are essentially just image dumps work pretty well though:
- "r/aww", "r/awwducational": cute images of cats and dogs. Avoid "aww".
- "r/battlestations", "battlestations": pictures of desktop PCs.
- "r/creepy": creepy images, mostly drawings of faces. Avoid "creepy".
- "r/EarthPorn", "EarthPorn": landscape photography.
- "r/evilbuildings", "evilbuildings": buildings that look like they're owned by a super villain or evil corporation. "evil buildings" is random skyscrapers.
- "r/eyes": bright blue eyes + conventionally attractive faces.
- "r/fitgirls", "r/Fitness": muscular women. "Fitness" is just pictures of women working out. "Reddit fitness" seems to be interpreted similarly to "r/Fitness".
- "r/gardening": pictures of home gardens. "gardening" is pictures of garden work.
- "r/GirlsWithGlasses": selfies of women wearing glasses.
- "r/interestingasfuck": can give you cool textures but can also fuck up your images.
- "r/InternetIsBeautiful": abstract colorful images.
- "r/OldSchoolCool": vintage photographs, has more varied and interesting subjects compared to "vintage photograph".
- "r/SkyPorn", "SkyPorn": pictures of the sky.
All of the 100 largest subreddits were tested. The ones not listed here produced either garbage or unremarkable results.
Note that for some subreddits it has been confirmed that "/<subreddit>" and "<random letter>/<subreddit>" produce nearly identical results. These may be adversarial examples: in the training data there are presumably many images associated with the string "r/<subreddit>" and basically none with other letters. Instead of learning the meaning of "r/<subreddit>" SD may therefore have simply learned a meaning for "/<subreddit>" because with the training data the two terms were virtually interchangeable.
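To check this for a subreddit yourself, you can generate the prompt variants and render each one with the same seed and settings. The helper below only builds the strings; its name and the choice of replacement letters are hypothetical.

```python
def subreddit_variants(name, letters="x"):
    """Prompt variants for comparing "r/<subreddit>" against look-alikes."""
    variants = [f"r/{name}", f"/{name}", name]
    # Swap the leading "r" for other letters to test the "<letter>/<subreddit>" case.
    variants += [f"{letter}/{name}" for letter in letters if letter != "r"]
    return variants

for prompt in subreddit_variants("EarthPorn", letters="xq"):
    print(prompt)
```

If the images for "r/EarthPorn", "/EarthPorn" and "x/EarthPorn" look nearly identical while the bare name differs, that matches the behavior described above.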
Scientific Names
SD has learned several scientific terms from biology. The names of species can produce images simply showing that species. Terms that contain more than one type of animal (for taxonomical reasons or e.g. because dogs and wolves belong to the same species) can produce hybrid creatures.
- "Canis lupus": dog-wolf hybrid creatures.
- "Felis catus": cats.
- "Homo sapiens": species that modern humans belong to, disturbing humanoid hybrid creatures. "Homo sapiens sapiens" (anatomically modern humans) produces slightly less disturbing humanoid creatures.
- "invertebrate": hybrids of insects, spiders, etc.
- "Loxodonta africana": African bush elephant, looks better than "elephant" and 🐘.
Miscellaneous
- A random UUID essentially gives you a random type of digital photograph. These are very weak though so you can append a random UUID to your prompt for a slight variation.
- "bronze statue": shiny statues of people.
- "bobs and vagene", "Mr. Dr. Durga sir", "please do the needful": do not redeem the prompt
- "cock and ball torture": with Craiyon this produces a meme result but Stable Diffusion produces the actual thing.
- "cheeki breeki": *bandit_radio.mp3 starts playing in the background*
- Emoji work very well: 🥖🍌👙🍞🇨🇳🍒🐱💪🦶🐸👓💄🍄💅👊🚶☢️🐀💀🌮🇺🇸🔫👰♀️👡. 🦴 produces pictures of dogs. 🔞💦🍆👄🍑🚿😈👅 produce unsafe outputs. Avoid 👽🫦♋🤡🔥💾🎓👠🥵💯😭🦪🧎💩🤰🫃🪒🦏🤳🤚🤣😍🪦🏳💕🧛🤮👋💃👯♀️.
- "geoduck": edible clam with a weird shape.
- "hodl": memecoins. "diamond hands" and "paper hands" are taken literally.
- "hourglass figure": female body type. "rectangle body shape" and "inverted triangle body shape" work somewhat. Avoid combinations with "spoon" and "rectangular" as well as combinations not listed here.
- "E=mc2": Albert Einstein.
- "Gorillaz": somewhat reproduces the associated art style.
- "Lovecraftian", "Necronomicon": occult imagery. "Lovecraftian" produces more tentacles than "Necronomicon".
- "patronus": animals with a blue glow. Could combine with "dog" and "pikachu", not with "Richard Stallman", "geoduck", "cockroach", or "[car brand]".
- "sandals": feet in sandals. Compared to e.g. "feet" the body horror is greatly reduced, presumably because the sandals restrict movement of the toes. 🦶 can produce similarly good results (less consistently, sometimes on bare feet).
- "Snapchat selfie": selfie with just one person in frame. "selfie", "selfie, Snapchat" and "Snapchat, selfie" produce images with multiple people. Avoid "Snapchat".
- "wikiHow": can replicate the style of the illustrations.
- "World Heritage Site": ancient buildings.
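The UUID trick from the list above takes one line with Python's standard `uuid` module. The helper name and the comma separator are assumptions made for this sketch; any punctuation you already use in your prompts should work as well.

```python
import uuid

def with_random_uuid(prompt, separator=", "):
    """Append a random (version 4) UUID to a prompt for a slight variation."""
    return f"{prompt}{separator}{uuid.uuid4()}"

print(with_random_uuid("bronze statue, highly detailed"))
```

Each call appends a different UUID, so running the same base prompt several times this way should give slightly different images even with a fixed seed.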
Tips for Finding Keywords
- Querying the LAION dataset here can be used as a quick way of checking which keywords are in the dataset but keep in mind that this is not 100% reliable.
- Stable Diffusion may not know a certain art style but still know artists who work in that style, and vice versa.
Useful Links
- GFPGAN: Tool for fixing faces
- krea.ai: Website that lets you explore keywords
- promptoMANIA prompt builder
- clip-retrieval: Project that lets you determine the relationship between images and keywords, works in either direction. Online version here
- Archive of samples produced by individual keywords
- Google Arts & Culture: can be used to discover artists, art movements, mediums, etc.