Automatically generating educational videos with GenAI

May 7, 2025

Some learnings from a side project in Generative AI.

Someone in my network asked whether it would be possible to automatically generate educational videos for medical training. As with many generative AI ideas circulating today, I was initially skeptical. Still, rather than turning the idea down, I wanted to see how far we could actually get.

In medical training, a large portion of existing educational content still follows a very traditional format: someone stands in front of a green screen and speaks while a PowerPoint presentation plays behind them. Sometimes it is even less advanced, with just a voiceover explaining slides. Given this low bar, I felt there might be an opportunity to automate large parts of this process using modern GenAI tools.

We decided to focus on short-form content for mobile consumption. Attention spans have become shorter for various reasons, and much of the content people engage with happens on their phones. So the idea was to generate modular, mobile-first educational videos that could be assembled programmatically.

Automatically generating a video

The overall pipeline consists of several independent steps: generating the script, producing a voiceover, aligning visuals like keywords and images, and finally composing everything into one video. Each part was treated as a separate task, which turned out to help improve the overall quality.

Diagram showing the process of generating a video

Our first goal was to see if we could automatically generate a short educational video on the skin, also known as the integumentary system. We began by defining clear learning objectives for the topic. Once we had those, it was straightforward to use a large language model to generate a script that followed the objectives closely. One technical limitation that remains unresolved is getting the LLM to strictly adhere to word or character limits. Even if it claims to count the words, the result is almost always off. What has worked best so far is prompting the model to generate a maximum of six paragraphs, each containing no more than four sentences. While it still occasionally overshoots by a paragraph or two, it stays mostly within bounds and is predictable enough to work with.
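
To give a concrete idea of that step, here is a minimal sketch of the script-generation call, assuming the OpenAI Python client; the model name, function name, and prompt wording are illustrative, not necessarily what ran in this project.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_script(topic: str, objectives: list[str]) -> str:
    """Generate a narration script, constraining structure instead of word count."""
    objective_list = "\n".join(f"- {o}" for o in objectives)
    prompt = (
        f"Write a voiceover script for a short educational video about {topic}.\n"
        f"Learning objectives:\n{objective_list}\n\n"
        "Use at most six paragraphs, each with no more than four sentences. "
        "Write in plain, spoken language suitable for narration."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```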

With a script in hand, we turned to Heygen to generate a synthetic avatar to present the material. The result is clearly an AI-generated avatar, but it compares quite well to many green screen presentations already in use.

See the raw video of the avatar here or examples on Heygen.

We also needed slide content behind the avatar. This presented another challenge. While image generation has improved significantly, it still struggles with text rendering and layout consistency. For example, generating a clean and legible infographic with consistent formatting is not deterministic enough for production use, especially when accuracy matters. As a workaround, I had an LLM extract keywords and phrases from each paragraph of the script. These were then programmatically displayed during the relevant portions of the video.
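
The keyword-extraction workaround could look roughly like the sketch below, again assuming the OpenAI Python client; the prompt and model name are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_keywords(paragraph: str, max_keywords: int = 4) -> list[str]:
    """Pull a few short, display-ready keywords or phrases from one script paragraph."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{
            "role": "user",
            "content": (
                f"Extract at most {max_keywords} short keywords or phrases that capture "
                "the key points of the paragraph below. "
                "Answer with a JSON array of strings and nothing else.\n\n"
                + paragraph
            ),
        }],
    )
    # Assumes the model follows the instruction and returns a bare JSON array.
    return json.loads(response.choices[0].message.content)
```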

To sync these overlays with the voiceover, we needed precise timing metadata. At first, I considered using the subtitles generated by Heygen, but those often had slight inaccuracies. Even a half-second discrepancy was enough to misalign slides and narration. This led me to switch to Elevenlabs for voice synthesis. Their alignment metadata is extremely precise, providing timing down to the millisecond per character. I then fed the Elevenlabs audio into Heygen as input, which worked surprisingly well. The timing was much more consistent, and the overall experience felt smoother.
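
For illustration, here is a sketch of how character-level timings can be turned into keyword display times. It assumes the alignment payload holds parallel "characters" and "character_start_times_seconds" lists, which, as far as I can tell, is the shape returned by Elevenlabs' with-timestamps endpoints; the function name is mine.

```python
def keyword_start_times(script: str, keywords: list[str], alignment: dict) -> list[tuple[str, float]]:
    """Map each keyword to the moment (in seconds) it is first spoken in the narration.

    `alignment` is assumed to hold two parallel lists covering the synthesized text:
    "characters" and "character_start_times_seconds".
    """
    starts = alignment["character_start_times_seconds"]
    timed = []
    for keyword in keywords:
        offset = script.lower().find(keyword.lower())
        if offset == -1 or offset >= len(starts):
            continue  # keyword was paraphrased away, or alignment is shorter than expected
        timed.append((keyword, starts[offset]))
    return timed
```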

For video composition, I used MoviePy, a Python library for programmatic video editing. It allowed me to lay out the text overlays, sync them with the voice, and mask the avatar video on top. This setup meant I could generate a full video from modular components with relatively little manual effort.
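
A trimmed-down sketch of that composition step is shown below, assuming MoviePy 1.x (the `moviepy.editor` module) and an avatar render with an alpha channel; the resolution, font sizes, and display durations are placeholders.

```python
from moviepy.editor import (AudioFileClip, ColorClip, CompositeVideoClip,
                            TextClip, VideoFileClip)

def compose_video(avatar_path, audio_path, keyword_times, out_path="video.mp4"):
    """Layer keyword overlays and the avatar on a vertical canvas, synced to the voiceover."""
    audio = AudioFileClip(audio_path)
    background = ColorClip(size=(1080, 1920), color=(16, 16, 16), duration=audio.duration)
    avatar = (VideoFileClip(avatar_path, has_mask=True)   # expects a render with an alpha channel
              .resize(width=720)
              .set_position(("center", "bottom")))
    overlays = [
        TextClip(kw, fontsize=64, color="white", method="caption", size=(900, None))
        .set_start(start)
        .set_duration(4)                  # show each keyword for a few seconds
        .set_position(("center", 400))
        for kw, start in keyword_times
    ]
    final = CompositeVideoClip([background, avatar, *overlays]).set_audio(audio)
    final.write_videofile(out_path, fps=30)
```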

Generated video: integumentary system

One of the hardest parts to get right was images. In medical training, visuals are essential for a good video. We experimented with several approaches. I tried fine-tuning generative models on medical drawings, searching for open datasets of diagrams, and generating diagrams directly from prompts. Most of these options fell short. Labeling was often wrong, anatomical accuracy was hit-or-miss, and consistency between frames was lacking. In the end, the best hybrid approach was to prompt ChatGPT to generate unlabeled diagrams, which I then labeled manually. While this added some overhead, it gave me control where it mattered most. For higher-stakes visuals, especially when used in real educational settings, working together with a human illustrator might be the only safe option. Still, for quick internal prototypes or demos, this hybrid model is effective enough.
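
The unlabeled-diagram step can also be scripted rather than done through the ChatGPT interface; the sketch below shows one way to do that, assuming the OpenAI images API, with the model name and prompt purely illustrative.

```python
from openai import OpenAI

client = OpenAI()

def generate_unlabeled_diagram(subject: str) -> str:
    """Request a clean, label-free diagram so labels can be added manually afterwards."""
    prompt = (
        f"A clean medical-illustration style diagram of {subject}, "
        "flat colors, white background, no text, no labels, no annotations."
    )
    response = client.images.generate(
        model="dall-e-3",  # illustrative model name
        prompt=prompt,
        size="1024x1024",
        n=1,
    )
    return response.data[0].url  # URL of the generated image
```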

Correctly labelled integumentary system

For testing, I did a couple of follow-up videos on random topics: one about ballpoint pens and another about why cats purr. In those examples, the only manual step was defining learning objectives for each topic. Everything else, from script to visuals, was handled programmatically. For these two, I skipped the avatar entirely to keep API costs down. In the video about the skin, I manually labeled all the diagrams myself to ensure accuracy.

Videos

Integumentary system video

Cat purr video

Ballpoint pen video

Overall, generating a video takes around fifteen minutes, the bulk of which is rendering time and image generation. The total cost per video, including API calls for text, audio, keywords, and visuals, is roughly three dollars. Including the API calls to generate the avatar adds roughly another three dollars.

Learnings

So what did I learn?

  1. Large language models can automate a surprising amount of this workflow, but they still tend to fall short of 100 percent reliability. For example, they struggle to hit precise word counts, produce perfectly timed subtitles, or generate usable diagrams with correct labels. If your requirements are loose and you’re okay with “pretty good,” the results are usable. But if you need perfect quality, some manual oversight is still required.
  2. Decomposing the problem made all the difference. Trying to prompt an LLM to generate a complete educational video script, visual plan, and diagram in one go tends to result in generic outputs. By breaking it down into subtasks (writing a script, extracting keywords, generating diagrams), we could apply targeted prompting and quality control to each part. The results were significantly better and more modular.
  3. Context matters. A vague or poorly defined learning objective leads to a weak script, which then cascades into irrelevant keywords, poor visuals, and a confusing video. The more specific the input, the stronger the overall outcome.

If you're exploring similar use cases or want to experiment with this setup in your own domain, feel free to reach out. I see possibilities to apply this in fields outside of healthcare as well, such as STEM education or corporate training, where modular, scalable video content could save time and cost.

Every second counts.
