Designing consumer AI beyond a text field
The future of consumer AI applications will be outcome-driven, not text-driven. The computing power AI has to offer extends far beyond what natural language input alone makes accessible.
It’s now been seven months since I joined Captions as a product designer. The decision to join Captions was pretty straightforward. After founding a consumer social studio, I was hoping to spend time in AI and explore what the future of human-computer interaction in an AI-powered world looks like. AI clearly has us all wondering what the world is going to look like going forward, and I believe these types of questions are best answered through design. After talking to Gaurav and Dwight, it was pretty clear that at Captions we’d be asking exactly those questions, and hopefully answering them through design, ideally in the form of the best products.
With AI taking off among consumers ever since the ChatGPT launch, the question that gets me most excited is: how can we move away from the prompt text field? Even though I never lived through the pre-PC era, the ubiquitous AI prompt-box reminds me of the command-line interface (CLI) when computers started to come around, just before personal computing became mainstream: an immense amount of (relative) computing power, hidden away behind a text field. I argue the same is true today: anyone with an internet connection has access to an (even more) insane amount of computing power, but again, hidden away behind a simple text field.
Of course, ChatGPT reached a lot of people much more quickly than personal computing did, and I believe that’s largely due to:
Fast distribution, thanks to the internet and social networks,
ChatGPT being a faster and more useful search engine than Google Search for most questions, and,
Large language models taking literal natural language as input, making human-computer interaction inherently more accessible, at least for those who can formulate their prompts well.
However, I would argue that the computing power AI has to offer extends far beyond the accessibility (and in turn, adoptability) of natural language input alone. AI allows for boundless creativity, but in its current form you need to be able to express the output you’d like to see in natural language, and there are lots of potential applications for AI that don’t fit that type of user interaction or are too complex for the average consumer to express in a text prompt, similar to how the command-line interface was too complex for the average person to use on a daily basis.
There’s beauty in the historic repetition and simplicity of computing power sitting behind a simple text field, but there isn’t necessarily accessibility or utility in that for the average Joe. So when we ask, “What does the world with AI look like going forward?” I don’t think it will look like a text-driven chat interface. We may be able to get close to an answer by considering how personal computers evolved from the command-line interface to the graphical user interface.
In very simple terms, the evolution from the command-line interface to the graphical user interface (GUI) made the specific commands you could type as text accessible through visual interactions and state changes: icons, windows, menus, point-and-click interactions, buttons, loading states, and so on. By introducing these visual elements and letting users interact through pointing devices like a mouse, the GUI made computing more accessible and intuitive.
In other words, the graphical user interface was in service of making computing accessible to the average consumer. This was a shift made possible by combining new technology with user-friendly product design. By designing interfaces and human-computer interaction paradigms that were visual, interactive, and outcome-driven, the GUI made two things extremely clear:
What a computer could do for you, and,
what would happen when you operated within a computer’s operating system.
The mouse-cursor combination is a great example of this: it put humans in control and gave us a tool we all learn early in our lives, pointing. Cursors, in their most primitive form, allow us to point within the GUI and get direct, reactive feedback. This allowed the computing interface to evolve from an inaccessible (yet powerful) text field (the CLI) into a true tool that pushed the world forward in ways we now consider trivial. The GUI productized computing power, making it accessible for anyone to use.
Human-Agent Interaction?
Progressing human-computer interaction with AI beyond a text field will probably follow similar paradigms. And thanks to the progress we’ve made with mobile computing, there are many input interactions we can rely on already, such as voice (through speech-to-text) and the camera (through computer vision). But the key question remains: if a human does X, AI does what? Or in other words, what does the productization of LLMs’ computing power look like in consumer interfaces? Trying to answer that question may help us get closer to the next paradigm of human-computer interaction, or should we say, human-agent interaction?
The last few decades of technological evolution have given us tons of software and human-computer interactions, as well as vast amounts of media content. This content has been instrumental in training our AI models. I’m starting to think of these technologies and software products as a set of primitive tools or functions that we use to make our lives easier. So far, we’ve primarily trained our models on media formats such as text and imagery. I can imagine a future in which models are trained on human-computer interaction behavior, potentially even allowing agents to navigate and use our interfaces the same way we do. This raises the question: should designers keep AI agents in mind when designing their interfaces?
Whether it’s actually useful to see AI agents navigate our interfaces is yet to be determined, but the point is that AI will be able to use the same software products (and of course the underlying technologies) we’ve used over the last decades, and interact with software in the same way we, humans, do. In some instances, it may be useful to see how an AI “chooses” to operate, and in other instances, you might just want to get the outcome and move on. This threshold largely depends on how important the human input is from either a practical or emotional point of view.
Instances where it may be useful to see how the AI thinks or operates are most likely instances where the human and the AI are working together practically towards a certain outcome, and most importantly, instances where it’s important for the end-user to leave the experience with a feeling of ownership. As product designers, our role then becomes balancing the interaction between humans and AI when using those foundational tools. There might be instances where the AI could do the job, but the end-user would value the outcome more if they were able to contribute. Rex Woodbury wrote a great piece on this concept, known as the “Egg Theory”.
Captions: AI Edit
At Captions, we started playing around with the idea of an AI editing your video for you, which gave us a chance to experiment with collaborative human-agent interaction: how can AI assist in editing a video while still allowing humans to maintain control and direction?
Video editors are software tools in their truest form, with editing primitives such as trimming, adding keyframes, animation curves, etc. We learn what these tools can do for us and how to use them, and finally learn when to apply certain primitives based on our own creative direction and the outcome we’re hoping to achieve.
When you let an AI edit your video for you, you might agree with all, a few, or none of the changes it applied. What’s important is that the AI is using the same human video editing primitives we already know and have a GUI for. Whether that’s how it’s done technically or not doesn’t matter too much for the end-user experience (though it could help with the implementation). What matters is how it’s perceived by the end-user, and most importantly, whether it allows them to get to their intended outcome faster.
By letting the AI work with the same video editing primitives, we keep the editing process collaborative and allow humans to come back and tweak the work the AI did. Because the AI uses the same primitive tools that humans have used and still can use, the AI and the human are speaking the same language. This lets us work faster while keeping our creative control. With this approach, we’re not eliminating any creative human input; we’re eliminating a learning curve, and in fact widening the addressable audience for these types of creative tools.
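To make that idea concrete, here’s a minimal sketch (in TypeScript, with hypothetical names like `EditPrimitive` and `ProposedEdit` that aren’t Captions’ actual implementation) of what “AI edits expressed as human editing primitives” could look like as a data model: the AI returns a plain list of the same primitives a human could have applied, and each one stays individually reviewable.

```typescript
// Hypothetical data model: AI edits expressed as the same primitives a human uses.
// Names and shapes here are illustrative, not Captions' actual API.

type EditPrimitive =
  | { kind: "trim"; startMs: number; endMs: number }
  | { kind: "zoom"; atMs: number; scale: number }
  | { kind: "sound"; atMs: number; assetId: string }
  | { kind: "image"; atMs: number; assetId: string };

interface ProposedEdit {
  id: string;            // lets the GUI (and the user) target this specific edit
  primitive: EditPrimitive;
  accepted: boolean;     // the user can keep, tweak, or reject it later
}

// The AI returns a list of proposals rather than an opaque, flattened video,
// so the human can review each edit with the tools they already know.
function reviewProposal(
  proposals: ProposedEdit[],
  rejectedIds: Set<string>
): ProposedEdit[] {
  return proposals
    .filter((edit) => !rejectedIds.has(edit.id))
    .map((edit) => ({ ...edit, accepted: true }));
}

// Example: keep the AI's trim and zoom, but drop the background sound it added.
const aiProposal: ProposedEdit[] = [
  { id: "e1", primitive: { kind: "trim", startMs: 0, endMs: 12_000 }, accepted: false },
  { id: "e2", primitive: { kind: "zoom", atMs: 3_500, scale: 1.4 }, accepted: false },
  { id: "e3", primitive: { kind: "sound", atMs: 0, assetId: "lofi-01" }, accepted: false },
];

const finalEdits = reviewProposal(aiProposal, new Set(["e3"]));
console.log(finalEdits); // the two edits the user kept, now marked accepted
```

Because every proposed edit is addressable, the GUI can surface it in the same places a human-made edit would show up, and removing or tweaking it is no different from removing or tweaking your own work.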
For AI Edit, this meant leveraging the set of primitive editing tools we already have in the Captions mobile app and showing where the AI “decided” to apply them. That meant not just showing our users where the actual primitives are being applied as the AI does its work (video above), which of course looks cool, but also pointing them to the respective editing primitives in the action bar of the app, making it easy to quickly remove a Zoom, Sound, or Image that was added.
Showing how the AI uses those same primitives allows our users to more easily understand what the AI is doing and, more importantly, how to undo or tweak it. It also helps new users understand how certain video editing primitives affect the final outcome, making the video editing process faster and more collaborative with AI, and our overall product more accessible to new users.
Our work at Captions is hopefully just one step in the direction of AI working collaboratively with creative professionals and consumers to build the best AI-powered products possible. We don’t believe AI will replace “creative professions that never should have existed in the first place”; our intent and approach come down to enabling creatives rather than disrupting them.
Thank you to Gaurav, Dwight, Dan, Callie, Jordan, Tyler, Gaby, Shreyas, Ashley, and Grace for reviewing and chatting about these topics.
PS: If you’re a product designer and you enjoyed reading about this type of work, we’re hiring. Apply here.