Musing 1: Exploring Prompt Engineering Practices in the Enterprise
arXiv paper out of IBM Research, March 15, 2024
This paper just came out of IBM Research on the preprint server arXiv and is an interesting exploration of ‘prompt engineering’, a practice that is currently hot in industry and academia alike. For all of us who’ve played with large language models (LLMs) like ChatGPT, or generative image models like DALL·E, we know how big a difference prompts (still) make to the quality of the output. Prompt engineers are in demand at the time of writing, and lots of people have lots of things to say about prompt engineering practices. So it’s interesting to hear it straight from the horse’s mouth: an actual large company that’s applying it in practice.
In some sense, the paper is not surprising in its overall message. It emphasizes that effective prompt creation requires skill, knowledge, and substantial iteration to guide the model towards accomplishing specific goals. The part about “iteration” should not be lost: as the saying goes, “if at first you don’t succeed, try, try again!” The researchers hypothesized that analyzing users’ prompt editing behaviors could provide insights into their understanding of LLMs and the types of support needed for more efficient prompt engineering. The main contributions of the two researchers (at IBM) include:
Large-scale Analysis of Prompt Editing Practices: They conduct a comprehensive analysis of how enterprise practitioners edit and refine prompts, categorizing the parts of prompts that users iterate on and the types of changes they make. This gives us more insight into the iterative process of prompt engineering.
Human-Centered Perspective on Prompting Practices: They take a human-centered approach to understanding prompting practices, investigating the challenges non-experts face in prompt engineering, such as trial and error, the tendency to generalize from individual instances, and the expectation that LLMs will act like humans.
Understanding of Prompt Components and Edits: They identify the most commonly edited components of prompts, such as context and task instructions, and the most frequent types of edits, like modifications that maintain the original meaning. This understanding could help in developing better prompt engineering strategies.
Design Implications and Future Directions: Based on the findings, they discuss design implications for supporting prompt engineering practices. One of their suggestions is that more tools and resources need to be developed to assist users in creating more effective prompts and in navigating the complexities of LLMs. This makes it an interesting and exciting area for future research, especially in enterprise settings.
Okay, so what did the authors find experimentally? This is always the most important part of papers like these because so much of the work is in actually conducting experiments and reporting the findings. I list some of the important highlights here:
The study found that prompt editing sessions were often long, with a mean duration of 43.4 minutes and a median of 39 minutes, indicating that users spend a significant amount of time refining their prompts. This highlights the complex nature of prompt engineering and the effort required to achieve desired outputs. Of course, I should add here that session length will depend on what you’re trying to achieve in the first place; since this is an enterprise setting, the use-cases and sessions are non-trivial. All of this is to say: don’t underestimate, or get frustrated by, how much time you have to spend getting an LLM to give you the expected or ‘right’ output.
The authors found that the most frequently edited component of prompts was the context, followed by task instructions and labels. This suggests that users focus on refining the background information and the specific tasks they want the LLM to perform, likely to improve the accuracy and relevance of the model's responses.
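To make those component names concrete, here is a minimal, purely illustrative Python sketch of a prompt broken into the parts the authors track. None of this is from the paper; the field names, example text, and label set are my own assumptions about what a simple enterprise classification prompt might look like.

```python
# Purely illustrative sketch (not from the paper): a prompt split into the
# components discussed in the study. All names and text here are invented.
from dataclasses import dataclass


@dataclass
class Prompt:
    context: str           # background information the model should rely on
    task_instruction: str  # what we want the model to do
    labels: str            # e.g. the allowed label set or few-shot examples

    def render(self) -> str:
        # Concatenate the components into the text actually sent to the LLM.
        return f"{self.context}\n\n{self.task_instruction}\n\nLabels: {self.labels}"


prompt = Prompt(
    context="You are reviewing customer support tickets for a cloud product.",
    task_instruction="Classify the sentiment of the ticket below.",
    labels="positive, neutral, negative",
)
print(prompt.render())
```

In this framing, “editing the context” just means rewriting the background text while the task instruction and labels stay put, which is why tracking edits per component is informative.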
Perhaps unsurprisingly, but still good to know, the most common type of edit was a modification, where the meaning of the prompt remained the same but the wording or structure changed. This finding indicates that users fine-tune their prompts to better communicate their intentions to the LLM without altering the overall objective.
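Again purely as a hypothetical illustration (the strings below are invented, not taken from the study’s data), a “modification” edit might look like the following: the instruction is reworded, but the intent is unchanged. The difflib diff is just a crude way to visualize what changed between iterations.

```python
import difflib

# Hypothetical before/after versions of a task instruction: the wording and
# structure change, but the meaning stays the same (a "modification" edit).
before = "Classify the sentiment of the ticket below."
after = "Read the ticket below and decide whether its sentiment is positive, neutral, or negative."

# Print a unified diff to see exactly what was reworded between iterations.
for line in difflib.unified_diff([before], [after], fromfile="v1", tofile="v2", lineterm=""):
    print(line)
```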
Non-experts did face challenges when engaging in prompt engineering: they relied on trial and error and struggled to effectively communicate their needs to the LLM. Purely my two cents: I am not personally convinced that this is a ‘technological’ problem so much as an ‘articulation’ problem, where the intent is not clear in the users’ minds to begin with. However, there is no doubt that some kinds of prompting strategies work better than others, so I’m also not surprised that ‘expert’ prompt engineers are currently in high demand.
Some of my non-technical musings on the paper:
The paper is easy to read and practical (and not long): if you’re in a company, you may or may not get a lot out of it, but you will likely come away with some ideas. If you’re planning to conduct your own prompt engineering study (highly recommended), this paper is a must-read. It could save you a lot of time and energy!
The paper is on arXiv, so it may still undergo revisions. But the authors are good researchers, and I have no doubt that they wouldn’t put something out there unless it met a minimum quality standard. So that’s yet another reason to read it.
All of that being said, this is a research paper, so it’s not light bedtime reading (unless you’re a PhD student in this area). And not everything in the paper is equally relevant or high-quality; some parts may feel abstract.
If you’ve read the paper and have thoughts on anything in this post, please comment! I hope that each post can be a living document where we hear from others. I also want to put out a disclaimer that any errors above are most likely my own, so please do point out anything you find problematic.