
Assessor bot

A Large Language Model (LLM) experiment for providing students with feedback

12-minute read
Technologies used
  • Electron
  • LangChain
  • Ollama
  • shadcn/ui

Large Language Models (LLMs) in education

Ever since the release of OpenAI's ChatGPT, there has been a lot of controversy surrounding the use of Large Language Models (LLMs) in education.

At the Master Digital Design we ask students to reflect on their work based on a set of competencies and indicators. Since the introduction of these LLMs (ChatGPT in particular), we have seen an increase in the use of these models in the documents submitted by students.

From an assessor's perspective, the generic (and mostly mediocre) generated text was easy to spot in the submissions, so we held an intervention to have an open conversation about the use of these models.

Guidance instead of banning

As we see the value of generative technologies in the creative field and are sure they are here to stay, we prefer to guide students in the use of these technologies and allow them to develop a critical view, instead of banning them altogether1.

We hosted a full week dedicated to experimenting with the use of AI in the creative field, asking students to create a prompt-based product that uses LLM prompting techniques: the product must do only one thing, but do it well. Together we explored and reflected upon the field of generative technologies.

Portfolio checker

One of the projects which came out of this week was the Portfolio checker by Jaap Hulst, Niloo Zabardast and Elena Mihai.

Their project used a prompt to:

  • Get designers to reflect, gain insight into their competencies, and give direction
  • Take pressure off reviewing design work
  • Make the feedback loop easier and faster

See the whole pitch

The concept and design of this project were made by Jaap Hulst, Niloo Zabardast and Elena Mihai.

This article will go into the technical details of turning such a design into a working product.

For the non-technical aspects, I would like to refer you to the students themselves.

From document to feedback

Feedback takes about a minute to be generated

The game plan

As this was my first time incorporating a Large Language Model (LLM) into a product, I only had a rough idea of how to approach this.

The plan: have students upload their documents, use Retrieval-Augmented Generation (RAG) to find the relevant information, and combine the relevant data with a custom prompt to generate the feedback.

After some research I found this post by LangChain, which made me confident enough that I could build something similar and give the portfolio checker a go.

A diagram from the LangChain post showcasing how a system like this could work

I should have been less confident

While most of the application was built rather quickly, “surprisingly” enough I struggled to get the RAG properly set up and running to provide valuable feedback.

As I did not really have a clue about how to make a proper RAG implementation, I followed the documentation and came up with something like:

import { Document } from "@langchain/core/documents";
import { OllamaEmbeddings } from "@langchain/ollama";
import { MarkdownTextSplitter, RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ModelResponse } from "ollama";

// INDICATOR_DOCUMENTS (the competency/indicator texts) and DocumentMetaData
// are project-specific imports.

// Splitter for the competency and indicator reference texts
export const competencySplitter = new MarkdownTextSplitter({
  chunkSize: 500,
  chunkOverlap: 20,
});

// Splitter for the documents uploaded by the student
export const documentSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 300,
  chunkOverlap: 20,
});

export async function createVectorStore(
  model: ModelResponse,
  initialDocuments: Document[] = [],
) {
  // Chunk the grading references and attach their metadata
  const splitDocs = await competencySplitter.createDocuments(
    INDICATOR_DOCUMENTS.map((competency) => competency.text),
    INDICATOR_DOCUMENTS.map(
      (competency): DocumentMetaData => ({
        name: `${competency.competency} - ${competency.indicator}`,
        competency: competency.competency,
        indicator: competency.indicator,
        lastModified: Date.now(),
        type: "grading reference",
      }),
    ),
  );

  // Embed everything locally through Ollama and keep the store in memory
  return MemoryVectorStore.fromDocuments(
    [...splitDocs, ...initialDocuments],
    new OllamaEmbeddings({ model: model.name }),
  );
}

This vector store is then populated with the documents uploaded by the student:

const addStudentDocuments = async (files: AddDocumentInput[]) => {
  const texts = files.map(({ text }) => text);
  const meta = files.map(
    (file): DocumentMetaData => ({
      name: file.name,
      lastModified: file.lastModified,
      type: "student document",
    }),
  );
  // Chunk the uploaded documents and add them to the same vector store
  const documents = await documentSplitter.createDocuments(texts, meta);
  await vectorStore.addDocuments(documents);
};
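
The AddDocumentInput type is not shown in the article; judging from how it is used above, it boils down to a name, a last-modified timestamp, and the extracted text. A sketch of how an uploaded file could be mapped onto it (an assumption, not the app's actual upload code):

type AddDocumentInput = {
  name: string;
  lastModified: number;
  text: string;
};

// Hypothetical helper: convert a browser File into AddDocumentInput.
// Assumes plain-text or markdown uploads; PDFs would need a text-extraction step first.
async function fileToDocumentInput(file: File): Promise<AddDocumentInput> {
  return {
    name: file.name,
    lastModified: file.lastModified,
    text: await file.text(),
  };
}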

When the student asks for feedback, the application creates a Runnable for each of the indicators, using a FEEDBACK_TEMPLATE and the following prompt:

what grade ('novice', 'competent', 'proficient', or 'visionary') and feedback would you give the student for the competency ${competency} and indicator ${indicator.name}?

import { LanguageModelLike } from "@langchain/core/language_models/base";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

export async function createRunner(
  llm: LanguageModelLike,
  vectorStore: MemoryVectorStore,
) {
  // Retrieve from the store holding both grading references and student documents
  const retriever = vectorStore.asRetriever();

  const prompt = ChatPromptTemplate.fromMessages([
    ["system", FEEDBACK_TEMPLATE],
    // new MessagesPlaceholder("chat_history"),
    ["human", "{input}"],
  ]);

  // Stuff the retrieved documents into the prompt and run the model
  const questionAnswerChain = await createStuffDocumentsChain({
    llm,
    prompt,
  });

  return createRetrievalChain({
    retriever,
    combineDocsChain: questionAnswerChain,
  });
}
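
One runner is then created and invoked per indicator, roughly like this (a sketch: the names llm, competency, and indicator are assumptions based on the prompt quoted above, and the {indicator_text} placeholder in FEEDBACK_TEMPLATE needs to be supplied alongside the question):

const chain = await createRunner(llm, vectorStore);

const result = await chain.invoke({
  input: `what grade ('novice', 'competent', 'proficient', or 'visionary') and feedback would you give the student for the competency ${competency} and indicator ${indicator.name}?`,
  indicator_text: indicator.text,
});

// createRetrievalChain returns the retrieved chunks under `context`
// and the generated text under `answer`
console.log(result.answer);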

The secret sauce of the product: the prompt

export const FEEDBACK_TEMPLATE = `
# IDENTITY and PURPOSE
You are acting as a very critical assessor for a master's program in digital design.
You will be giving constructive feedback on the student's work for them to improve upon.
Your feedback will always be directed at the work presented and will refer to examples and evidence from the text.
The provided grade and feedback MUST always reflect the expectations of the indicator you are grading.
When not enough evidence is provided for an indicator, the student should receive a "novice" grade and this should be reflected in the feedback.
 
# OUTPUT
A JSON feedback that matches the following schema:
\`\`\`json
{{
  "grade": "novice" | "competent" | "proficient" | "visionary",
  "feedback": "string",
}}
\`\`\`
 
# FEEDBACK
To give proper feedback, try to refer to the student's text and provide constructive criticism. Always refer to the student's text and provide examples or evidence to support your feedback. The feedback should be clear, concise, and focused on the student's work.
 
## TONE OF VOICE
Never use text from the examples provided below directly in your feedback, use it only as a tone-of-voice reference. Always refer to the student's text. If you use any text directly from the examples, the feedback will be considered invalid.
 
- Overall, we see a lot of growth and learning in you. We enjoyed seeing a lot of making explorations in this project and using creative methods to explore ideas in a very open brief – nice!
- We believe that you have learned a lot during this year. Your explorations and visits to museums outside of the master are commendable. However, your reflection on teamwork is superficial. The answers that you gave during the interview were convincing. Overall, we think you have a grip on where you would like to go next.
- Your critical reflection on your design in comparison to other work in the same space is lacking and that’s something we expect a master-level student to be able to do with ease.
- Given the lack of a framing or debrief of the project presented it is hard to conduct appropriate research. The direction that the team took for this project seems to have taken you to an area where neither of you had any relevant knowledge and you were unable to bring the project back to an area where you could design again. Being able to do this is crucial for a designer at any level, bring the project to an area where you can design.
- Overall, we can see you are ready to start adventuring in UX and considering possible ways forward in product design. We encourage you to look at differences across design domains (e.g. “product design” and “experience design”) and explore how you can build on your prior knowledge and practice in architecture and take advantage of the other domains you have started to explore.
- Good that you have referenced some scientific articles in your research. Would like to have seen reference to other food-waste projects as part of research.
- You did not provide concrete examples of how you addressed potential unintended consequences and ensured user autonomy. When you compare your work to other work, more explicit identification of strong and weak points and how you plan to address them would provide clearer directions for future iterations.
- While the activities undertaken and their rationales are clearly listed, how they affected their work is not adequately articulated.
 
# INDICATOR GRADING
Use the following grading guide to help you give a grade and provide feedback:
{indicator_text}
`;

All of this generated quite reasonable-sounding feedback 🥳!

However, when digging deeper, this system would either:

  • not get the correct information from the competencies and indicators, because the RAG was not working as expected, and therefore give completely wrong or nonsensical feedback, or
  • hallucinate so badly that it would make up content that was never provided by the student and then give feedback on that.

Use the large context windows

I had over-engineered the system where it was not needed.

As we ask students to reflect upon their work within a set word limit, and current models have context windows well beyond 1024 tokens, there was no need to split the documents into smaller chunks.

By removing the splitting of the documents and using each full document as context, most of the hallucinations were suppressed and nearly no content was made up anymore!

Most (modern) Large Language Models are capable of handling all of the documents' content in their context window.

import { HumanMessage, SystemMessage } from "@langchain/core/messages";

// `request` and `indicatorText` come from the feedback request being handled
const chat = [
  // Fill the indicator grading guide straight into the system prompt
  new SystemMessage(
    FEEDBACK_TEMPLATE.replace("{indicator_text}", indicatorText.text),
  ),
  new HumanMessage(
    "I will now provide you with the documents to grade, each document will have a title and the content of the document:",
  ),
  // Pass every uploaded document in full, no chunking
  ...request.documents.map(
    ({ name, text }) => new HumanMessage(`\n# ${name}\n\n${text}`),
  ),
  new HumanMessage(
    `what grade ("novice", "competent", "proficient", or "visionary") and feedback would you give the student for the competency ${indicatorText.competency} and indicator ${request.indicator.name}?`,
  ),
];
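
Sending this chat to the model is then a single call, along these lines (a sketch: llm is the Ollama instance configured in the next section with format: "json", and postProcessResponse is the clean-up step shown further down):

const rawResponse = await llm.invoke(chat);

// The model is instructed to answer in JSON, so parse it and
// normalise the keys into { grade, feedback }
const feedback = postProcessResponse(JSON.parse(rawResponse));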

It is your data

Two things were important to me: 1) the students' data is not stored on any server, but only on their own device, and 2) I do not want to force students into a $200-per-month plan to get feedback on their work.

The assessor bot uses Ollama at its core to interact with the Large Language Model. Even though this is not as plug-and-play and requires the student to have a local installation of Ollama, it does give me more peace of mind.

Guiding the student towards Ollama
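
Before any feedback can be generated, the app needs to know whether Ollama is actually installed and running on the student's machine. A sketch of how such a check could look with the ollama JavaScript client (presumably where the ModelResponse type used earlier comes from); the app's actual detection logic may differ:

import ollama from "ollama";

export async function getInstalledModels() {
  try {
    // Throws if Ollama is not installed or not running locally
    const { models } = await ollama.list();
    return models; // ModelResponse[], used to let the student pick a model
  } catch {
    return null; // guide the student towards installing and starting Ollama
  }
}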

Structured output

While most models are capable of generating structured output, and according to the LangChain documentation it should be possible to generate structured output with Ollama, this interface was not available in @langchain/ollama 0.1.0, the version available when building this tool.
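
For reference, this is roughly what that interface looks like in more recent releases of @langchain/ollama (a sketch, not code from this tool; the model name is only an example):

import { ChatOllama } from "@langchain/ollama";
import { z } from "zod";

// The output schema the feedback should conform to
const feedbackSchema = z.object({
  grade: z.enum(["novice", "competent", "proficient", "visionary"]),
  feedback: z.string(),
});

const llm = new ChatOllama({ model: "llama3.1" });
// Returns a runnable that yields a typed { grade, feedback } object
const structuredLlm = llm.withStructuredOutput(feedbackSchema);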

I could, however, make the models give me back a JSON response:

import { Ollama } from "@langchain/ollama";

const llm = new Ollama({
  model: request.model.name,
  format: "json", // ask Ollama to only emit valid JSON
  temperature: 0.9,
  // ...other config
});

And do some rudimentary post-processing to get the feedback into the format required for the student:

function postProcessResponse(input: Record<string, unknown>) {
  return Object.keys(input).reduce(
    (acc, key) => {
      const value = input[key];
      const lowerKey = key.toLowerCase();
 
      if (
        [
          "grade",
          "grading",
          "score",
          "rating",
          "overall",
          "result",
          "value",
        ].includes(lowerKey)
      ) {
        switch (typeof value) {
          case "object":
            if (value === null) break;
            if ("level" in value)
              acc.grade = (value as Record<string, unknown>).level;
            if ("value" in value)
              acc.grade = (value as Record<string, unknown>).value;
            if ("grade" in value)
              acc.grade = (value as Record<string, unknown>).grade;
            break;
          case "number":
          case "string":
          default:
            acc.grade = value;
            break;
        }
 
        return acc;
      }
 
      acc[lowerKey] = value;
      return acc;
    },
    {} as Record<string, unknown>,
  );
}
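
To illustrate what this normalisation does, here is a hypothetical model response and the shape the function turns it into:

// Hypothetical model output with an inconsistent shape
const raw = {
  Grading: { level: "competent" },
  Feedback: "Your reflection refers to concrete examples, but could compare against other work.",
};

const cleaned = postProcessResponse(raw);
// => { grade: "competent", feedback: "Your reflection refers to concrete examples, but could compare against other work." }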

This would give me the required output in about 80% of the cases, which is more than enough for an experimental tool.

Overrule the design(ers)

As a small nod to the research paper “On the Dangers of Stochastic Parrots”, I designed the entity you get feedback from as a parrot. Jaap Hulst made another iteration to bring the style more in line with the Ollama llama.

A parrot head as the entity you get feedback from
An update of the parrot head design by Jaap, bringing it in line with the Ollama style