Gemini in Java with Vertex AI and LangChain4j

1. Introduction

This codelab focuses on the Gemini Large Language Model (LLM), hosted on Vertex AI on Google Cloud. Vertex AI is a platform that encompasses all the machine learning products, services, and models on Google Cloud.

You will use Java to interact with the Gemini API using the LangChain4j framework. You'll go through concrete examples to take advantage of the LLM for question answering, idea generation, entity and structured content extraction, retrieval augmented generation, and function calling.

What is Generative AI?

Generative AI refers to the use of artificial intelligence to create new content, such as text, images, music, audio, and videos.

Generative AI is powered by large language models (LLMs) that can multi-task and perform out-of-the-box tasks such as summarization, Q&A, classification, and more. With minimal training, foundational models can be adapted for targeted use cases with very little example data.

How does Generative AI work?

Generative AI works by using a Machine Learning (ML) model to learn the patterns and relationships in a dataset of human-created content. It then uses the learned patterns to generate new content.

The most common way to train a generative AI model is to use supervised learning. The model is given a set of human-created content and corresponding labels. It then learns to generate content that is similar to the human-created content.

What are common Generative AI applications?

Generative AI can be used to:

  • Improve customer interactions through enhanced chat and search experiences.
  • Explore vast amounts of unstructured data through conversational interfaces and summarizations.
  • Assist with repetitive tasks like replying to requests for proposals, localizing marketing content in different languages, checking customer contracts for compliance, and more.

What Generative AI offerings does Google Cloud have?

With Vertex AI, you can interact with, customize, and embed foundation models into your applications with little to no ML expertise. You can access foundation models on Model Garden, tune models via a simple UI on Vertex AI Studio, or use models in a data science notebook.

Vertex AI Search and Conversation offers developers the fastest way to build generative AI powered search engines and chatbots.

Powered by Gemini, Gemini for Google Cloud is an AI-powered collaborator available across Google Cloud and IDEs to help you get more done, faster. Gemini Code Assist provides code completion, code generation, code explanations, and lets you chat with it to ask technical questions.

What is Gemini?

Gemini is a family of generative AI models developed by Google DeepMind that is designed for multimodal use cases. Multimodal means it can process and generate different kinds of content such as text, code, images, and audio.


Gemini comes in three variations:

  • Gemini Ultra: The largest, most capable version for complex tasks.
  • Gemini Pro: Mid-sized, optimized for scaling across various tasks.
  • Gemini Nano: The most efficient, designed for on-device tasks.

Key Features:

  • Multimodality: Gemini's ability to understand and handle multiple information formats is a significant step beyond traditional text-only language models.
  • Performance: Gemini Ultra outperforms the current state-of-the-art on many benchmarks and was the first model to surpass human experts on the challenging MMLU (Massive Multitask Language Understanding) benchmark.
  • Flexibility: The different Gemini sizes make it adaptable for various use cases, from large-scale research to deployment on mobile devices.

How can you interact with Gemini on Vertex AI from Java?

You have two options:

  1. The official Vertex AI Java API for Gemini library.
  2. LangChain4j framework.

In this codelab, you will use the LangChain4j framework.

What is the LangChain4j framework?

The LangChain4j framework is an open source library for integrating LLMs in your Java applications, by orchestrating various components, such as the LLM itself, but also other tools like vector databases (for semantic searches), document loaders and splitters (to analyze documents and learn from them), output parsers, and more.

The project was inspired by the LangChain Python project, but with the goal of serving Java developers.


What you'll learn

  • How to set up a Java project to use Gemini and LangChain4j
  • How to send your first prompt to Gemini programmatically
  • How to stream responses from Gemini
  • How to create a conversation between a user and Gemini
  • How to use Gemini in a multimodal context by sending both text and images
  • How to extract useful structured information from unstructured content
  • How to manipulate prompt templates
  • How to do text classification such as sentiment analysis
  • How to chat with your own documents (Retrieval Augmented Generation)
  • How to extend your chatbots with function calling

What you'll need

  • Knowledge of the Java programming language
  • A Google Cloud project
  • A browser, such as Chrome or Firefox

2. Setup and requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.


  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
  • The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.


If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue.


It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Preparing your development environment

In this codelab, you're going to use the Cloud Shell terminal and Cloud Shell editor to develop your Java programs.

Enable Vertex AI APIs

In the Google Cloud console, make sure your project name is displayed at the top of the page. If it's not, click Select a project to open the Project Selector, and select your intended project.

You can enable Vertex AI APIs either from the Vertex AI section of the Google Cloud console or from the Cloud Shell terminal.

To enable them from the Google Cloud console, first go to the Vertex AI section of the Google Cloud console menu:


Click Enable All Recommended APIs in the Vertex AI dashboard.

This will enable several APIs, but the most important one for the codelab is aiplatform.googleapis.com.

Alternatively, you can also enable this API from the Cloud Shell terminal with the following command:

gcloud services enable aiplatform.googleapis.com

Clone the GitHub repository

In the Cloud Shell terminal, clone the repository for this codelab:

git clone https://github.com/glaforge/gemini-workshop-for-java-developers.git

To check that the project is ready to run, you can try running the "Hello World" program.

Make sure you're at the top level folder:

cd gemini-workshop-for-java-developers/ 

Create the Gradle wrapper:

gradle wrapper

Run with gradlew:

./gradlew run

You should see the following output:

..
> Task :app:run
Hello World!

Open and set up the Cloud Editor

Open the code with the Cloud Code Editor from Cloud Shell:


In the Cloud Code Editor, open the codelab source folder by selecting File -> Open Folder and pointing to the codelab source folder (e.g. /home/username/gemini-workshop-for-java-developers/).

Setup environment variables

Open a new terminal in Cloud Code Editor by selecting Terminal -> New Terminal. Set up two environment variables required for running the code examples:

  • PROJECT_ID — Your Google Cloud project ID
  • LOCATION — The region where the Gemini model is deployed

Export the variables as follows:

export PROJECT_ID=$(gcloud config get-value project)
export LOCATION=us-central1

Install Gradle for Java

To get the Cloud Code Editor working properly with Gradle, install the Gradle for Java extension.

First, go to the Java Projects section and press the plus sign:


Select Gradle for Java:


Select the Install Pre-Release version:


Once installed, you should see the Disable and the Uninstall buttons:


Finally, clean the workspace to have the new settings applied:


This will ask you to reload and delete the workspace. Go ahead and choose Reload and delete:


If you open one of the files, for example App.java, you should now see the editor working correctly with syntax highlighting:


You're now ready to run some samples against Gemini!

4. First call to the Gemini model

Now that the project is properly set up, it is time to call the Gemini API.

Take a look at QA.java in the app/src/main/java/gemini/workshop directory:

package gemini.workshop;

import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.model.chat.ChatLanguageModel;

public class QA {
    public static void main(String[] args) {
        ChatLanguageModel model = VertexAiGeminiChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro")
            .build();

        System.out.println(model.generate("Why is the sky blue?"));
    }
}

In this first example, you need to import the VertexAiGeminiChatModel class, which implements the ChatLanguageModel interface.

In the main method, you configure the chat language model by using the VertexAiGeminiChatModel builder and specifying:

  • Project
  • Location
  • Model name (gemini-1.0-pro).

Now that the language model is ready, you can call the generate() method and pass your prompt (your question or instructions) to the LLM. Here, you ask a simple question about what makes the sky blue.

Feel free to change this prompt to try different questions or tasks.

Run the sample at the source code root folder:

./gradlew run -q -DjavaMainClass=gemini.workshop.QA

You should see an output similar to this one:

The sky appears blue because of a phenomenon called Rayleigh scattering.
When sunlight enters the atmosphere, it is made up of a mixture of
different wavelengths of light, each with a different color. The
different wavelengths of light interact with the molecules and particles
in the atmosphere in different ways.

The shorter wavelengths of light, such as those corresponding to blue
and violet light, are more likely to be scattered in all directions by
these particles than the longer wavelengths of light, such as those
corresponding to red and orange light. This is because the shorter
wavelengths of light have a smaller wavelength and are able to bend
around the particles more easily.

As a result of Rayleigh scattering, the blue light from the sun is
scattered in all directions, and it is this scattered blue light that we
see when we look up at the sky. The blue light from the sun is not
actually scattered in a single direction, so the color of the sky can
vary depending on the position of the sun in the sky and the amount of
dust and water droplets in the atmosphere.

Congratulations, you made your first call to Gemini!

Streaming response

Did you notice that the response was given in one go, after a few seconds? It's also possible to get the response progressively, thanks to the streaming response variant. With streaming, the model returns the response piece by piece, as it becomes available.

In this codelab, we'll stick with the non-streaming response but let's have a look at the streaming response to see how it can be done.

In StreamQA.java in the app/src/main/java/gemini/workshop directory you can see the streaming response in action:

package gemini.workshop;

import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiStreamingChatModel;
import dev.langchain4j.model.StreamingResponseHandler;

public class StreamQA {
    public static void main(String[] args) {
        StreamingChatLanguageModel model = VertexAiGeminiStreamingChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro")
            .build();
        
        model.generate("Why is the sky blue?", new StreamingResponseHandler<>() {
            @Override
            public void onNext(String text) {
                System.out.println(text);
            }

            @Override
            public void onError(Throwable error) {
                error.printStackTrace();
            }
        });
    }
}

This time, we import the streaming variant, VertexAiGeminiStreamingChatModel, which implements the StreamingChatLanguageModel interface. You'll also need a StreamingResponseHandler.

The signature of the generate() method is also a little bit different. Instead of returning a string, the return type is void. In addition to the prompt, you have to pass a streaming response handler. Here, you implement the interface by creating an anonymous inner class, with two methods onNext(String text) and onError(Throwable error). The former is called each time a new piece of the response is available, while the latter is called only if an error occurs.

Run:

./gradlew run -q -DjavaMainClass=gemini.workshop.StreamQA

You will get a similar answer to the previous class, but this time, you will notice that the answer appears progressively in your shell, rather than waiting for the display of the full answer.

Extra configuration

For configuration, we only defined the project, the location, and the model name, but there are other parameters you can specify for the model (illustrated in the sketch after this list):

  • temperature(Float temp) — to define how creative you want the response to be (0 being low creativity and often more factual, while 1 is for more creative outputs)
  • topP(Float topP) — to select the possible words whose total probability adds up to that floating point number (between 0 and 1)
  • topK(Integer topK) — to randomly select a word out of a maximum number of probable words for the text completion (from 1 to 40)
  • maxOutputTokens(Integer max) — to specify the maximum length of the answer given by the model (generally, 4 tokens represent roughly 3 words)
  • maxRetries(Integer retries) — in case you're running past the requests-per-time quota, or the platform is facing a technical issue, you can have the model retry the call a given number of times
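
For illustration, here is a minimal sketch combining these options on the same builder used earlier (the values are arbitrary examples, not recommendations from the codelab):

ChatLanguageModel model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.0-pro")
    .temperature(0.2f)      // low creativity, more factual answers
    .topP(0.95f)            // nucleus sampling threshold
    .topK(40)               // sample among the 40 most probable tokens
    .maxOutputTokens(500)   // cap the length of the response
    .maxRetries(3)          // retry up to 3 times on quota or transient errors
    .build();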

So far, you asked a single question to Gemini, but you can also have a multi-turn conversation. That's what you'll explore in the next section.

5. Chat with Gemini

In the previous step, you asked a single question. It's now time to have a real conversation between a user and the LLM. Each question and answer can build upon the previous ones to form a real discussion.

Take a look at Conversation.java in the app/src/main/java/gemini/workshop folder:

package gemini.workshop;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.service.AiServices;

import java.util.List;

public class Conversation {
    public static void main(String[] args) {
        ChatLanguageModel model = VertexAiGeminiChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro")
            .build();

        MessageWindowChatMemory chatMemory = MessageWindowChatMemory.builder()
            .maxMessages(20)
            .build();

        interface ConversationService {
            String chat(String message);
        }

        ConversationService conversation =
            AiServices.builder(ConversationService.class)
                .chatLanguageModel(model)
                .chatMemory(chatMemory)
                .build();

        List.of(
            "Hello!",
            "What is the country where the Eiffel tower is situated?",
            "How many inhabitants are there in that country?"
        ).forEach( message -> {
            System.out.println("\nUser: " + message);
            System.out.println("Gemini: " + conversation.chat(message));
        });
    }
}

A couple of interesting new imports in this class:

  • MessageWindowChatMemory — a class that will help handle the multi-turn aspect of the conversation, and keep in local memory the previous questions and answers
  • AiServices — a class that will tie together the chat model and the chat memory

In the main method, you're going to set up the model, the chat memory, and the conversational chain. The model is configured as usual with the project, location, and model name information.

For the chat memory, we use MessageWindowChatMemory's builder to create a memory that keeps the last 20 messages exchanged. It's a sliding window over the conversation whose context is kept locally in our Java class client.

You then create the AI service that binds the chat model with the chat memory.

Notice how the AI service makes use of a custom ConversationService interface we've defined, that LangChain4j implements, and that takes a String query and returns a String response.

Now, it's time to have a conversation with Gemini. First, a simple greeting is sent, then a first question about the Eiffel tower to know in which country it can be found. Notice that the last question relates to the answer to the previous one, as you wonder how many inhabitants live in the country where the Eiffel tower is situated, without explicitly mentioning the country given in the previous answer. It shows that past questions and answers are sent with every prompt.

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.Conversation

You should see three answers similar to these ones:

User: Hello!
Gemini: Hi there! How can I assist you today?

User: What is the country where the Eiffel tower is situated?
Gemini: France

User: How many inhabitants are there in that country?
Gemini: As of 2023, the population of France is estimated to be around 67.8 million.

You can ask single-turn questions or have multi-turn conversations with Gemini but so far, the input has been only text. What about images? Let's explore images in the next step.

6. Multimodality with Gemini

Gemini is a multimodal model. Not only does it accept text as input, but it also accepts images and even videos. In this section, you'll see a use case for mixing text and images.

Do you think Gemini will recognise this cat?


Picture of a cat in the snow, taken from Wikipedia: https://upload.wikimedia.org/wikipedia/commons/b/b6/Felis_catus-cat_on_snow.jpg

Take a look at Multimodal.java in the app/src/main/java/gemini/workshop directory:

package gemini.workshop;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.output.Response;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;

public class Multimodal {

    static final String CAT_IMAGE_URL =
        "https://upload.wikimedia.org/wikipedia/" +
        "commons/b/b6/Felis_catus-cat_on_snow.jpg";


    public static void main(String[] args) {
        ChatLanguageModel model = VertexAiGeminiChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro-vision")
            .build();

        UserMessage userMessage = UserMessage.from(
            ImageContent.from(CAT_IMAGE_URL),
            TextContent.from("Describe the picture")
        );

        Response<AiMessage> response = model.generate(userMessage);

        System.out.println(response.content().text());
    }
}

In the imports, notice we distinguish between different kinds of messages and contents. A UserMessage can contain both a TextContent and an ImageContent object. This is multimodality at play: mixing text and images. The model sends back a Response which contains an AiMessage.

You then retrieve the AiMessage from the response via content(), and then the text of the message thanks to text().

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.Multimodal

The name of the picture certainly gave you a hint of what the picture contains, but Gemini's output is similar to the following:

A cat with brown fur is walking in the snow. The cat has a white patch of fur on its chest and white paws. The cat is looking at the camera.

Mixing images and text prompts opens up interesting use cases. You can create applications that can:

  • Recognize text in pictures (see the sketch after this list).
  • Check if an image is safe to display.
  • Create image captions.
  • Search through a database of images with plain text descriptions.
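
As a hedged sketch of the first use case, reusing the Multimodal example above, recognizing text in a picture is mostly a matter of changing the text part of the user message (the image URL below is a hypothetical placeholder, not part of the codelab):

UserMessage ocrMessage = UserMessage.from(
    ImageContent.from("https://example.com/scanned-receipt.png"), // hypothetical image URL
    TextContent.from("Transcribe all the text you can read in this picture")
);

Response<AiMessage> ocrResponse = model.generate(ocrMessage);
System.out.println(ocrResponse.content().text());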

In addition to extracting information from images, you can also extract information from unstructured text. That's what you're going to learn in the next section.

7. Extract structured information from unstructured text

There are many situations where important information is given in report documents, in emails, or other long form texts in an unstructured way. Ideally, you'd like to be able to extract the key details contained in the unstructured text, in the form of structured objects. Let's see how you can do that.

Let's say you want to extract the name and age of a person, given a biography or description of that person. You can instruct the LLM to extract JSON from unstructured text with a cleverly tweaked prompt (this is commonly called "prompt engineering").

Take a look at ExtractData.java in the app/src/main/java/gemini/workshop directory:

package gemini.workshop;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.UserMessage;

public class ExtractData {

    static record Person(String name, int age) {}

    interface PersonExtractor {
        @UserMessage("""
            Extract the name and age of the person described below.
            Return a JSON document with a "name" and an "age" property, \
            following this structure: {"name": "John Doe", "age": 34}
            Return only JSON, without any markdown markup surrounding it.
            Here is the document describing the person:
            ---
            {{it}}
            ---
            JSON:
            """)
        Person extractPerson(String text);
    }

    public static void main(String[] args) {
        ChatLanguageModel model = VertexAiGeminiChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro")
            .temperature(0f)
            .topK(1)
            .build();

        PersonExtractor extractor = AiServices.create(PersonExtractor.class, model);

        Person person = extractor.extractPerson("""
            Anna is a 23 year old artist based in Brooklyn, New York. She was born and 
            raised in the suburbs of Chicago, where she developed a love for art at a 
            young age. She attended the School of the Art Institute of Chicago, where 
            she studied painting and drawing. After graduating, she moved to New York 
            City to pursue her art career. Anna's work is inspired by her personal 
            experiences and observations of the world around her. She often uses bright 
            colors and bold lines to create vibrant and energetic paintings. Her work 
            has been exhibited in galleries and museums in New York City and Chicago.    
            """
        );

        System.out.println(person.name());  // Anna
        System.out.println(person.age());   // 23
    }
}

Let's have a look at the various steps in this file:

  • A Person record is defined to represent the details describing a person (name and age).
  • The PersonExtractor interface is defined with a method that, given an unstructured text string, returns a Person instance.
  • The extractPerson() method is annotated with a @UserMessage annotation that associates a prompt with it. That's the prompt the model will use to extract the information and return the details in the form of a JSON document, which will be parsed for you and unmarshalled into a Person instance.

Now let's look at the content of the main() method:

  • The chat model is instantiated. Notice that we use a very low temperature of zero, and a topK of only one, to ensure a very deterministic answer. This also helps the model follow the instructions better. In particular, we don't want Gemini to wrap the JSON response with extra Markdown markup.
  • A PersonExtractor object is created thanks to LangChain4j's AiServices class.
  • Then, you can simply call Person person = extractor.extractPerson(...) to extract the details of the person from the unstructured text, and get back a Person instance with the name and age.

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.ExtractData

You should see the following output:

Anna
23

Yes, this is Anna and she is 23!

The benefit of this AiServices approach is that you operate with strongly typed objects. You are not interacting directly with the LLM. Instead, you are working with concrete classes, like the Person record to represent the extracted personal information, and you have a PersonExtractor object with an extractPerson() method which returns a Person instance. The notion of LLM is abstracted away, and as a Java developer, you are just manipulating normal classes and objects.

8. Structure prompts with prompt templates

When you interact with an LLM using a common set of instructions or questions, there's a part of that prompt that never changes, while other parts contain the data. For example, if you want to create recipes, you might use a prompt like "You're a talented chef, please create a recipe with the following ingredients: ...", and then you'd append the ingredients to the end of that text. That's what prompt templates are for — similar to interpolated strings in programming languages. A prompt template contains placeholders which you can replace with the right data for a particular call to the LLM.

More concretely, let's study PromptTemplate.java in the app/src/main/java/gemini/workshop directory:

package gemini.workshop;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.input.Prompt;
import dev.langchain4j.model.input.PromptTemplate;
import dev.langchain4j.model.output.Response;

import java.util.HashMap;
import java.util.Map;

public class PromptTemplate {
    public static void main(String[] args) {
        ChatLanguageModel model = VertexAiGeminiChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro")
            .maxOutputTokens(500)
            .temperature(0.8f)
            .topK(40)
            .topP(0.95f)
            .maxRetries(3)
            .build();

        PromptTemplate promptTemplate = PromptTemplate.from("""
            You're a friendly chef with a lot of cooking experience.
            Create a recipe for a {{dish}} with the following ingredients: \
            {{ingredients}}, and give it a name.
            """
        );

        Map<String, Object> variables = new HashMap<>();
        variables.put("dish", "dessert");
        variables.put("ingredients", "strawberries, chocolate, and whipped cream");

        Prompt prompt = promptTemplate.apply(variables);

        Response<AiMessage> response = model.generate(prompt.toUserMessage());

        System.out.println(response.content().text());
    }
}

As usual, you configure the VertexAiGeminiChatModel, here with a high level of creativity (a high temperature, and high topP and topK values). Then you create a PromptTemplate with its from() static method, passing the string of our prompt and using the double curly-brace placeholder variables: {{dish}} and {{ingredients}}.

You create the final prompt by calling apply() that takes a map of key/value pairs that represent the name of the placeholder and the string value to replace it with.

Lastly, you call the generate() method of the Gemini model by creating a user message from that prompt, with the prompt.toUserMessage() instruction.

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.PromptTemplate

You should see a generated output that looks similar to this one:

**Strawberry Shortcake**

Ingredients:

* 1 pint strawberries, hulled and sliced
* 1/2 cup sugar
* 1/4 cup cornstarch
* 1/4 cup water
* 1 tablespoon lemon juice
* 1/2 cup heavy cream, whipped
* 1/4 cup confectioners' sugar
* 1/4 teaspoon vanilla extract
* 6 graham cracker squares, crushed

Instructions:

1. In a medium saucepan, combine the strawberries, sugar, cornstarch, 
water, and lemon juice. Bring to a boil over medium heat, stirring 
constantly. Reduce heat and simmer for 5 minutes, or until the sauce has 
thickened.
2. Remove from heat and let cool slightly.
3. In a large bowl, combine the whipped cream, confectioners' sugar, and 
vanilla extract. Beat until soft peaks form.
4. To assemble the shortcakes, place a graham cracker square on each of 
6 dessert plates. Top with a scoop of whipped cream, then a spoonful of 
strawberry sauce. Repeat layers, ending with a graham cracker square.
5. Serve immediately.

**Tips:**

* For a more elegant presentation, you can use fresh strawberries 
instead of sliced strawberries.
* If you don't have time to make your own whipped cream, you can use 
store-bought whipped cream.

Prompt templates are a good way to have reusable and parameterizable instructions for LLM calls. You can pass different data and customize prompts for different values, for example values provided by your users.

9. Text classification with few-shot prompting

LLMs are pretty good at classifying text into different categories. You can help an LLM in that task by providing some examples of texts and their associated categories. This approach is often called few-shot prompting.

Take a look at TextClassification.java in the app/src/main/java/gemini/workshop directory, to do a particular type of text classification: sentiment analysis.

package gemini.workshop;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.input.Prompt;
import dev.langchain4j.model.input.PromptTemplate;
import dev.langchain4j.model.output.Response;

import java.util.Map;

public class TextClassification {
    public static void main(String[] args) {
        ChatLanguageModel model = VertexAiGeminiChatModel.builder()
            .project(System.getenv("PROJECT_ID"))
            .location(System.getenv("LOCATION"))
            .modelName("gemini-1.0-pro")
            .maxOutputTokens(10)
            .maxRetries(3)
            .build();

        PromptTemplate promptTemplate = PromptTemplate.from("""
            Analyze the sentiment of the text below. Respond only with one word to describe the sentiment.

            INPUT: This is fantastic news!
            OUTPUT: POSITIVE

            INPUT: Pi is roughly equal to 3.14
            OUTPUT: NEUTRAL

            INPUT: I really disliked the pizza. Who would use pineapples as a pizza topping?
            OUTPUT: NEGATIVE

            INPUT: {{text}}
            OUTPUT: 
            """);

        Prompt prompt = promptTemplate.apply(
            Map.of("text", "I love strawberries!"));

        Response<AiMessage> response = model.generate(prompt.toUserMessage());

        System.out.println(response.content().text());
    }
}

In the main() method, you create the Gemini chat model as usual, but with a small maximum output token number, as you only want a short response: the text is POSITIVE, NEGATIVE, or NEUTRAL.

Then, you create a reusable prompt template with the few-shot prompting technique, by giving the model a few examples of inputs and outputs. This also helps the model follow the desired output format: Gemini won't reply with a full-blown sentence; instead, it's instructed to reply with just one word.

You apply the variables with the apply() method, to replace the {{text}} placeholder with the real parameter ("I love strawberries!"), and turn that template into a user message with toUserMessage().

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.TextClassification

You should see a single word:

POSITIVE

Looks like loving strawberries is a positive sentiment!

10. Retrieval Augmented Generation

LLMs are trained on a large quantity of text. However, their knowledge covers only the information they have seen during training. If new information is released after the model's training cut-off date, those details won't be available to the model. Thus, the model will not be able to answer questions about information it hasn't seen.

That's why approaches like Retrieval Augmented Generation (RAG) help provide the extra information an LLM may need to fulfill its users' requests, so it can reply with information that is more current, or with private information that was not accessible at training time.

Let's come back to conversations. This time, you will be able to ask questions about your documents. You will build a chatbot that is able to retrieve relevant information from a database containing your documents split into smaller pieces ("chunks"), and that information will be used by the model to ground its answers, instead of relying solely on the knowledge contained in its training data.

In RAG, there are two phases:

  1. Ingestion phase — Documents are loaded in memory, split into smaller chunks, and vector embeddings (high-dimensional vector representations of the chunks) are calculated and stored in a vector database that is capable of doing semantic searches. This ingestion phase is normally done once, when new documents need to be added to the document corpus.


  2. Query phase — Users can now ask questions about the documents. The question will be transformed into a vector as well and compared with all the other vectors in the database. The most similar vectors are usually semantically related and are returned by the vector database. Then, the LLM is given the context of the conversation, the chunks of text that correspond to the vectors returned by the database, and it is asked to ground its answer by looking at those chunks.


Prepare your documents

For this new demo, you will ask questions about the "Attention is all you need" research paper. It describes the transformer neural network architecture, pioneered by Google, which is how all modern large language models are implemented nowadays.

You can retrieve the research paper by using the wget command to download the PDF from the internet:

wget -O /tmp/attention-is-all-you-need.pdf \
    https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Implement the chatbot

Let's explore how to build the 2-phase approach: first with the document ingestion, and then the query time when users ask questions about the document.

In this example, both phases are implemented in the same class. Normally, you'd have one application that takes care of the ingestion, and another application that offers the chatbot interface to your users.

Also, this example uses an in-memory vector database. In a real setting, the vectors would be persisted in a standalone vector database.

Document ingestion

The very first step of the document ingestion phase is to locate the PDF file we downloaded, and prepare a PdfParser to read it:

ApachePdfBoxDocumentParser pdfParser = new ApachePdfBoxDocumentParser();
Document document = pdfParser.parse(new FileInputStream(
    "/tmp/attention-is-all-you-need.pdf"));

Instead of creating the usual chat language model, you create an instance of an embedding model. This is a particular model whose role is to create vector representations of text pieces (words, sentences or even paragraphs). It returns vectors of floating point numbers, rather than returning text responses.

VertexAiEmbeddingModel embeddingModel = VertexAiEmbeddingModel.builder()
    .endpoint(System.getenv("LOCATION") + "-aiplatform.googleapis.com:443")
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .publisher("google")
    .modelName("textembedding-gecko@001")
    .maxRetries(3)
    .build();

Next, you need a few classes that work together to:

  • Load and split the PDF document into chunks.
  • Create vector embeddings for all of these chunks.

InMemoryEmbeddingStore<TextSegment> embeddingStore = 
    new InMemoryEmbeddingStore<>();

EmbeddingStoreIngestor storeIngestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(DocumentSplitters.recursive(500, 100))
    .embeddingModel(embeddingModel)
    .embeddingStore(embeddingStore)
    .build();
storeIngestor.ingest(document);

An instance of InMemoryEmbeddingStore, an in-memory vector database, is created to store the vector embeddings.

The document is split into chunks thanks to the DocumentSplitters class. It is going to split the text of the PDF file into snippets of 500 characters, with an overlap of 100 characters with the following chunk, to avoid cutting words or sentences in the middle.

The store ingestor links the document splitter, the embedding model to calculate the vectors, and the in-memory vector database. Then, the ingest() method will take care of doing the ingestion.

Now, the first phase is over, the document has been transformed into text chunks with their associated vector embeddings, and stored in the vector database.

Asking questions

It's time to get ready to ask questions! Create a chat model to start the conversation:

ChatLanguageModel model = VertexAiGeminiChatModel.builder()
        .project(System.getenv("PROJECT_ID"))
        .location(System.getenv("LOCATION"))
        .modelName("gemini-1.0-pro")
        .maxOutputTokens(1000)
        .build();

You also need a retriever class to link the vector database (in the embeddingStore variable) with the embedding model. Its job is to query the vector database by computing a vector embedding for the user's query, to find similar vectors in the database:

EmbeddingStoreContentRetriever retriever =
    new EmbeddingStoreContentRetriever(embeddingStore, embeddingModel);

Outside of the main method, create an interface that represents an LLM expert assistant, that's an interface that the AiServices class will implement for you to interact with the model:

interface LlmExpert {
    String ask(String question);
}

At this point, you can configure a new AI service:

LlmExpert expert = AiServices.builder(LlmExpert.class)
    .chatLanguageModel(model)
    .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
    .contentRetriever(retriever)
    .build();

This service binds together:

  • The chat language model that you configured earlier.
  • A chat memory to keep track of the conversation.
  • A content retriever that compares the vector embedding of the query to the vectors in the database.
  • A prompt template that explicitly tells the chat model to base its response on the provided information (i.e. the relevant excerpts of the documentation whose vector embeddings are similar to the vector of the user's question). This is wired in with the retrievalAugmentor() builder call shown in the following snippet:
.retrievalAugmentor(DefaultRetrievalAugmentor.builder()
    .contentInjector(DefaultContentInjector.builder()
        .promptTemplate(PromptTemplate.from("""
            You are an expert in large language models,\s
            you excel at explaining simply and clearly questions about LLMs.

            Here is the question: {{userMessage}}

            Answer using the following information:
            {{contents}}
            """))
        .build())
    .queryRouter(new DefaultQueryRouter(retriever))
    .build())

You're finally ready to ask your questions!

List.of(
    "What neural network architecture can be used for language models?",
    "What are the different components of a transformer neural network?",
    "What is attention in large language models?",
    "What is the name of the process that transforms text into vectors?"
).forEach(query ->
    System.out.printf("%n=== %s === %n%n %s %n%n", query, expert.ask(query)));

The full source code is in RAG.java, in the app/src/main/java/gemini/workshop directory.

Run the sample:

./gradlew -q run -DjavaMainClass=gemini.workshop.RAG

In the output, you should see answers to your questions:

=== What neural network architecture can be used for language models? === 

 Transformer architecture 


=== What are the different components of a transformer neural network? === 

 The different components of a transformer neural network are:

1. Encoder: The encoder takes the input sequence and converts it into a 
sequence of hidden states. Each hidden state represents the context of 
the corresponding input token.
2. Decoder: The decoder takes the hidden states from the encoder and 
uses them to generate the output sequence. Each output token is 
generated by attending to the hidden states and then using a 
feed-forward network to predict the token's probability distribution.
3. Attention mechanism: The attention mechanism allows the decoder to 
attend to the hidden states from the encoder when generating each output 
token. This allows the decoder to take into account the context of the 
input sequence when generating the output sequence.
4. Positional encoding: Positional encoding is a technique used to 
inject positional information into the input sequence. This is important 
because the transformer neural network does not have any inherent sense 
of the order of the tokens in the input sequence.
5. Feed-forward network: The feed-forward network is a type of neural 
network that is used to predict the probability distribution of each 
output token. The feed-forward network takes the hidden state from the 
decoder as input and outputs a vector of probabilities. 


=== What is attention in large language models? === 

Attention in large language models is a mechanism that allows the model 
to focus on specific parts of the input sequence when generating the 
output sequence. This is important because it allows the model to take 
into account the context of the input sequence when generating each output token.

Attention is implemented using a function that takes two sequences as 
input: a query sequence and a key-value sequence. The query sequence is 
typically the hidden state from the previous decoder layer, and the 
key-value sequence is typically the sequence of hidden states from the 
encoder. The attention function computes a weighted sum of the values in 
the key-value sequence, where the weights are determined by the 
similarity between the query and the keys.

The output of the attention function is a vector of context vectors, 
which are then used as input to the feed-forward network in the decoder. 
The feed-forward network then predicts the probability distribution of 
the next output token.

Attention is a powerful mechanism that allows large language models to 
generate text that is both coherent and informative. It is one of the 
key factors that has contributed to the recent success of large language 
models in a wide range of natural language processing tasks. 


=== What is the name of the process that transforms text into vectors? === 

The process of transforming text into vectors is called **word embedding**.

Word embedding is a technique used in natural language processing (NLP) 
to represent words as vectors of real numbers. Each word is assigned a 
unique vector, which captures its meaning and semantic relationships 
with other words. Word embeddings are used in a variety of NLP tasks, 
such as machine translation, text classification, and question 
answering.

There are a number of different word embedding techniques, but one of 
the most common is the **skip-gram** model. The skip-gram model is a 
neural network that is trained to predict the surrounding words of a 
given word. By learning to predict the surrounding words, the skip-gram 
model learns to capture the meaning and semantic relationships of words.

Once a word embedding model has been trained, it can be used to 
transform text into vectors. To do this, each word in the text is 
converted to its corresponding vector. The vectors for all of the words 
in the text are then concatenated to form a single vector, which 
represents the entire text.

Text vectors can be used in a variety of NLP tasks. For example, text 
vectors can be used to train machine translation models, text 
classification models, and question answering models. Text vectors can 
also be used to perform tasks such as text summarization and text 
clustering. 

11. Function calling

There are also situations where you would like an LLM to have access to external systems, like a remote web API that retrieves information or performs an action, or services that carry out some kind of computation. For example:

Remote web APIs:

  • Track and update customer orders.
  • Find or create a ticket in an issue tracker.
  • Fetch real time data like stock quotes or IoT sensor measurements.
  • Send an email.

Computation tools:

  • A calculator for more advanced math problems.
  • Code interpretation for running code when LLMs need reasoning logic.
  • Convert natural language requests into SQL queries so that an LLM can query a database.

Function calling is the ability for the model to request one or more function calls to be made on its behalf, so it can properly answer a user's prompt with fresher data.

Given a particular prompt from a user, and the knowledge of existing functions that can be relevant to that context, an LLM can reply with a function call request. The application integrating the LLM can then call the function, reply back to the LLM with the response, and the LLM then interprets that response and replies with a textual answer.

Four steps of function calling

Let's have a look at an example of function calling: getting information about the weather forecast.

If you ask Gemini or any other LLM about the weather in Paris, it would reply by saying that it has no information about the weather forecast. If you want the LLM to have real-time access to the weather data, you need to define some functions it can use.


1️⃣ First, a user asks about the weather in Paris. The chatbot app knows there are one or more functions at its disposal to help the LLM fulfill the query. The chatbot sends both the initial prompt and the list of functions that can be called. Here, that's a function called getWeather(), which takes a string parameter for the location.


As the LLM doesn't know about weather forecasts, instead of replying via text, it sends back a function execution request. The chatbot must call the getWeather() function with "Paris" as the location parameter.


2️⃣ The chatbot invokes that function on behalf of the LLM and retrieves the function response. Here, we imagine that the response is {"forecast": "sunny"}.


3️⃣ The chatbot app sends the JSON response back to the LLM.


4️⃣ The LLM looks at the JSON response, interprets that information, and eventually replies back with the text that the weather is sunny in Paris.

Each step as code

First, you'll configure the Gemini model as usual:

ChatLanguageModel model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.0-pro")
    .maxOutputTokens(100)
    .build();

You specify a tool specification that describes the function that can be called:

ToolSpecification weatherToolSpec = ToolSpecification.builder()
    .name("getWeatherForecast")
    .description("Get the weather forecast for a location")
    .addParameter("location", JsonSchemaProperty.STRING,
        JsonSchemaProperty.description("the location to get the weather forecast for"))
    .build();

The name of the function is defined, as well as the name and type of the parameter, but notice that both the function and the parameters are given descriptions. Descriptions are very important and help the LLM really understand what a function can do, and thus judge whether this function needs to be called in the context of the conversation.

Let's start step #1, by sending the initial question about the weather in Paris:

List<ChatMessage> allMessages = new ArrayList<>();

// 1) Ask the question about the weather
UserMessage weatherQuestion = UserMessage.from("What is the weather in Paris?");
allMessages.add(weatherQuestion);

In step #2, we pass the tool we'd like the model to use, and the model replies with a tool execution request:

// 2) The model replies with a function call request
Response<AiMessage> messageResponse = model.generate(allMessages, weatherToolSpec);
ToolExecutionRequest toolExecutionRequest = messageResponse.content().toolExecutionRequests().getFirst();
System.out.println("Tool execution request: " + toolExecutionRequest);
allMessages.add(messageResponse.content());

Step #3. At this point, we know what function the LLM would like us to call. In the code, we're not making a real call to an external API, we just return a hypothetical weather forecast directly:

// 3) We send back the result of the function call
ToolExecutionResultMessage toolExecResMsg = ToolExecutionResultMessage.from(toolExecutionRequest,
    "{\"location\":\"Paris\",\"forecast\":\"sunny\", \"temperature\": 20}");
allMessages.add(toolExecResMsg);

And in step #4, the LLM learns about the function execution result, and can then synthesize a textual response:

// 4) The model answers with a sentence describing the weather
Response<AiMessage> weatherResponse = model.generate(allMessages);
System.out.println("Answer: " + weatherResponse.content().text());

The output is:

Tool execution request: ToolExecutionRequest { id = null, name = "getWeatherForecast", arguments = "{"location":"Paris"}" }
Answer:  The weather in Paris is sunny with a temperature of 20 degrees Celsius.

You can see in the output above the tool execution request, as well as the answer.

The full source code is in FunctionCalling.java, in the app/src/main/java/gemini/workshop directory.

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.FunctionCalling

You should see an output similar to the following:

Tool execution request: ToolExecutionRequest { id = null, name = "getWeatherForecast", arguments = "{"location":"Paris"}" }
Answer:  The weather in Paris is sunny with a temperature of 20 degrees Celsius.

12. LangChain4j handles function calling

In the previous step, you saw how the normal text question/answer and function request/response interactions are interleaved, and in between, you provided the requested function response directly, without calling a real function.

However, LangChain4j also offers a higher-level abstraction that can handle the function calls transparently for you, while handling the conversation as usual.

Let's have a look at FunctionCallingAssistant.java, piece by piece.

First, you create a record that will represent the function's response data structure:

record WeatherForecast(String location, String forecast, int temperature) {}

The response contains information about the location, the forecast, and the temperature.

Then you create a class that contains the actual function you want to make available to the model:

static class WeatherForecastService {
    @Tool("Get the weather forecast for a location")
    WeatherForecast getForecast(@P("Location to get the forecast for") String location) {
        if (location.equals("Paris")) {
            return new WeatherForecast("Paris", "Sunny", 20);
        } else if (location.equals("London")) {
            return new WeatherForecast("London", "Rainy", 15);
        } else {
            return new WeatherForecast("Unknown", "Unknown", 0);
        }
    }
}

Note that this class contains a single function, but it is annotated with the @Tool annotation which corresponds to the description of the function the model can request to call.

The parameter of the function (a single one here) is also annotated, but with the short @P annotation, which gives a description of the parameter. You could add as many functions as you want, to make them available to the model, for more complex scenarios.

In this class, you return some canned responses, but if you wanted to call a real external weather forecast service, it is in the body of that method that you would make the call to that service, as in the hedged sketch below.
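
Here is what such a call could look like, as a sketch only: the endpoint URL, the response format, and the parseForecast() helper are all assumptions for illustration (you would also need the java.net.http and related imports):

@Tool("Get the weather forecast for a location")
WeatherForecast getForecast(@P("Location to get the forecast for") String location) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://weather.example.com/forecast?location=" +
            URLEncoder.encode(location, StandardCharsets.UTF_8)))
        .build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    // Map the JSON body to a WeatherForecast with your JSON library of choice (omitted here)
    return parseForecast(response.body()); // parseForecast() is a hypothetical helper
}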

As we saw when you created a ToolSpecification in the previous approach, it's important to document what a function does, and describe what the parameters correspond to. This helps the model understand how and when this function can be used.

Next, LangChain4j lets you provide an interface that corresponds to the contract you want to use to interact with the model. Here, it's a simple interface that takes in a string representing the user message, and returns a string corresponding to the model's response:

interface WeatherAssistant {
    String chat(String userMessage);
}

It is also possible to use more complex signatures that involve LangChain4j's UserMessage (for a user message) or AiMessage (for a model response), or even a TokenStream, if you want to handle more advanced situations, as those more complicated objects also contain extra information such as the number of tokens consumed. But for simplicity's sake, we'll just take a String as input and return a String as output.
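
As an illustration of such a variant (a sketch, not part of the codelab samples), a streaming assistant interface could return a TokenStream instead of a String; note that this would require configuring AiServices with a streaming chat model rather than the regular one:

interface StreamingWeatherAssistant {
    // dev.langchain4j.service.TokenStream lets you register handlers
    // that are called for each new token as it arrives
    TokenStream chat(String userMessage);
}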

Let's finish with the main() method that ties all the pieces together:

public static void main(String[] args) {
    ChatLanguageModel model = VertexAiGeminiChatModel.builder()
        .project(System.getenv("PROJECT_ID"))
        .location(System.getenv("LOCATION"))
        .modelName("gemini-1.0-pro")
        .maxOutputTokens(100)
        .build();

    WeatherForecastService weatherForecastService = new WeatherForecastService();

    WeatherAssistant assistant = AiServices.builder(WeatherAssistant.class)
        .chatLanguageModel(model)
        .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
        .tools(weatherForecastService)
        .build();

    System.out.println(assistant.chat("What is the weather in Paris?"));
}

As usual, you configure the Gemini chat model. Then you instantiate your weather forecast service that contains the "function" that the model will request us to call.

Now, you use the AiServices class again to bind the chat model, the chat memory, and the tool (i.e. the weather forecast service with its function). AiServices returns an object that implements the WeatherAssistant interface you defined. The only thing left is to call the chat() method of that assistant. When invoking it, you will only see the text responses; the function call requests and the function call responses are not visible to the developer, and those requests are handled automatically and transparently. If Gemini thinks a function should be called, it'll reply with the function call request, and LangChain4j will take care of calling the local function on your behalf.

Run the sample:

./gradlew run -q -DjavaMainClass=gemini.workshop.FunctionCallingAssistant

You should see an output similar to the following:

OK. The weather in Paris is sunny with a temperature of 20 degrees.

This was an example of a single function, but you can also have multiple functions and let LangChain4j handle multiple function calls.

You can take a look at MultiFunctionCalling.java for an example that chains multiple functions to find the price of a stock, convert it into another currency, and apply a percentage to the result. You can run it as follows:

./gradlew run -q -DjavaMainClass=gemini.workshop.MultiFunctionCalling

And you should see the multiple functions called:

getStockPrice(symbol = AAPL) == 172.8022224055534
convertCurrency(fromCurrency = USD, toCurrency = EUR, amount = 172.8022224055534) == 160.70606683716468
applyPercentage(amount = 160.70606683716468, percentage = 10.0) == 16.07060668371647
10% of the AAPL stock price converted from USD to EUR is 16.07060668371647 EUR.
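
To give a rough idea of what powers that output, the tool methods could be declared along these lines (a sketch inferred from the method names and arguments printed above; the bodies are placeholder values, not the repository's actual implementation):

static class StockTools {
    @Tool("Get the current price of a stock in USD")
    double getStockPrice(@P("Stock symbol") String symbol) {
        return 172.80; // placeholder value; a real implementation would call a stock API
    }

    @Tool("Convert an amount from one currency to another")
    double convertCurrency(@P("Source currency") String fromCurrency,
                           @P("Target currency") String toCurrency,
                           @P("Amount to convert") double amount) {
        return amount * 0.93; // placeholder exchange rate
    }

    @Tool("Apply a percentage to an amount")
    double applyPercentage(@P("Amount") double amount,
                           @P("Percentage to apply") double percentage) {
        return amount * percentage / 100;
    }
}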

Towards Agents

Function calling is a great extension mechanism for large language models like Gemini. It enables you to build more complex systems, often called "agents" or "AI assistants". These agents can interact with the external world via external APIs and with services that can have side effects on the external environment (like sending emails, creating tickets, etc.)

When creating such powerful agents, you should do so responsibly. You should consider having a human in the loop before taking automated actions. It's important to keep safety in mind when designing LLM-powered agents that interact with the external world.

13. Congratulations

Congratulations, you've successfully built your first Generative AI chat application in Java using LangChain4j and the Gemini API! You discovered along the way that multimodal large language models are pretty powerful and capable of handling various tasks like question answering (even on your own documentation), data extraction, interacting with external APIs, and more.

What's next?

It's your turn to enhance your applications with powerful LLM integrations!

Further reading

Reference docs