Gemini models can process videos, enabling many frontier developer use cases that would have historically required domain-specific models. Some of Gemini's vision capabilities include the ability to:
Gemini was built to be multimodal from the ground up and we continue to push the frontier of what is possible. This guide shows how to use the Gemini API to generate text responses based on video inputs.
Before calling the Gemini API, ensure you have your SDK of choice installed, and a Gemini API key configured and ready to use.
You can provide videos as input to Gemini in the following ways:
- Upload a video file using the File API before making a request to `generateContent`. Use this method for files larger than 20 MB, videos longer than approximately 1 minute, or when you want to reuse the file across multiple requests.
- Pass inline video data in the request to `generateContent`. Use this method for smaller files (<20 MB) and shorter durations.

You can use the Files API to upload a video file. Always use the Files API when the total request size (including the file, text prompt, system instructions, etc.) is larger than 20 MB, the video duration is significant, or if you intend to use the same video in multiple prompts.
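The size guidance above can be turned into a small pre-flight check. The following is a minimal sketch, not part of the API: the function name `choose_video_input_method` and the `MAX_REQUEST_BYTES` variable are hypothetical, the 20 MB threshold is interpreted as 20 * 1024 * 1024 bytes, and the check conservatively assumes the limit applies to the base64-encoded size. It does not look at video duration or at the size of the prompt itself.

```shell
# Assumption: "20 MB" is treated as 20 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES=$((20 * 1024 * 1024))

# Hypothetical helper: prints "inline" or "files_api".
#   $1 = size of the video file in bytes
#   $2 = "yes" if the file will be reused across requests (default: "no")
choose_video_input_method() {
  local num_bytes="$1"
  local reuse="${2:-no}"
  # Base64 turns every 3 input bytes into 4 output characters.
  local encoded_bytes=$(( ($num_bytes + 2) / 3 * 4 ))
  if [ "$reuse" = "yes" ] || [ "$encoded_bytes" -ge "$MAX_REQUEST_BYTES" ]; then
    echo "files_api"
  else
    echo "inline"
  fi
}

# Example usage (commented out because it needs a real file):
# choose_video_input_method "$(wc -c < "path/to/sample.mp4")"
```

Leaving some headroom below the limit is a design choice here, since the prompt and any system instructions also count toward the total request size.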
The File API accepts video file formats directly. This example uses the short NASA film "Jupiter's Great Red Spot Shrinks and Grows". Credit: Goddard Space Flight Center (GSFC)/David Ladd (2018).
"Jupiter's Great Red Spot Shrinks and Grows" is in the public domain and does not show identifiable people. ([NASA image and media usage guidelines.](https://www.n
VIDEO_PATH="path/to/sample.mp4"
MIME_TYPE=$(file -b --mime-type "${VIDEO_PATH}")
NUM_BYTES=$(wc -c < "${VIDEO_PATH}")
DISPLAY_NAME=VIDEO
tmp_header_file=upload-header.tmp
# --- 1. Start a resumable upload and capture the upload URL ---
echo "Starting file upload..."
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D ${tmp_header_file} \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{\"file\": {\"display_name\": \"${DISPLAY_NAME}\"}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# --- 2. Upload the video bytes and finalize the upload ---
echo "Uploading video data..."
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${VIDEO_PATH}" 2> /dev/null > file_info.json
file_uri=$(jq -r ".file.uri" file_info.json)
echo file_uri=$file_uri
echo "File uploaded successfully. File URI: ${file_uri}"
# --- 3. Generate content using the uploaded video file ---
echo "Generating content from video..."
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"file_data":{"mime_type": "'"${MIME_TYPE}"'", "file_uri": "'"${file_uri}"'"}},
{"text": "Summarize this video. Then create a quiz with an answer key based on the information in this video."}]
}]
}' 2> /dev/null > response.json
jq -r ".candidates[].content.parts[].text" response.json
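The script above reads the upload URL out of the saved response headers and continues even if the header is missing. As a minimal sketch, the same `grep`/`cut`/`tr` pipeline can be wrapped in a function that fails loudly when no URL is found. The function name `extract_upload_url` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: prints the value of the X-Goog-Upload-URL header
# from a header dump written by `curl -D`, or returns 1 if it is absent.
extract_upload_url() {
  local header_file="$1"
  local url
  # Header names are case-insensitive; strip the trailing carriage return.
  url=$(grep -i '^x-goog-upload-url: ' "$header_file" | head -n 1 | cut -d' ' -f2 | tr -d '\r')
  if [ -z "$url" ]; then
    echo "No X-Goog-Upload-URL header found in ${header_file}" >&2
    return 1
  fi
  printf '%s\n' "$url"
}

# Example usage (commented out because it needs the header file from step 1):
# upload_url=$(extract_upload_url "${tmp_header_file}") || exit 1
```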
To learn more about working with media files, see Files API.
Instead of uploading a video file using the File API, you can pass smaller videos directly in the request to `generateContent`. This is suitable for shorter videos under 20 MB total request size.
Here's an example of providing inline video data:
Note: If you get an `Argument list too long` error, the base64 encoding of your file might be too long for the curl command line. Use the File API method instead for larger files.
VIDEO_PATH=/path/to/your/video.mp4
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{
"inline_data": {
"mime_type":"video/mp4",
"data": "'$(base64 $B64FLAGS $VIDEO_PATH)'"
}
},
{"text": "Please summarize the video in 3 sentences."}
]
}]
}' 2> /dev/null
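If the `Argument list too long` error appears for a file that is still within the inline size limit, one possible workaround is to write the request body to a file and let curl read it from disk. The sketch below assumes this approach is acceptable for your setup; `build_inline_request` and `request.json` are hypothetical names, and the prompt is hard-coded so that no JSON escaping is needed.

```shell
# Hypothetical helper: writes a generateContent request body to a file.
#   $1 = path to the video, $2 = MIME type, $3 = output JSON file
build_inline_request() {
  local video_path="$1"
  local mime_type="$2"
  local out_file="$3"
  {
    printf '{"contents":[{"parts":[{"inline_data":{"mime_type":"%s","data":"' "$mime_type"
    # Reading from stdin and stripping newlines avoids platform-specific base64 flags.
    base64 < "$video_path" | tr -d '\n'
    printf '"}},{"text":"Please summarize the video in 3 sentences."}]}]}\n'
  } > "$out_file"
}

# Example usage (commented out because it needs a real file and API key):
# build_inline_request "$VIDEO_PATH" "video/mp4" request.json
# curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
#   -H 'Content-Type: application/json' \
#   -X POST \
#   --data-binary @request.json
```

The base64 data never passes through the command line this way, but the overall request size limit still applies.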
Preview: The YouTube URL feature is in preview and is available at no charge. Pricing and rate limits are likely to change.
The Gemini API and AI Studio support YouTube URLs as a file data `Part`. You can include a YouTube URL with a prompt asking the model to summarize, translate, or otherwise interact with the video content.
Limitations:
The following example shows how to include a YouTube URL with a prompt:
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "Please summarize the video in 3 sentences."},
{
"file_data": {
"file_uri": "https://www.youtube.com/watch?v=9hE5-98ZeCg"
}
}
]
}]
}' 2> /dev/null
You can ask questions about specific points in time within the video using timestamps of the form `MM:SS`.
PROMPT="What are the examples given at 00:05 and 00:10 supposed to show us?"
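When the points of interest are stored as offsets in seconds, they can be converted to the timestamp format before building the prompt. This is a minimal sketch; `to_mmss` is a hypothetical helper and assumes a plain non-negative integer without leading zeros.

```shell
# Hypothetical helper: converts a number of seconds to MM:SS.
# Assumes a non-negative integer with no leading zeros.
to_mmss() {
  local total_seconds="$1"
  printf '%02d:%02d\n' $(( $total_seconds / 60 )) $(( $total_seconds % 60 ))
}

PROMPT="What are the examples given at $(to_mmss 5) and $(to_mmss 10) supposed to show us?"
echo "$PROMPT"
# prints: What are the examples given at 00:05 and 00:10 supposed to show us?
```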
The Gemini models can transcribe and provide visual descriptions of video content by processing both the audio track and visual frames. For visual descriptions, the model samples the video at a rate of 1 frame per second. This sampling rate may affect the level of detail in the descriptions, particularly for videos with rapidly changing visuals.
PROMPT="Transcribe the audio from this video, giving timestamps for salient events in the video. Also provide visual descriptions."
Gemini supports the following video format MIME types:
video/mp4
video/mpeg
video/mov
video/avi
video/x-flv
video/mpg
video/webm
video/wmv
video/3gpp
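Before uploading, a script can check a detected MIME type against the list above. The following sketch simply mirrors that list; `is_supported_video_mime` is a hypothetical helper, and the list in this guide remains the source of truth if the two ever differ.

```shell
# Hypothetical helper: succeeds (exit status 0) if the MIME type
# appears in the list of supported video formats above.
is_supported_video_mime() {
  case "$1" in
    video/mp4|video/mpeg|video/mov|video/avi|video/x-flv|video/mpg|video/webm|video/wmv|video/3gpp)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example usage (commented out because it needs a real file):
# MIME_TYPE=$(file -b --mime-type "${VIDEO_PATH}")
# is_supported_video_mime "$MIME_TYPE" || echo "Unsupported type: $MIME_TYPE" >&2
```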
- Supported models & context: All Gemini 2.0 and 2.5 models can process video data.
- File API processing: When using the File API, videos are sampled at 1 frame per second (FPS) and audio is processed at 1 Kbps (single channel). Timestamps are added every second.
- Token calculation: Each second of video is tokenized as follows:
- Timestamp format: When referring to specific moments in a video within your prompt, use the `MM:SS` format (e.g., `01:15` for 1 minute and 15 seconds).
- Best practices:
`contents` array.

This guide shows how to upload video files and generate text outputs from video inputs. To learn more, see the following resources: