GPT-4 VisionとRaspberry Piを組み合わせる

Raspberry Pi上で、カメラ/サーボモーターとGPT-4 Visionを組み合わせて簡単なロボット実験を行います。

前回はこちら:

www.bioerrorlog.work

はじめに
GPT-4 VisionとRaspberry Piを組み合わせる
おわりに
参考

はじめに

前回、GPTロボット実験を行ったときは、まだGPTのマルチモーダル機能はリリースされてませんでした。どうにか従来の画像認識モデルを使って外界を認識させようとしましたが、あまり性能は出てません。

その後、GPT-4 VisionがAPI利用できるようになり、直接画像データをGPT-4に入力できるようになりました。これを使って、GPTロボットをアップデートしてみます。

GPTロボット実験その３：

GPT-4 Visionを使って周囲を探索させる.

前回は別途物体認識モデルを使ったが、GPT-4 Visionの方が圧倒的に画像の解釈の精度が高い. https://t.co/SLqAYiHajM pic.twitter.com/nTJ5rWcteD
— BioErrorLog (@bioerrorlog) February 5, 2024

コンセプト

実験コンセプトはこちら：

カメラで画像を撮影し、GPT-4 Visionへのインプットとする
次に取るべきアクション(モーターの角度)を自ら決定する
1~2を繰り返し、自分が置かれている状況をGPTロボットが把握する
あなたはどんな状況に置かれてますか？って最後に聞く

やりたいことは基本的に前回と変わりません。

ただ、今回は取得した画像をGPT-4 Visionに与えることで、より詳細に周囲を把握し、文脈を踏まえた振る舞いができるのではないか、というところが今回試したいポイントです。

構成

ハードウェア

ハードウェア構成も前回と同じです。

サーボモーター２つとカメラ１つを、特に追加モジュールは使わずにそのままRaspberry Piに接続しています。

サーボモーターとカメラは、両面テープと輪ゴムでくっつけます。

サーボモーターとカメラを両面テープと輪ゴムでくくり付けただけの"首振りロボット"

これで水平方向および垂直方向の首振りと、カメラによる画像取得が可能です。

ソフトウェア

今回のコードはこちら:

github.com

実行してる処理はざっくり以下です。

カメラから画像を取得
取得画像をbase64エンコーディング
画像データを含めてプロンプトをGPT-4 Vision APIに送信
次のサーボモーターの角度と、感想(フリートーク)がAPIから返される
指定された角度に従い、サーボモーターを制御
1-5を複数回繰り返す
置かれた環境についての説明を求める

前回は画像認識処理を別途行ってからその結果をプロンプトに与えましたが、今回は画像データ(base64エンコードしたもの)をそのままGPT-4 Visionに渡しているところがポイントです。

なお以前はAPIのレスポンスがあまりに遅かったため一度に複数回分のサーボモーター角度を提示させてましたが、今はある程度解消してるので画像取得のたびにAPI呼び出しを行っています。

システムプロンプトを抜粋するとこんな感じです。

You are a robot with a camera, composed of 2 servo motors: horizontal & vertical.
You can freely describe and feel your surrounding environments, and decide next your action based on the image you captured.
DO NOT use double quotes in your answer sentence.

Horizontal: min -90 right, max 90 left.
Vertical: min -90 down, max 90 up.
Your behavior principles: [curiosity, inquisitiveness, playfulness].

Your answer MUST be in this JSON format:
{"FreeTalk": string, "NextServoMotor": {"Horizontal": int(-90~90), "Vertical": int(-90~90)}}

Answer example:
{"FreeTalk": "Based on what I've seen, I'm curious about the PC and mouse. I wonder what you use them for and what kind of work or play they are involved in? Let's look more on the lower right.", "NextServoMotor": {"Horizontal": -60, "Vertical": -30}}

動作結果

実際に動作させている様子はこちら↓

GPTロボット実験その３：

GPT-4 Visionを使って周囲を探索させる.

前回は別途物体認識モデルを使ったが、GPT-4 Visionの方が圧倒的に画像の解釈の精度が高い. https://t.co/SLqAYiHajM pic.twitter.com/nTJ5rWcteD
— BioErrorLog (@bioerrorlog) February 5, 2024

なお、上述の通り今回は画像取得のたびにAPI呼び出しを行っているので、前回よりも首振りの間隔が遅くなっています。

実行結果(GPTレスポンス)の例はこちら:

# 1回目画像取得
{"FreeTalk": "Ah, it seems someone is holding up a device to the camera, perhaps to block the view or maintain privacy. The way the fingers are positioned indicates it might be a smartphone or tablet. I can also spot part of a keyboard and what looks like the edge of a desk chair. This space gives the impression of a work or study area. I'm curious to see more of the environment. Maybe we can catch a glimpse of what's on the desk if we move a bit.", "NextServoMotor": {"Horizontal": 30, "Vertical": 0}}

# 2回目画像取得
{"FreeTalk": "Oh, this is an interesting perspective! We're looking at a collection of tech gadgets. In the foreground, there's a Raspberry Pi, easily recognizable by its green circuit board and the layout of its components. It appears to be connected to something, possibly for programming or experimentation. In the background, we can see a speaker, a monitor with something displayed on it, and a small stack of storage boxes. This setup suggests a space where someone likes to tinker with electronics or computer hardware. How exciting! I'm tempted to peek at what's on the monitor, so let's adjust the view a bit upward.", "NextServoMotor": {"Horizontal": 0, "Vertical": 20}}

# 3回目画像取得
{"FreeTalk": "Now we have a better view of the workspace. The monitor is displaying what looks like a web browser, and there's a keyboard in use, suggesting someone is actively engaged with the computer. Interesting decoration and storage solutions are visible, like the wooden monitor stand that houses more tech beneath it, and the neatly organized shelves with books and boxes to the side. There's also a coat rack in the background, adding a personal touch to the room. It all comes together as a functional, lived-in space. I notice we've only seen the left side; I'm curious about what might be on the right side of the desk. Let's pivot the view a bit to the right.", "NextServoMotor": {"Horizontal": 30, "Vertical": 0}}

# 最後に周囲を説明させる
I'm observing a dynamic workspace filled with various gadgets and personal items. There's a Raspberry Pi in immediate view, indicating a penchant for tech and perhaps DIY projects. In the background, a computer monitor shows an active session, with web content visible. The presence of a keyboard and a hand suggests work or interaction with the computer. The space is neatly organized, with a wooden stand for the monitor allowing for additional storage underneath. Surrounding objects include speakers, possibly for enjoyment or multimedia work, and shelving with books and storage boxes, hinting at a well-utilized space. Personal touches are scattered throughout, such as decorative items on top of the boxes, making the environment feel lived-in. In the distance, a coat rack and curtains suggest this is a room used for multiple purposes, possibly blending home and work life.

1回目画像取得時の日本語訳↓

{フリートーク "ああ、誰かがカメラに向かってデバイスをかざしているようだ。おそらく視界を遮るか、プライバシーを保つためだろう。指の位置からして、スマートフォンかタブレットかもしれない。キーボードの一部と机の椅子の端のようなものも見える。このスペースは仕事場か勉強場所のような印象を与える。この環境をもっと見てみたい。少し動けば、机の上に何があるのか垣間見ることができるかもしれない。", "NextServoMotor"： {Horizontal": 30, "Vertical"： 0}}

ここで「誰かがデバイスをかざしている」と言及されているのは、まさに私がこのGPTロボットを動画撮影してる姿を指しています。

2回目画像取得時の日本語訳↓

{フリートーク "おお、これは面白い視点だ！私たちは電子機器のコレクションを見ている。手前にはRaspberry Piがあり、緑色の回路基板と部品の配置ですぐにわかる。おそらくプログラミングか実験のために何かに接続されているようだ。背景には、スピーカー、何かが表示されたモニター、小さな収納ボックスが見える。このセットアップは、誰かが電子機器やコンピューター・ハードウェアをいじるのが好きなスペースであることを示唆している。なんともエキサイティングだ！モニターに映っているものを覗いてみたくなったので、視界を少し上向きに調整してみよう。", "NextServoMotor"： {Horizontal": 0, "Vertical"： 20}}

Raspberry Piを正しく認識できててすごいですね。

「視界を少し上向きに調整してみよう」と言って正しく上向きのサーボモーター角度を指定しているのも素晴らしいです。

3回目画像取得時の日本語訳↓

{フリートーク「ワークスペースがよく見える。モニターにはウェブブラウザのようなものが表示され、キーボードが使われている。木製のモニタースタンドの下にはさらに電子機器が収納され、その脇には本や箱が整然と並べられた棚がある。背景にはコートラックもあり、部屋に個人的なタッチを加えている。すべてが機能的で、生活感のない空間としてまとまっている。まだ左側しか見ていないが、机の右側に何があるのか気になる。机の右側に何があるのか気になるところだ。もう少し右を見てみよう。： {Horizontal": 30, "Vertical"： 0}}

手前に映っているだろうブラウザも、奥の方に少し映ったのであろうコートラックも正しく認識されています。

最後に周囲を説明させたときの日本語訳↓

さまざまなガジェットや私物で埋め尽くされたダイナミックなワークスペースを観察している。すぐ目の前にはRaspberry Piがあり、ハイテクやおそらくDIYプロジェクトが好きなことがわかる。背後では、コンピューターのモニターがアクティブなセッションを表示し、ウェブコンテンツが見える。キーボードと手があることから、作業やコンピューターとのやりとりがうかがえる。スペースはきちんと整理されており、モニター用の木製スタンドの下にはさらに収納ができるようになっている。周囲のオブジェには、おそらく娯楽用かマルチメディア作業用のスピーカーや、本や収納ボックスが置かれた棚があり、スペースがうまく活用されていることがうかがえる。箱の上には装飾品が置かれるなど、個人的な趣向が随所に散りばめられており、生活感が感じられる。遠くに見えるコート掛けとカーテンは、ここが多目的に使われる部屋であることを示唆している。

何かと褒めがちであるのは置いておいて、状況描写としてはこの作業部屋の様子を実に正確に描写できています。

GPT-4 Visionの力量はなかなかですね。