# Gemini Skill
English | [中文](./README.md)
Automate the Gemini web app (gemini.google.com) via CDP (Chrome DevTools Protocol): AI image generation, conversations, image extraction, and more.
## ✨ Features
- 🎨 **AI Image Generation**: Send prompts to generate images, with full-size high-res download support
- 💬 **Text Conversations**: Multi-turn dialogue with Gemini
- 🖼️ **Image Upload**: Upload reference images for image-to-image generation
- 📥 **Image Extraction**: Extract images from sessions via base64 or CDP full-size download
- 🔄 **Session Management**: New chat, temp chat, model switching, navigation to historical sessions
- 🧹 **Auto Watermark Removal**: Downloaded images automatically have the Gemini watermark stripped
- 🤖 **MCP Server**: Standard MCP protocol interface, callable by any MCP client (Claude, CodeBuddy, etc.)
## 📸 Example
Generate game-style sticker images through AI conversation:
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────┐
│                   MCP Client (AI)                   │
│              Claude / CodeBuddy / ...               │
└──────────────────────┬──────────────────────────────┘
                       │ stdio (JSON-RPC)
                       ▼
┌─────────────────────────────────────────────────────┐
│         mcp-server.js (MCP Protocol Layer)          │
│        Registers all MCP tools, orchestrates        │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│      index.js → browser.js (Connection Layer)       │
│   ensureBrowser() → auto-start Daemon → CDP link    │
└──────────┬──────────────────────────────┬───────────┘
           │ HTTP (acquire/status)        │ WebSocket (CDP)
           ▼                              ▼
┌──────────────────────┐      ┌─────────────────────────┐
│    Browser Daemon    │      │      Chrome / Edge      │
│ (standalone process) │─────▶│    gemini.google.com    │
│   daemon/server.js   │      │                         │
│   ├─ engine.js       │      │  Stealth + anti-detect  │
│   ├─ handlers.js     │      └─────────────────────────┘
│   └─ lifecycle.js    │
│   30-min idle TTL    │
└──────────────────────┘
```
**Core Design Principles:**
- **Daemon Mode**: The browser process is managed by a standalone Daemon. After MCP calls finish, the browser stays alive; it auto-terminates only after 30 minutes of inactivity.
- **On-demand Auto-start**: If the Daemon isn't running, MCP tools automatically spawn it. No manual startup required.
- **Stealth Anti-detect**: Uses `puppeteer-extra-plugin-stealth` to bypass website bot detection.
- **Separation of Concerns**: `mcp-server.js` (protocol) → `gemini-ops.js` (operations) → `browser.js` (connection) → `daemon/` (process management)
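The on-demand auto-start path (`ensureBrowser()` in the diagram) can be sketched roughly as below. This is a hypothetical illustration, not the code in `browser.js`: `isDaemonHealthy`, `spawnDaemon`, and `acquireBrowser` are invented names, injected as parameters so the control flow can be read and exercised without a real daemon. The 15-second readiness window follows the Daemon Lifecycle section; the 250 ms poll interval is an assumption.

```javascript
// Hypothetical sketch of an ensureBrowser()-style flow. The three helpers are
// injected so the control flow is visible without a real daemon or browser.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function ensureBrowser({
  isDaemonHealthy, // () => Promise<boolean>, e.g. probe GET /health (assumed)
  spawnDaemon, // () => void, start the detached daemon process (assumed)
  acquireBrowser, // () => Promise<{ wsEndpoint }>, e.g. GET /browser/acquire (assumed)
  readyTimeoutMs = 15000, // "poll until ready (up to 15s)"
  pollIntervalMs = 250, // assumed value
}) {
  if (!(await isDaemonHealthy())) {
    spawnDaemon();
    const deadline = Date.now() + readyTimeoutMs;
    // Poll the health check until the daemon answers or the deadline passes.
    while (!(await isDaemonHealthy())) {
      if (Date.now() >= deadline) {
        throw new Error(`Daemon not ready after ${readyTimeoutMs} ms`);
      }
      await sleep(pollIntervalMs);
    }
  }
  // The daemon launches or reuses the browser and hands back a CDP endpoint;
  // the caller would then attach to it over WebSocket.
  return acquireBrowser();
}
```

In a real connector, `isDaemonHealthy` might call `GET /health`, and `acquireBrowser` might call `GET /browser/acquire` and then attach with `puppeteer.connect({ browserWSEndpoint })`.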
## 📦 Installation
### Prerequisites
- **Node.js** ≥ 18
- **Chrome / Edge / Chromium**: Any one of these must be installed on your system (or specify a path via `BROWSER_PATH`)
- The browser must be **logged into a Google account** beforehand (Gemini requires authentication)
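Since `BROWSER_PATH` is optional, some form of browser auto-detection is implied. A minimal sketch follows, assuming a simple "explicit path first, then first existing candidate" strategy. The candidate list holds common default install locations and is not taken from this project's source; `findBrowser` is a hypothetical name.

```javascript
// Hypothetical sketch of browser auto-detection. The candidate paths are common
// default install locations (an assumption, not the project's actual list).
const CANDIDATES = {
  win32: [
    "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
    "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
    "C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
  ],
  darwin: [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
    "/Applications/Chromium.app/Contents/MacOS/Chromium",
  ],
  linux: [
    "/usr/bin/google-chrome",
    "/usr/bin/microsoft-edge",
    "/usr/bin/chromium",
    "/usr/bin/chromium-browser",
  ],
};

// An explicit BROWSER_PATH always wins; otherwise return the first candidate
// that exists, or null. `exists` is injected (e.g. fs.existsSync) for testability.
function findBrowser({ env, platform, exists }) {
  if (env.BROWSER_PATH) return env.BROWSER_PATH;
  const list = CANDIDATES[platform] || [];
  const found = list.find((p) => exists(p));
  return found ?? null;
}
```

It would be called as `findBrowser({ env: process.env, platform: process.platform, exists: fs.existsSync })`, with `fs` imported from `node:fs`.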
### Install Dependencies
```bash
git clone https://github.com/WJZ-P/gemini-skill.git
cd gemini-skill
npm install
```
## ⚙️ Configuration
All configuration is done via environment variables or a `.env` file. Create a `.env` file in the project root:
```env
# Browser executable path (auto-detects Chrome/Edge/Chromium if unset)
# BROWSER_PATH=C:\Program Files\Google\Chrome\Application\chrome.exe
# CDP remote debugging port (default: 40821)
# BROWSER_DEBUG_PORT=40821
# Headless mode (default: false; keep it off for first-time login)
# BROWSER_HEADLESS=false
# Image output directory (default: ./gemini-image)
# OUTPUT_DIR=./gemini-image
# Daemon HTTP port (default: 40225)
# DAEMON_PORT=40225
# Daemon idle timeout in ms (default: 30 minutes)
# DAEMON_TTL_MS=1800000
```
`.env.development` is also supported (takes priority over `.env`).
**Priority order:** `process.env` > `.env.development` > `.env` > code defaults
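The precedence rule above amounts to a layered merge in which later sources override earlier ones. The sketch below assumes the two files have already been parsed into plain objects (for example with `dotenv.parse`); `resolveConfig` and the returned field names are hypothetical, and `config.js` may be organized differently.

```javascript
// Sketch of the documented precedence:
//   process.env > .env.development > .env > code defaults
// Defaults mirror the values listed in the .env example.
const DEFAULTS = {
  BROWSER_DEBUG_PORT: "40821",
  BROWSER_HEADLESS: "false",
  OUTPUT_DIR: "./gemini-image",
  DAEMON_PORT: "40225",
  DAEMON_TTL_MS: "1800000",
};

function resolveConfig({ processEnv = {}, envDevelopment = {}, envFile = {} } = {}) {
  // Later spreads override earlier ones, so sources go from lowest to highest priority.
  const merged = { ...DEFAULTS, ...envFile, ...envDevelopment, ...processEnv };
  return {
    browserPath: merged.BROWSER_PATH, // undefined means "auto-detect"
    debugPort: Number(merged.BROWSER_DEBUG_PORT),
    headless: merged.BROWSER_HEADLESS === "true",
    outputDir: merged.OUTPUT_DIR,
    daemonPort: Number(merged.DAEMON_PORT),
    daemonTtlMs: Number(merged.DAEMON_TTL_MS),
  };
}
```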
## 🚀 Usage
### Option 1: As an MCP Server (Recommended)
Add the following to your MCP client configuration, replacing `/src/mcp-server.js` with the absolute path to `src/mcp-server.js` in your local clone:
```json
{
  "mcpServers": {
    "gemini": {
      "command": "node",
      "args": ["/src/mcp-server.js"]
    }
  }
}
```
Once started, the AI can invoke all tools via the MCP protocol.
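Under the hood, each tool invocation is a JSON-RPC 2.0 `tools/call` request written to the server's stdin, one message per line. The sketch below only shows the shape of a single request; a real MCP client first performs the `initialize` handshake and reads the matching response from stdout. `buildToolCall` is a hypothetical helper.

```javascript
// Shape of one MCP tools/call request as it would travel over stdio:
// a single JSON-RPC 2.0 object serialized on one line, terminated by "\n".
function buildToolCall(id, name, args) {
  const message = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
  // JSON.stringify without indentation escapes newlines inside strings,
  // so the serialized message never spans more than one line.
  return JSON.stringify(message) + "\n";
}

// Example: ask the server to run the image pipeline.
const line = buildToolCall(1, "gemini_generate_image", {
  prompt: "Draw a cute cat",
  fullSize: true,
});
```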
### Option 2: Command Line
```bash
# Start MCP Server (stdio mode, for AI clients)
npm run mcp
# Start Browser Daemon standalone (usually unnecessary; MCP auto-starts it)
npm run daemon
# Run the demo
npm run demo
```
### Option 3: As a Library
```javascript
import { createGeminiSession, disconnect } from './src/index.js';
const { ops } = await createGeminiSession();
// Generate an image
const result = await ops.generateImage('Draw a cute cat', { fullSize: true });
console.log('Image saved to:', result.filePath);
// Disconnect when done (browser stays alive, managed by Daemon)
disconnect();
```
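The snippet above only disconnects on the happy path; if `generateImage` throws, `disconnect()` is never reached. A `try`/`finally` wrapper is one way to guarantee the CDP connection is released. `withGeminiSession` is a hypothetical helper; `createSession` and `disconnect` are passed in so the sketch runs standalone, and it tolerates `disconnect` being either sync or async.

```javascript
// Hypothetical helper: always release the CDP connection, even if the work throws.
// In this project, createSession/disconnect would be the createGeminiSession and
// disconnect exports from ./src/index.js.
async function withGeminiSession(createSession, disconnect, work) {
  const session = await createSession();
  try {
    return await work(session.ops);
  } finally {
    // Only the WebSocket is closed; the Daemon keeps the browser alive.
    await disconnect();
  }
}
```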
## 🔧 MCP Tools
### Image Generation
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `gemini_generate_image` | Full image generation pipeline (takes 60–120s) | `prompt`, `newSession`, `referenceImages`, `fullSize`, `timeout` |
### Session Management
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `gemini_new_chat` | Start a new blank conversation | None |
| `gemini_temp_chat` | Enter temporary chat mode (no history saved) | None |
| `gemini_navigate_to` | Navigate to a specific Gemini URL (e.g. a saved session) | `url`, `timeout` |
### Model & Conversation
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `gemini_switch_model` | Switch model (pro / quick / think) | `model` |
| `gemini_send_message` | Send text and wait for reply (takes 10–60s) | `message`, `timeout` |
### Image Operations
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `gemini_upload_images` | Upload images to the input box | `images` |
| `gemini_get_images` | List all images in the current session (metadata only) | None |
| `gemini_extract_image` | Extract image base64 data and save locally | `imageUrl` |
| `gemini_download_full_size_image` | Download full-size high-res image | `index` |
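`gemini_extract_image` works with base64 image data. How `gemini-ops.js` actually obtains that data is not described in this README, so the sketch below assumes it arrives as a `data:` URL and shows only the decoding step; `decodeImageDataUrl` and the MIME-to-extension map are hypothetical.

```javascript
// Sketch of the base64 decoding step: split a data URL into MIME type,
// a file extension, and raw bytes (Buffer is a Node.js global).
const EXT_BY_MIME = { "image/png": "png", "image/jpeg": "jpg", "image/webp": "webp" };

function decodeImageDataUrl(dataUrl) {
  const match = /^data:([\w.+-]+\/[\w.+-]+);base64,(.+)$/.exec(dataUrl);
  if (!match) throw new Error("Not a base64 data URL");
  const [, mimeType, payload] = match;
  return {
    mimeType,
    ext: EXT_BY_MIME[mimeType] ?? "bin", // fall back for unknown types
    buffer: Buffer.from(payload, "base64"),
  };
}
```

Saving is then a single `fs.writeFileSync(filePath, buffer)` into `OUTPUT_DIR`.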
### Text Responses
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `gemini_get_all_text_responses` | Get all text responses in the session | None |
| `gemini_get_latest_text_response` | Get the latest text response | None |
### Diagnostics & Management
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `gemini_check_login` | Check Google login status | None |
| `gemini_probe` | Probe page element states | None |
| `gemini_reload_page` | Reload the page | `timeout` |
| `gemini_browser_info` | Get browser connection info | None |
## 🔄 Daemon Lifecycle
```
First MCP call
  │
  ├─ Daemon not running → auto-spawn (detached + unref)
  │     └─ poll until ready (up to 15s)
  │
  ├─ GET /browser/acquire → launch/reuse browser + reset 30-min countdown
  │
  ├─ MCP tool finishes → disconnect() (closes WebSocket, keeps browser alive)
  │
  ├─ Another call within 30 min → countdown resets (extends TTL)
  │
  └─ 30 min with no activity → close browser + stop HTTP server + exit process
        (next call will auto-respawn)
```
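The reset-on-activity countdown in the diagram is a classic idle timer. A minimal sketch, assuming a single `setTimeout` that is cleared and re-armed on every acquire; `createIdleTimer`, `touch`, and `cancel` are hypothetical names rather than the `lifecycle.js` API. The timer functions are injectable so tests can drive them deterministically.

```javascript
// Hypothetical lazy-shutdown timer: every touch() restarts the countdown,
// and onExpire runs only after ttlMs with no further touches.
function createIdleTimer({ ttlMs, onExpire, setTimer = setTimeout, clearTimer = clearTimeout }) {
  let handle = null;
  return {
    touch() {
      if (handle !== null) clearTimer(handle); // drop the previous countdown
      handle = setTimer(() => {
        handle = null;
        onExpire();
      }, ttlMs);
    },
    cancel() {
      if (handle !== null) clearTimer(handle);
      handle = null;
    },
  };
}
```

With the default TTL this would look like `createIdleTimer({ ttlMs: 1800000, onExpire: shutdown })` (where `shutdown` is a hypothetical function that closes the browser, stops the HTTP server, and exits), with `touch()` called from the acquire handler only so that status queries don't extend the daemon's life.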
**Daemon API Endpoints:**
| Endpoint | Description |
|----------|-------------|
| `GET /browser/acquire` | Acquire browser connection (resets TTL) |
| `GET /browser/status` | Query browser status (does NOT reset TTL) |
| `POST /browser/release` | Manually destroy the browser |
| `GET /health` | Daemon health check |
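The endpoint table maps naturally onto a method-plus-path lookup. The sketch below is hypothetical (`createRouter`, the response bodies, and the `daemon`/`idleTimer` interfaces are assumptions), but it encodes the one documented behavioral difference: `acquire` resets the TTL while `status` does not.

```javascript
// Hypothetical routing table for the daemon's HTTP API.
// Only /browser/acquire counts as activity; /browser/status is read-only.
function createRouter({ daemon, idleTimer }) {
  const routes = {
    "GET /browser/acquire": async () => {
      idleTimer.touch(); // reset the idle TTL
      return { status: 200, body: await daemon.acquire() };
    },
    "GET /browser/status": async () => ({ status: 200, body: daemon.status() }),
    "POST /browser/release": async () => {
      await daemon.release();
      return { status: 200, body: { released: true } };
    },
    "GET /health": async () => ({ status: 200, body: { ok: true } }),
  };
  return async function handle(method, path) {
    const route = routes[`${method} ${path}`];
    return route ? route() : { status: 404, body: { error: "Not found" } };
  };
}
```

To serve it, a `node:http` server could pass `req.method` and the URL pathname to `handle`, then write `JSON.stringify(body)` with the returned status code.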
## 📁 Project Structure
```
gemini-skill/
├── src/
│   ├── index.js              # Unified entry point
│   ├── mcp-server.js         # MCP protocol server (registers all tools)
│   ├── gemini-ops.js         # Gemini page operations (core logic)
│   ├── operator.js           # Low-level DOM operation wrappers
│   ├── browser.js            # Browser connector (Skill-facing)
│   ├── config.js             # Centralized configuration
│   ├── util.js               # Utility functions
│   ├── watermark-remover.js  # Image watermark removal (via sharp)
│   ├── demo.js               # Usage examples
│   ├── assets/               # Static assets
│   └── daemon/               # Browser Daemon (standalone process)
│       ├── server.js         # HTTP micro-service entry
│       ├── engine.js         # Browser engine (launch/connect/terminate)
│       ├── handlers.js       # API route handlers
│       └── lifecycle.js      # Lifecycle control (lazy shutdown timer)
├── references/               # Reference documentation
├── SKILL.md                  # AI invocation spec (read by MCP clients)
├── package.json
└── .env                      # Environment config (create manually)
```
## ⚠️ Notes
1. **First-time login required**: On the first run, the browser will open the Gemini page. Complete the Google account login manually. Login state is persisted in `userDataDir`, so subsequent runs won't require re-login.
2. **Single instance only**: Only one browser instance can use a given CDP port. Running multiple instances will cause port conflicts.
3. **Windows Server considerations**: Path normalization and Safe Browsing bypass are built in, but double-check that:
   - Chrome/Edge is properly installed
   - The output directory is writable
   - The firewall is not blocking localhost traffic
4. **Image generation takes time**: Typically 60–120 seconds. Set your MCP client's `timeoutMs` to ≥ 180000 (3 minutes).
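If your client has no `timeoutMs` setting, or you are calling the library directly, a generic deadline wrapper gives similar protection. This is a plain `Promise.race` sketch, not part of this project; `withTimeout` is a hypothetical name and the 180000 ms default follows the recommendation in note 4.

```javascript
// Generic sketch: race a long-running call against a deadline so the caller
// gives up cleanly instead of hanging indefinitely.
function withTimeout(promise, ms = 180000) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Clear the timer whichever side settles first, so it doesn't keep Node alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Note that a timeout only stops the caller from waiting; the underlying Gemini generation in the browser is not cancelled by this wrapper.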
## 📄 License
ISC
## LINUX DO
This project supports the [LINUX DO](https://linux.do) community.