🦊 Data Kitsune

Idea by Alexey @darkolorin Moiseenkov
Design by Oleksandr @alexbemore Shatov
Executed by Vladimir @matterai Vlasiuk

Data Kitsune is a microservices-based application for content extraction, processing, and delivery through Telegram. It leverages multiple AI services, including OpenAI, Gemini, Grok, and R2R for enhanced content understanding and retrieval.

Overview

The application functions as a Telegram bot that:

Captures links shared in Telegram
Extracts content using Firecrawl
Uses AI to process and summarize the content
Stores the content in a searchable database
Delivers summaries and notifications to users

Architecture

The application is built using a microservices architecture with the following components:

Services

telegram-listener: Processes incoming Telegram messages and commands
links-saver: Saves shared links to the database
content-parser: Extracts content from links using Firecrawl and AI
description-writer: Generates summaries of extracted content
r2r-injector: Stores processed content in R2R for vectorized search
updates-sender: Sends notifications to subscribed users
finalizer: Completes the processing pipeline and sends final notifications

Data Stores

PostgreSQL: Primary database for storing all application data
Redis: Used for message queuing between services
R2R: Retrieval-augmented generation (RAG) system used for semantic search

AI Integration

The application integrates with multiple AI providers:

OpenAI: For content summarization and understanding
Google Vertex AI (Gemini): Alternative AI provider for content processing
Grok: Additional AI capabilities for analyzing X.com posts
R2R: For retrieval-augmented generation (RAG) search capabilities

Features

Link Processing: Extract and summarize content from various source types (websites, YouTube, Twitter, etc.)
Search: Search for content using natural language queries via R2R
Notifications: Receive summaries of shared links
Scheduling: Schedule regular updates for channels you're interested in
Subscription Management: Subscribe to and unsubscribe from updates
Pagination: Navigate through large result sets
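
As an illustration of the Link Processing feature, the sketch below shows one way a shared link could be sorted into a source type before extraction. The `SourceType` names, the host lists, and the `classifyLink` helper are assumptions made for this example and are not taken from the project's code.

```typescript
// Hypothetical source categories, based on the examples in the feature list.
type SourceType = "youtube" | "twitter" | "website";

// Host names treated as each source type (an assumption, not the project's list).
const YOUTUBE_HOSTS = new Set(["youtube.com", "youtu.be"]);
const TWITTER_HOSTS = new Set(["twitter.com", "x.com"]);

// Returns the source type for an http(s) link, or null if the text is not a usable link.
function classifyLink(link: string): SourceType | null {
  let url: URL;
  try {
    url = new URL(link);
  } catch {
    return null; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;

  // Drop common "www." / "m." prefixes so "www.youtube.com" matches "youtube.com".
  const host = url.hostname.toLowerCase().replace(/^(www|m)\./, "");

  if (YOUTUBE_HOSTS.has(host)) return "youtube";
  if (TWITTER_HOSTS.has(host)) return "twitter";
  return "website";
}
```

A pipeline could use a result like this to pick the matching extraction prompt, with `"website"` as the fallback for any other http(s) link.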

Installation

Prerequisites

Docker and Docker Compose
Node.js 18+ (for development)
API keys for:
- Telegram Bot
- OpenAI
- Google Vertex AI
- Grok
- Firecrawl

Configuration Files

timezones.json

This file contains a mapping of city names to their UTC time offsets. It is used by the TimeParserService to convert user-friendly timezone references (like "New York" or "Tokyo") into numeric UTC offsets. This enables users to schedule updates using familiar city names rather than remembering UTC time offsets.

Example:

{
  "london": 0,
  "new york": -5,
  "tokyo": 9
}

The values represent hours offset from UTC. This file is loaded at application startup and used when users schedule updates with the /schedule command.
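
The lookup described above can be sketched as follows. This is a minimal illustration, not the actual TimeParserService: the `resolveOffset` and `toUtcTime` helpers and the `"HH:MM"` input format are assumptions made for this example.

```typescript
// Same shape as the timezones.json example: lower-case city name -> UTC offset in hours.
type TimezoneMap = Record<string, number>;

const timezones: TimezoneMap = { london: 0, "new york": -5, tokyo: 9 };

// Hypothetical helper: resolve a city name to its offset, or undefined if the city is unknown.
function resolveOffset(city: string, map: TimezoneMap): number | undefined {
  // Normalize case and whitespace so "  New   York " matches the "new york" key.
  const key = city.trim().toLowerCase().replace(/\s+/g, " ");
  return Object.prototype.hasOwnProperty.call(map, key) ? map[key] : undefined;
}

// Hypothetical helper: convert a local "HH:MM" time at the given offset to UTC "HH:MM".
function toUtcTime(localTime: string, offsetHours: number): string {
  const match = /^(\d{1,2}):(\d{2})$/.exec(localTime);
  if (!match) throw new Error(`Invalid time: ${localTime}`);
  const hours = Number(match[1]);
  const minutes = Number(match[2]);
  if (hours > 23 || minutes > 59) throw new Error(`Invalid time: ${localTime}`);

  // UTC = local time - offset; work in minutes and wrap around midnight.
  const total = (((hours * 60 + minutes - offsetHours * 60) % 1440) + 1440) % 1440;
  const hh = String(Math.floor(total / 60)).padStart(2, "0");
  const mm = String(total % 60).padStart(2, "0");
  return `${hh}:${mm}`;
}
```

For example, `toUtcTime("09:00", resolveOffset("New York", timezones) ?? 0)` yields `"14:00"`. Because the file stores fixed offsets, a lookup like this does not account for daylight saving time.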

prompts.json

This file contains structured prompt templates for different LLM interactions. Each template has a name and consists of blocks with specific types (system/user) and content that can include variable placeholders like {{URL}} or {{CONTENT}}.

The application uses these prompts for various tasks:

Extracting content from websites, YouTube videos, and tweets
Generating descriptions for different content types
Formatting messages for various AI providers (OpenAI, Gemini, Grok)

Example prompt structure:

{
  "describe-website-prompt": {
    "prompt": [
      {
        "type": "system",
        "content": "System instruction here..."
      },
      {
        "type": "user",
        "content": "User instruction with {{CONTENT}} placeholder"
      }
    ]
  }
}
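
To show how a template of this shape might be filled in, here is a minimal sketch of placeholder substitution. The type names and the `renderPrompt` helper are assumptions made for this example; the project's own loader may work differently.

```typescript
// Types mirroring the example structure above.
type PromptBlock = { type: "system" | "user"; content: string };
type PromptTemplate = { prompt: PromptBlock[] };
type PromptFile = Record<string, PromptTemplate>;

const prompts: PromptFile = {
  "describe-website-prompt": {
    prompt: [
      { type: "system", content: "System instruction here..." },
      { type: "user", content: "User instruction with {{CONTENT}} placeholder" },
    ],
  },
};

// Hypothetical helper: fill {{NAME}} placeholders in every block of a named template.
// Placeholders with no matching variable are left untouched.
function renderPrompt(
  file: PromptFile,
  name: string,
  vars: Record<string, string>,
): PromptBlock[] {
  const template = file[name];
  if (!template) throw new Error(`Unknown prompt: ${name}`);

  return template.prompt.map((block) => ({
    type: block.type,
    // A replacer function is used so that "$" sequences in the values are inserted literally.
    content: block.content.replace(/\{\{(\w+)\}\}/g, (whole: string, key: string) => {
      const value = Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : undefined;
      return value ?? whole;
    }),
  }));
}
```

Calling `renderPrompt(prompts, "describe-website-prompt", { CONTENT: pageText })` returns the system and user blocks with the placeholder replaced, ready to be mapped to whatever message format a given provider expects.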

Setup

Clone the repository

Create a .env file based on default.env and add your API keys:

TELEGRAM_BOT_API_TOKEN=your_telegram_bot_token
TELEGRAM_BOT_WEBHOOK_DOMAIN=your_webhook_domain
TELEGRAM_BOT_WEBHOOK_PATH=/webhook
TELEGRAM_BOT_USERNAME=your_bot_username
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
GOOGLE_PROJECT_ID=your_google_project_id
GOOGLE_LOCATION=your_google_location
GOOGLE_CREDENTIALS=your_base64_encoded_credentials
GROK_API_KEY=your_grok_key

Set up the R2R service (required for search functionality):
- Follow the instructions at R2R Documentation
- Update the R2R configuration in your .env file:
```
R2R_BASE_URL=http://r2r:7272
[email protected]
R2R_PASSWORD=your_secure_password
```

Launch the application using Docker Compose:

docker compose build && docker compose up -d
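
Before launching, it can help to confirm that the variables from the .env step are actually set. The sketch below is one way to do that; the variable names are copied from the example above, while the `findMissingVars` helper itself is hypothetical and not part of the project.

```typescript
// Variable names copied from the .env example in the Setup section.
const REQUIRED_VARS = [
  "TELEGRAM_BOT_API_TOKEN",
  "TELEGRAM_BOT_WEBHOOK_DOMAIN",
  "TELEGRAM_BOT_WEBHOOK_PATH",
  "TELEGRAM_BOT_USERNAME",
  "FIRECRAWL_API_KEY",
  "OPENAI_API_KEY",
  "GOOGLE_PROJECT_ID",
  "GOOGLE_LOCATION",
  "GOOGLE_CREDENTIALS",
  "GROK_API_KEY",
] as const;

type Env = Record<string, string | undefined>;

// Hypothetical helper: return the names of required variables that are missing or blank.
function findMissingVars(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node.js process this could be called as `findMissingVars(process.env)` at startup, failing fast with the list of missing names instead of surfacing an error later in the pipeline.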

Development

For development:

Install dependencies:
```
pnpm install
```
Run specific services in development mode:
```
pnpm app -m telegram-listener
```
Run a specific job:
```
pnpm app -m schedule-updates -t job
```

Database Migrations

To add a new migration:

pnpm migration:generate src/migrations/MigrationName

To run migrations:
```
pnpm migration:run
```

Usage

Telegram Bot Configuration

Before using the bot in groups, you must configure it properly:

Contact @BotFather on Telegram
Select your bot and go to "Bot Settings"
Select "Group Privacy"
Choose "Disable" - this is required for the bot to see and process messages in groups

Unless Group Privacy is disabled, the bot sees only commands explicitly directed to it, not regular messages containing links.

Using the Bot

Once the bot is properly configured and running, add it to your Telegram chats or groups. You can:

Share links in the chat for automatic processing
Use commands like /search, /help, /subscribe, /unsubscribe
Schedule updates with /schedule
View statistics with /stats
Get summaries with /summary

Monitoring

The application exposes a health endpoint for each service. It can be monitored with Prometheus and visualized with Grafana, using data from the metrics endpoint.

License

This project is licensed under the MIT License - see the LICENSE file for details.
