Data Collection Tools Tutorial

Tutorial on how to collect, process, and analyze AI standards data with these tools

Overview

The Data Collection Tools section provides three tools for gathering and processing AI standards data from multiple sources. Each tool has a specific purpose, and the three are used in sequence as a single workflow.

Three Main Cards

1. One-Click Data Collection

Purpose: Automatically collect AI standards data from three sources:

AI for Good: ITU AI for Good database
AI Standards Hub: International standards repository
China Standards: Chinese national standards

Button: Start Data Collection - Click to begin collecting data from all three sources

Result: Data is saved to your personal database with user isolation

2. Data Processing

Purpose: Process and standardize the collected data through 8 intelligent steps:

Add region/country information
Audit and fix prefixes
Map regions to standard format
Filter and clean data
Merge standards from different sources
Integrate China standards
Standardize field names
AI-powered classification (DeepSeek model)

Button: Start Data Processing - Click to process your collected data

Button: Download Processed Data - Download the processed CSV file after completion
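
Once the processed CSV has been downloaded, a quick tally can help confirm that it looks reasonable. The following is a minimal sketch of one way to do that in Python. The file name processed_standards.csv and the column name region are placeholders, since this tutorial does not list the actual file or column names.

```python
import csv
from collections import Counter


def count_by_column(csv_lines, column):
    """Count how often each value appears in one column.

    csv_lines may be an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(csv_lines)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Example usage (file and column names are hypothetical):
# with open("processed_standards.csv", newline="", encoding="utf-8") as f:
#     print(count_by_column(f, "region").most_common(5))
```

If the column is missing, the function raises KeyError and lists the columns it did find, which is a convenient way to discover the real field names.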

3. Data Analysis

Purpose: Generate visualizations and discover terminology

Buttons:

Run Visualization - Generate 14 interactive HTML charts
Create Glossary - Build AI terminology database from 4 sources
Download Glossary - Download the merged glossary CSV
Discover Terms - Use AI to find new terms in unclassified standards
Download New Terms - Download newly discovered terms

Advanced Options

Click the "Advanced Options" dropdown to configure collection and processing parameters:

Data Source Options

Use Official AI Standards Database: Skip data collection and use the official pre-collected database (recommended for quick analysis)
Use Official AI Glossary Database: Skip glossary creation and use the official pre-built glossary

Collection Parameters

AI Standards Hub Scrape All: Collect all pages (unchecked = limited pages)
AI Standards Hub Pages: Number of pages to collect (default: 35)
China Standards Scrape All: Collect all pages (unchecked = limited pages)
China Standards Pages: Number of pages to collect (default: 10)

Processing Parameters

Process all batches: Classify all standards (unchecked = limited batches)
Classification Batch Limit: Number of batches to process (default: 5)
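
The interaction between the "all" checkboxes and the numeric limits above can be pictured with a small sketch. This is an illustration only, not the tool's actual code: the class and function names are made up, and batch_size is an assumed parameter because the tutorial does not say how many standards are in a batch. The defaults (35, 10, 5) are the ones listed above.

```python
import math
from dataclasses import dataclass


@dataclass
class AdvancedOptions:
    # Defaults as listed in the tutorial; the field names are hypothetical.
    hub_scrape_all: bool = False
    hub_pages: int = 35
    china_scrape_all: bool = False
    china_pages: int = 10
    process_all_batches: bool = False
    batch_limit: int = 5


def pages_to_collect(scrape_all, page_limit, pages_available):
    """Pages fetched from one source: everything, or at most page_limit."""
    return pages_available if scrape_all else min(page_limit, pages_available)


def batches_to_classify(opts, total_standards, batch_size):
    """Batches sent to the classifier (batch_size is an assumption)."""
    total_batches = math.ceil(total_standards / batch_size)
    if opts.process_all_batches:
        return total_batches
    return min(opts.batch_limit, total_batches)
```

Under these assumptions, 120 standards with a batch size of 10 form 12 batches, and the default limit of 5 would classify only the first 50 standards. Checking "Process all batches" would classify all of them.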

AI Model Configuration (DeepSeek R1-14b)

How to Use Your Own DeepSeek (⚠️ This requires at least 16 GB of GPU memory on your local computer!):

Check the checkbox: "Use my own DeepSeek R1-14b deployment"
Input field will appear: Enter your DeepSeek URL (format: http://your-ip:11434)
Click "Test Connection": Verify your DeepSeek service is accessible
Wait for result: Green ✅ = Success, Red ❌ = Failed
Save automatically: Configuration saves when you leave the input field
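
Before clicking "Test Connection", it can help to check the URL format yourself. The sketch below is a guess at the kind of validation involved, not the site's actual implementation. The function name is hypothetical, and the returned strings simply mirror the messages listed in the test results later in this tutorial.

```python
from urllib.parse import urlparse


def check_deepseek_url(url):
    """Classify a user-entered URL as empty, malformed, or plausibly valid."""
    url = url.strip()
    if not url:
        return "Please enter URL"
    try:
        parsed = urlparse(url)
        parsed.port  # accessing .port raises ValueError for a non-numeric port
    except ValueError:
        return "Invalid URL"
    # Require an explicit scheme and a host, e.g. http://192.168.1.100:11434
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return "Invalid URL"
    return "OK"
```

A common mistake this catches is leaving off the http:// prefix: 192.168.1.100:11434 on its own is reported as invalid. A result of "OK" only means the format is plausible; whether the service is reachable still has to be tested.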

Finding Your DeepSeek API Endpoint

The DeepSeek API endpoint is your computer's IP address + port 11434:

Open PowerShell or Command Prompt (or a terminal on Mac/Linux)
Run: ipconfig (on Mac/Linux, run ifconfig or ip addr and look for the inet entry)
Find IPv4 Address (e.g., 192.168.1.100)
Your endpoint: http://192.168.1.100:11434
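
The same lookup can be sketched in Python. This is an optional convenience under stated assumptions, not part of the tool: both function names are made up, and the address guess returns only the interface the operating system would use for outbound traffic, so a machine with both Ethernet and WiFi may have a second address that ipconfig shows and this does not.

```python
import ipaddress
import socket


def endpoint_for(ip, port=11434):
    """Build the endpoint URL from an IPv4 address, rejecting malformed input."""
    ipaddress.IPv4Address(ip)  # raises ValueError if ip is not valid IPv4
    return f"http://{ip}:{port}"


def guess_local_ipv4():
    """Best-effort guess at this machine's LAN address.

    Connecting a UDP socket sends no packets; it only asks the OS which
    local address it would use to reach the target.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("192.0.2.1", 9))  # reserved documentation address
        return s.getsockname()[0]


# Example usage:
# print(endpoint_for(guess_local_ipv4()))
```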

How to Deploy and Configure Your Own DeepSeek

Step 1: Install Ollama

Download from: https://ollama.com/download
Install on your computer (Windows/Mac/Linux)

Step 2: Download DeepSeek R1-14b Model

Open PowerShell or Terminal
Run: ollama pull deepseek-r1:14b
Wait for download to complete (~9GB)
Verify: ollama list should show deepseek-r1:14b

Step 3: Configure Ollama to Accept External Connections

⚠️ Important: Temporary vs Permanent Configuration

Temporary (current terminal session only):

Windows (Command Prompt): set OLLAMA_HOST=0.0.0.0:11434
Windows (PowerShell): $env:OLLAMA_HOST = "0.0.0.0:11434"
Mac/Linux: export OLLAMA_HOST=0.0.0.0:11434
⚠️ This setting is lost when you close the terminal

Permanent (recommended):

Windows: Add to System Environment Variables
- Search: Edit System Environment Variables
- Environment Variables → System variables → New
- Variable name: OLLAMA_HOST
- Variable value: 0.0.0.0:11434
- Restart your computer, or at least close and reopen the terminal and restart Ollama, so the new value is picked up
Mac/Linux: Add to shell profile
- Edit ~/.bashrc or ~/.zshrc
- Add line: export OLLAMA_HOST=0.0.0.0:11434
- Run: source ~/.bashrc (or restart terminal)
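
To confirm the variable is visible in a new terminal, a short check like the following can be run from that terminal. It is a sketch with a hypothetical function name, and it handles only the IPv4 forms used in this tutorial. When OLLAMA_HOST is unset, Ollama listens on 127.0.0.1 only, which is why an empty value is treated as "not reachable from other machines".

```python
import os


def listens_on_all_interfaces(value):
    """True if an OLLAMA_HOST-style value binds to 0.0.0.0 (all IPv4 interfaces)."""
    if not value:
        return False  # unset or empty: local-only default
    host = value.split("://", 1)[-1]  # drop an optional scheme prefix
    return host.split(":", 1)[0] == "0.0.0.0"


# Reads the environment of this Python process, so run it from a terminal
# opened after the variable was set.
print(listens_on_all_interfaces(os.environ.get("OLLAMA_HOST")))
```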

Step 4: Start Ollama Service

Run: ollama serve
Keep this terminal window open

Step 5: Find Your DeepSeek API Endpoint

Open PowerShell/CMD (or a terminal on Mac/Linux)
Run: ipconfig (on Mac/Linux, run ifconfig or ip addr and look for the inet entry)
Find "IPv4 Address" (e.g., 192.168.1.100 or 10.181.134.69)
Your endpoint is: http://[YOUR-IP]:11434
Example: http://192.168.1.100:11434

Step 6: Configure Firewall

Windows:
- Open PowerShell as Administrator
- Run: netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434
- This allows the server to connect to your computer
Mac/Linux: Usually no configuration needed (firewall disabled by default)
- If using UFW (Ubuntu): sudo ufw allow 11434/tcp
- If using firewalld (CentOS): sudo firewall-cmd --add-port=11434/tcp --permanent, then sudo firewall-cmd --reload to apply the rule
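
A simple way to confirm the firewall rule worked is to attempt a TCP connection to port 11434 from a different machine on the same network. The sketch below does that with Python's standard library. The function name is hypothetical, and a successful connection shows only that the port is reachable, not that Ollama is the program answering.

```python
import socket


def port_is_open(host, port=11434, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage from another computer (replace with your own address):
# print(port_is_open("192.168.1.100"))
```

If this returns True on the Ollama machine itself (using 127.0.0.1) but False from another computer, the likely causes are the firewall or an OLLAMA_HOST value that still binds to the local interface only.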

Step 7: Test Connection (3 Methods)

Method 1 (Web Interface - Recommended):
- Log in to the website
- Open Advanced Options
- Check "Use my own DeepSeek R1-14b deployment"
- Input field will appear on the right
- Enter your DeepSeek URL
- Example inputs:
  - http://192.168.1.100:11434 (Ethernet)
  - http://10.181.134.69:11434 (WiFi)
- Click "Test" button
- Possible results:
  - ✅ Connected - Success!
  - ❌ Please enter URL - Input is empty
  - ❌ Invalid URL - Format error
  - ❌ Connection failed - Cannot reach
Method 2 (Browser): Open http://your-ip:11434/api/tags in a browser; you should see a JSON response
- Example: {"models":[{"name":"deepseek-r1:14b","model":"deepseek-r1:14b",...}
Method 3 (Command Line): Run curl http://your-ip:11434/api/tags
- Example: {"models":[{"name":"deepseek-r1:14b","model":"deepseek-r1:14b",...}
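
Methods 2 and 3 can also be scripted. The sketch below fetches /api/tags with Python's standard library and checks whether deepseek-r1:14b appears in the returned model list. The function names are made up for illustration, and the parsing assumes the response shape shown in the examples above (a "models" list whose entries have a "name" field).

```python
import json
from urllib.request import urlopen


def model_names(tags_payload):
    """Extract model names from the JSON text returned by /api/tags."""
    data = json.loads(tags_payload)
    return [m.get("name", "") for m in data.get("models", [])]


def has_model(base_url, wanted="deepseek-r1:14b", timeout=5):
    """Fetch <base_url>/api/tags and report whether `wanted` is installed."""
    url = f"{base_url.rstrip('/')}/api/tags"
    with urlopen(url, timeout=timeout) as resp:
        return wanted in model_names(resp.read().decode("utf-8"))


# Example usage (replace with your own endpoint):
# print(has_model("http://192.168.1.100:11434"))
```

has_model raises an exception from urllib (a subclass of OSError) when the service cannot be reached, which distinguishes "cannot connect" from "connected, but the model is not installed".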

Step 8: Save Configuration

Check "Use my own DeepSeek R1-14b deployment"
Configuration saves automatically
Now you can use Data Processing with your own DeepSeek!

Workflow & Button Dependencies

Start Data Collection

Always available. Click to begin collecting data from three sources. This is the starting point of the workflow.

Start Data Processing

Enabled after collection completes. Processes your collected data through 8 steps.

Download Processed Data

Enabled after processing completes. Download your processed standards as CSV.

Run Visualization

Enabled after processing completes. Generates 14 interactive charts for analysis.

Create Glossary

Enabled after processing completes. Builds terminology database from 4 sources.

Discover Terms

Enabled after glossary is created. Uses AI to find new terms in your standards.
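
The dependencies above amount to a small prerequisite table. The following sketch encodes them as data, purely to make the sequence explicit. The step names ("collection", "processing", "glossary") and the function are illustrative and do not come from the tool itself. The two download buttons for glossary and new terms are left out because their conditions are not stated in this section.

```python
# Button -> the step that must have completed first (None = always available).
PREREQUISITES = {
    "Start Data Collection": None,
    "Start Data Processing": "collection",
    "Download Processed Data": "processing",
    "Run Visualization": "processing",
    "Create Glossary": "processing",
    "Discover Terms": "glossary",
}


def enabled_buttons(completed_steps):
    """Return the buttons whose prerequisite step is in completed_steps."""
    return [
        button
        for button, needed in PREREQUISITES.items()
        if needed is None or needed in completed_steps
    ]
```

For instance, with only "collection" completed, the enabled buttons are Start Data Collection and Start Data Processing; everything that depends on processing stays disabled.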

Quick Start Scenarios

Scenario 1: Full Workflow (Collect Your Own Data)

Steps:

Click Start Data Collection and wait for completion
Click Start Data Processing and wait for completion
Click Run Visualization to see 14 interactive charts
Click Create Glossary to build terminology database
Click Discover Terms to find new terms with AI

Scenario 2: Quick Analysis (Use Official AI Standards Database)

Steps:

Open Advanced Options
Check "Use Official AI Standards Database"
The first two cards become disabled (no need to collect/process)
Click Run Visualization immediately
Click Create Glossary to build your glossary
Click Discover Terms when ready

Scenario 3: Skip Glossary Creation (Use Official AI Glossary Database)

Steps:

Open Advanced Options
Check "Use Official AI Glossary Database"
Complete normal workflow (collect → process → visualize)
Create Glossary button is disabled (no need to create)
Click Discover Terms directly after processing

Tips & Notes

Pro Tip: Use Official AI Standards Database

For quick analysis without waiting, check "Use Official AI Standards Database" in Advanced Options. This skips the 30-40 minute collection and processing time.

⚠️ Important Notes

You must sign in to use these tools
Each user's data is isolated and stored separately
Data collection and processing may take 30-40 minutes depending on sources
AI classification uses the DeepSeek R1-14b model (requires server connection)

Checkbox Logic

Official Standards Database Checkbox

When checked:

One-Click Data Collection card becomes disabled
Data Processing card becomes disabled
Run Visualization becomes immediately available
Create Glossary becomes immediately available

Official Glossary Database Checkbox

When checked:

Create Glossary button becomes disabled
Download Glossary button becomes disabled
Discover Terms becomes available after processing (skips glossary creation step)
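
The effect of the two checkboxes can be summarized as "treat that stage as already done". The sketch below expresses the rules above as one function. It is an interpretation of the described behavior, not the site's code: the function and parameter names are hypothetical, and the condition for Download Glossary when nothing is checked (enabled once a glossary exists) is an assumption.

```python
def button_states(official_standards=False, official_glossary=False,
                  collected=False, processed=False, glossary_created=False):
    """Return {button: enabled} for a given combination of checkboxes and progress."""
    # The official standards database stands in for collected and processed data.
    data_ready = processed or official_standards
    # The official glossary stands in for a user-built glossary.
    glossary_ready = glossary_created or official_glossary
    return {
        "Start Data Collection": not official_standards,
        "Start Data Processing": collected and not official_standards,
        "Run Visualization": data_ready,
        "Create Glossary": data_ready and not official_glossary,
        "Download Glossary": glossary_created and not official_glossary,  # assumed rule
        "Discover Terms": data_ready and glossary_ready,
    }
```

With both checkboxes ticked and no work done, only Run Visualization and Discover Terms come out enabled, which matches the FAQ answer about checking both options.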

❓ Frequently Asked Questions

Q: Why are some buttons disabled?

Buttons follow a workflow sequence. Each step must complete before the next becomes available. Check if you've completed the previous steps or if you've enabled official database options.

Q: How long does data collection take?

Typically 30-40 minutes depending on the number of pages configured in Advanced Options.

Q: Can I use the tools without collecting data?

Yes! Check "Use Official AI Standards Database" in Advanced Options to skip collection and processing.

Q: What happens if I check both official database checkboxes?

Both collection and processing cards become disabled. You can directly use Run Visualization and Discover Terms with official data.

Q: Can I use my own DeepSeek model for AI classification?

Yes! You can deploy DeepSeek R1-14b on your own computer and configure the system to use it:

Install Ollama on your computer (https://ollama.com/)
Pull the model: ollama pull deepseek-r1:14b
Configure Ollama to accept external connections: set OLLAMA_HOST=0.0.0.0:11434 (Windows Command Prompt) or export OLLAMA_HOST=0.0.0.0:11434 (Mac/Linux)
Start Ollama: ollama serve
Open firewall port 11434
In Advanced Options, check "Use my own DeepSeek R1-14b deployment"
Enter your DeepSeek URL (e.g., http://192.168.1.100:11434)
Click "Test Connection" to verify

Q: Why can't the server connect to my DeepSeek service?

Common reasons and solutions:

Firewall blocking: Ensure port 11434 is open on your computer
Different networks: Your computer and the server must be on the same network or use VPN/ZeroTier
Wrong URL: Verify your IP address with ipconfig and ensure Ollama is running
Model not found: Ensure DeepSeek R1-14b is installed: ollama list

Q: What if my DeepSeek service becomes unavailable during processing?

The processing will fail with an error message. You can:

Fix your DeepSeek service and retry the processing
Uncheck "Use my own DeepSeek R1-14b deployment" to use the shared service
Check that your local computer has at least 16 GB of GPU memory

Q: Is my DeepSeek configuration private?

Yes. Each user's DeepSeek configuration is stored privately in the database. Other users cannot see or use your configuration.
