Data Collection Tools Tutorial
Tutorial on how to collect, process, and analyze AI standards data with these tools
Overview
The Data Collection Tools section provides three tools that gather and process AI standards data from multiple sources. Each tool has a specific purpose, and the three work together as a single workflow.
Three Main Cards
1. One-Click Data Collection
Purpose: Automatically collect AI standards data from three sources:
- AI for Good: ITU AI for Good database
- AI Standards Hub: International standards repository
- China Standards: Chinese national standards
Button: Start Data Collection - Click to begin collecting data from all three sources
Result: Data is saved to your personal database with user isolation
2. Data Processing
Purpose: Process and standardize the collected data through 8 intelligent steps:
- Add region/country information
- Audit and fix prefixes
- Map regions to standard format
- Filter and clean data
- Merge standards from different sources
- Integrate China standards
- Standardize field names
- AI-powered classification (DeepSeek model)
Button: Start Data Processing - Click to process your collected data
Button: Download Processed Data - Download the processed CSV file after completion
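As a rough illustration of two of the steps above (standardizing field names and merging standards from different sources), here is a minimal sketch. The field names, the alias table, the sample records, and the rule of dropping duplicates by title and region are all assumptions made for this example; the tool's actual logic may differ.

```python
# Hypothetical sketch: field names and sample records are invented for illustration.
FIELD_ALIASES = {
    "Title": "title",
    "name": "title",
    "Country": "region",
    "region": "region",
}

def standardize_fields(record):
    """Rename known aliases to one canonical field name; keep other keys as-is."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}

def merge_sources(*sources):
    """Merge records from several sources, dropping duplicates by (title, region)."""
    merged = {}
    for source in sources:
        for record in source:
            clean = standardize_fields(record)
            key = (clean.get("title", "").strip().lower(), clean.get("region", ""))
            merged.setdefault(key, clean)  # first occurrence wins (an assumption)
    return list(merged.values())

hub = [{"Title": "AI Risk Management", "Country": "International"}]
china = [
    {"name": "AI risk management ", "region": "International"},
    {"name": "Machine Learning Terminology", "region": "China"},
]
result = merge_sources(hub, china)
print(len(result))
```

Keeping the first occurrence of a duplicate is only one possible policy; a real pipeline might prefer the record with more complete fields.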
3. Data Analysis
Purpose: Generate visualizations and discover terminology
Buttons:
Run Visualization - Generate 14 interactive HTML charts
Create Glossary - Build AI terminology database from 4 sources
Download Glossary - Download the merged glossary CSV
Discover Terms - Use AI to find new terms in unclassified standards
Download New Terms - Download newly discovered terms
Advanced Options
Click the "Advanced Options" dropdown to configure collection and processing parameters:
Data Source Options
- Use Official AI Standards Database: Skip data collection and use the official pre-collected database (recommended for quick analysis)
- Use Official AI Glossary Database: Skip glossary creation and use the official pre-built glossary
Collection Parameters
- AI Standards Hub Scrape All: Collect all pages (unchecked = limited pages)
- AI Standards Hub Pages: Number of pages to collect (default: 35)
- China Standards Scrape All: Collect all pages (unchecked = limited pages)
- China Standards Pages: Number of pages to collect (default: 10)
Processing Parameters
- Process all batches: Classify all standards (unchecked = limited batches)
- Classification Batch Limit: Number of batches to process (default: 5)
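The interaction between the "all" checkboxes and the numeric limits can be sketched as follows. The numeric defaults (35, 10, 5) come from the list above; the class and field names, the unchecked checkbox defaults, and the batch size of 10 are assumptions chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AdvancedOptions:
    # Numeric defaults are from the tutorial; names and checkbox defaults are hypothetical.
    hub_scrape_all: bool = False
    hub_pages: int = 35
    china_scrape_all: bool = False
    china_pages: int = 10
    process_all_batches: bool = False
    batch_limit: int = 5

def pages_to_collect(scrape_all, page_limit, total_pages):
    """Return how many pages would be collected for one source."""
    return total_pages if scrape_all else min(page_limit, total_pages)

def batches_to_classify(items, batch_size, options):
    """Split items into batches; keep only the first `batch_limit` unless all are requested."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    return batches if options.process_all_batches else batches[:options.batch_limit]

opts = AdvancedOptions()
standards = [f"standard-{n}" for n in range(120)]
limited = batches_to_classify(standards, batch_size=10, options=opts)
print(len(limited), "of 12 batches would be classified")
```

With the limits left in place, only part of the data is collected and classified, which is why the unchecked settings finish faster.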
AI Model Configuration (DeepSeek R1-14b)
How to Use Your Own DeepSeek (⚠️ This requires at least 16 GB of GPU memory on your local computer!):
- Check the checkbox: "Use my own DeepSeek R1-14b deployment"
- Input field will appear: Enter your DeepSeek URL (format: http://your-ip:11434)
- Click "Test Connection": Verify your DeepSeek service is accessible
- Wait for result: Green ✅ = Success, Red ❌ = Failed
- Save automatically: Configuration saves when you leave the input field
Finding Your DeepSeek API Endpoint
The DeepSeek API endpoint is your computer's IP address followed by port 11434:
- Open PowerShell or Command Prompt (on Mac/Linux, open a terminal)
- Run:
ipconfig (on Linux, use ip addr; on macOS, use ifconfig)
- Find IPv4 Address (e.g., 192.168.1.100)
- Your endpoint:
http://192.168.1.100:11434
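If you prefer not to read the address from the command output, the same endpoint can be assembled with a few lines of Python. This is a best-effort sketch: the function names are hypothetical, and on machines with several network adapters the detected address may not be the one the server can reach.

```python
import socket

def local_ipv4():
    """Best-effort guess at this machine's LAN IPv4 address."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects an outgoing interface.
        probe.connect(("8.8.8.8", 80))
        return probe.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available; fall back to loopback
    finally:
        probe.close()

def build_endpoint(ip, port=11434):
    """Format the endpoint URL the way the tutorial shows it."""
    return f"http://{ip}:{port}"

print(build_endpoint(local_ipv4()))
```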
How to Deploy and Configure Your Own DeepSeek
Step 1: Install Ollama
- Download from: https://ollama.com/download
- Install on your computer (Windows/Mac/Linux)
Step 2: Download DeepSeek R1-14b Model
- Open PowerShell or Terminal
- Run:
ollama pull deepseek-r1:14b
- Wait for download to complete (~9GB)
- Verify:
ollama list should show deepseek-r1:14b
Step 3: Configure Ollama to Accept External Connections
⚠️ Important: Temporary vs Permanent Configuration
Temporary (current terminal session only):
- Windows (Command Prompt):
set OLLAMA_HOST=0.0.0.0:11434
- Windows (PowerShell):
$env:OLLAMA_HOST = "0.0.0.0:11434"
- Mac/Linux:
export OLLAMA_HOST=0.0.0.0:11434
- ⚠️ This setting is lost when you close the terminal
Permanent (recommended):
- Windows: Add to System Environment Variables
- Search: Edit System Environment Variables
- Environment Variables → System variables → New
- Variable name:
OLLAMA_HOST
- Variable value:
0.0.0.0:11434
- Restart your computer or terminal
- Mac/Linux: Add to shell profile
- Edit
~/.bashrc or ~/.zshrc
- Add line:
export OLLAMA_HOST=0.0.0.0:11434
- Run:
source ~/.bashrc (or restart terminal)
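To confirm that the permanent setting is visible to newly started programs, you can open a fresh terminal and inspect the variable. The helper below is a hypothetical sketch; it only reads the environment and does not talk to Ollama.

```python
import os

def ollama_host_status(env=os.environ):
    """Describe how OLLAMA_HOST is set in the given environment mapping."""
    value = env.get("OLLAMA_HOST")
    if value is None:
        return "not set"
    host = value.rsplit(":", 1)[0]  # drop the port if one is present
    return "all interfaces" if host == "0.0.0.0" else f"bound to {host}"

print(ollama_host_status())
```

If this prints "not set" in a new terminal, the variable was probably configured only for the earlier session.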
Step 4: Start Ollama Service
- Run:
ollama serve
- Keep this terminal window open
Step 5: Find Your DeepSeek API Endpoint
- Open PowerShell/CMD
- Run:
ipconfig (on Linux, use ip addr; on macOS, use ifconfig)
- Find "IPv4 Address" (e.g., 192.168.1.100 or 10.181.134.69)
- Your endpoint is:
http://[YOUR-IP]:11434
- Example:
http://192.168.1.100:11434
Step 6: Configure Firewall
- Windows:
- Open PowerShell as Administrator
- Run:
netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434
- This allows the server to connect to your computer
- Mac/Linux: Often no configuration is needed (the firewall is frequently disabled by default)
- If using UFW (Ubuntu):
sudo ufw allow 11434/tcp
- If using firewalld (CentOS):
sudo firewall-cmd --add-port=11434/tcp --permanent
sudo firewall-cmd --reload
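One way to check whether the firewall rule took effect is to attempt a plain TCP connection to port 11434 from another machine on the same network. The following is a minimal sketch with a hypothetical helper name; it tests reachability only and says nothing about whether Ollama itself is healthy.

```python
import socket

def port_is_open(host, port=11434, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling port_is_open("192.168.1.100") from a second computer should return True once Ollama is running and the port is allowed through the firewall.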
Step 7: Test Connection (3 Methods)
- Method 1 (Web Interface - Recommended):
- Log in to the website
- Open Advanced Options
- Check "Use my own DeepSeek R1-14b deployment"
- Input field will appear on the right
- Enter your DeepSeek URL
- Example inputs:
http://192.168.1.100:11434 (Ethernet)
http://10.181.134.69:11434 (WiFi)
- Click "Test" button
- Possible results:
- ✅ Connected - Success!
- ❌ Please enter URL - Input is empty
- ❌ Invalid URL - Format error
- ❌ Connection failed - Cannot reach
- Method 2 (Browser): Open
http://your-ip:11434/api/tags in your browser; you should see a JSON response
- Example:
{"models":[{"name":"deepseek-r1:14b","model":"deepseek-r1:14b",...}
- Method 3 (Command Line): Run
curl http://your-ip:11434/api/tags
- Example:
{"models":[{"name":"deepseek-r1:14b","model":"deepseek-r1:14b",...}
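The three methods above can also be combined in a short script. The sketch below requests /api/tags and returns messages modelled on the results listed under Method 1. It is not the website's actual implementation: the function names, the validation rules, and the extra "not installed" message are assumptions for illustration.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_tags(base_url, timeout=5.0):
    """GET <base_url>/api/tags and return the parsed JSON body."""
    with urlopen(base_url.rstrip("/") + "/api/tags", timeout=timeout) as response:
        return json.load(response)

def check_deployment(url, model="deepseek-r1:14b", fetch=fetch_tags):
    """Return a status string similar to the messages listed under Method 1."""
    url = url.strip()
    if not url:
        return "Please enter URL"
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return "Invalid URL"
    try:
        payload = fetch(url)
    except (OSError, ValueError):
        # Network errors are OSError subclasses; malformed JSON raises ValueError.
        return "Connection failed"
    names = [entry.get("name") for entry in payload.get("models", [])]
    # The "not installed" message is an addition for this sketch.
    return "Connected" if model in names else f"Connected, but {model} is not installed"
```

Passing the fetch function as a parameter keeps the validation logic testable without a running Ollama service; in normal use, check_deployment("http://192.168.1.100:11434") performs the real request.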
Step 8: Save Configuration
- Check "Use my own DeepSeek R1-14b deployment"
- Configuration saves automatically
- Now you can use Data Processing with your own DeepSeek!
Workflow & Button Dependencies
1. Start Data Collection
Always available. Click to begin collecting data from three sources. This is the starting point of the workflow.
2. Start Data Processing
Enabled after collection completes. Processes your collected data through 8 steps.
3. Download Processed Data
Enabled after processing completes. Download your processed standards as CSV.
4. Run Visualization
Enabled after processing completes. Generates 14 interactive charts for analysis.
5. Create Glossary
Enabled after processing completes. Builds terminology database from 4 sources.
6. Discover Terms
Enabled after glossary is created. Uses AI to find new terms in your standards.
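The dependencies above amount to a small prerequisite table. The sketch below models them as a dictionary; the data structure and function name are illustrative assumptions, not the site's code, and the official database checkboxes are ignored here.

```python
# Each button maps to the step that must finish first (None = always available).
PREREQUISITES = {
    "Start Data Collection": None,
    "Start Data Processing": "Start Data Collection",
    "Download Processed Data": "Start Data Processing",
    "Run Visualization": "Start Data Processing",
    "Create Glossary": "Start Data Processing",
    "Discover Terms": "Create Glossary",
}

def enabled_buttons(completed):
    """Return the buttons whose prerequisite is in `completed` (a set of finished steps)."""
    return [
        button
        for button, needs in PREREQUISITES.items()
        if needs is None or needs in completed
    ]

print(enabled_buttons({"Start Data Collection"}))
```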
Quick Start Scenarios
Scenario 1: Full Workflow (Collect Your Own Data)
Steps:
- Click Start Data Collection and wait for completion
- Click Start Data Processing and wait for completion
- Click Run Visualization to see 14 interactive charts
- Click Create Glossary to build terminology database
- Click Discover Terms to find new terms with AI
Scenario 2: Quick Analysis (Use Official AI Standards Database)
Steps:
- Open Advanced Options
- Check "Use Official AI Standards Database"
- The first two cards become disabled (no need to collect/process)
- Click Run Visualization immediately
- Click Create Glossary to build your glossary
- Click Discover Terms when ready
Scenario 3: Skip Glossary Creation (Use Official AI Glossary Database)
Steps:
- Open Advanced Options
- Check "Use Official AI Glossary Database"
- Complete normal workflow (collect → process → visualize)
- The Create Glossary button is disabled (no need to create)
- Click Discover Terms directly after processing
Tips & Notes
Pro Tip: Use Official AI Standards Database
For quick analysis without waiting, check "Use Official AI Standards Database" in Advanced Options. This skips the 30-40 minutes of collection and processing time.
⚠️ Important Notes
- You must sign in to use these tools
- Each user's data is isolated and stored separately
- Data collection and processing may take 30-40 minutes depending on sources
- AI classification uses DeepSeek R1 14B model (requires server connection)
Checkbox Logic
Official Standards Database Checkbox
When checked:
- One-Click Data Collection card becomes disabled
- Data Processing card becomes disabled
- Run Visualization becomes immediately available
- Create Glossary becomes immediately available
Official Glossary Database Checkbox
When checked:
- Create Glossary button becomes disabled
- Download Glossary button becomes disabled
- Discover Terms becomes available after processing (skips glossary creation step)
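The combined effect of the two checkboxes can be sketched as one function. This is an interpretation of the rules above, not the site's implementation; the function and parameter names are hypothetical. For the two cards, True means only that the checkbox has not disabled them.

```python
def button_states(official_standards, official_glossary,
                  processed=False, glossary_created=False):
    """Return {name: enabled} for the two cards and the analysis buttons."""
    data_ready = official_standards or processed        # standards data is available
    glossary_ready = official_glossary or glossary_created
    return {
        "One-Click Data Collection": not official_standards,
        "Data Processing": not official_standards,
        "Run Visualization": data_ready,
        "Create Glossary": data_ready and not official_glossary,
        "Download Glossary": glossary_created and not official_glossary,
        "Discover Terms": data_ready and glossary_ready,
    }

both = button_states(official_standards=True, official_glossary=True)
print(both["Run Visualization"], both["Discover Terms"])
```

With both boxes checked, the sketch leaves Run Visualization and Discover Terms enabled while both cards are disabled, which matches the behaviour described in the FAQ.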
❓ Frequently Asked Questions
Q: Why are some buttons disabled?
Buttons follow a workflow sequence. Each step must complete before the next becomes available. Check if you've completed the previous steps or if you've enabled official database options.
Q: How long does data collection take?
Typically 30-40 minutes depending on the number of pages configured in Advanced Options.
Q: Can I use the tools without collecting data?
Yes! Check "Use Official AI Standards Database" in Advanced Options to skip collection and processing.
Q: What happens if I check both official database checkboxes?
Both collection and processing cards become disabled. You can directly use Run Visualization and Discover Terms with official data.
Q: Can I use my own DeepSeek model for AI classification?
Yes! You can deploy DeepSeek R1-14b on your own computer and configure the system to use it:
- Install Ollama on your computer (https://ollama.com/)
- Pull the model:
ollama pull deepseek-r1:14b
- Configure Ollama to accept external connections:
set OLLAMA_HOST=0.0.0.0:11434 (Windows Command Prompt; on Mac/Linux, use export OLLAMA_HOST=0.0.0.0:11434)
- Start Ollama:
ollama serve
- Open firewall port 11434
- In Advanced Options, check "Use my own DeepSeek R1-14b deployment"
- Enter your DeepSeek URL (e.g., http://192.168.1.100:11434)
- Click "Test Connection" to verify
Q: Why can't the server connect to my DeepSeek service?
Common reasons and solutions:
- Firewall blocking: Ensure port 11434 is open on your computer
- Different networks: Your computer and the server must be on the same network or use VPN/ZeroTier
- Wrong URL: Verify your IP address with
ipconfig and ensure Ollama is running
- Model not found: Ensure DeepSeek R1-14b is installed:
ollama list
Q: What if my DeepSeek service becomes unavailable during processing?
The processing will fail with an error message. You can:
- Fix your DeepSeek service and retry the processing
- Uncheck "Use my own DeepSeek" to use the shared service
- Check that your local computer has at least 16 GB of GPU memory
Q: Is my DeepSeek configuration private?
Yes. Each user's DeepSeek configuration is stored privately in the database. Other users cannot see or use your configuration.