
Hugging Face Provider Guide

Access 100,000+ open-source AI models through Hugging Face's free inference API


Overview

Hugging Face is the world's largest platform for open-source AI models, hosting over 100,000 models spanning text generation, code generation, translation, summarization, and more. NeuroLink's Hugging Face provider gives you free access to this vast ecosystem through a unified interface.

Free Tier Advantage

Hugging Face's inference API is completely free for most models, with a generous daily cap (~1,000 requests/day per model). Perfect for development, testing, and low-to-medium production workloads without any cost concerns.
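
If you want to stay under that per-model daily cap, a small client-side counter can gate requests before they go out. The helper below is purely illustrative (NeuroLink does not ship it), and the 1,000/day figure is the approximate cap mentioned above:

```javascript
// Minimal per-model daily budget tracker (illustrative sketch, not a NeuroLink API).
const DAILY_CAP = 1000;
const usage = new Map(); // model -> { day, count }

function underDailyCap(model, cap = DAILY_CAP) {
  const today = new Date().toISOString().slice(0, 10); // e.g. "2025-01-15"
  const entry = usage.get(model);
  if (!entry || entry.day !== today) {
    // First request for this model today: reset the counter
    usage.set(model, { day: today, count: 1 });
    return true;
  }
  if (entry.count >= cap) return false; // budget exhausted for today
  entry.count++;
  return true;
}
```

Call `underDailyCap(model)` before each request and switch to another model (or provider) when it returns false.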

Key Benefits

  • 🆓 Free Access: No API costs - completely free to use
  • 🌍 100,000+ Models: Largest collection of open-source models
  • 🔓 Open Source: All models are open and transparent
  • ⚡ Quick Start: No credit card required
  • 🎯 Specialized Models: Models fine-tuned for specific tasks
  • 🔬 Research-Friendly: Access to latest research models

Use Cases

  • Experimentation: Try different models without cost concerns
  • Research: Access cutting-edge research models
  • Budget-Constrained: Production usage without API costs
  • Specialized Tasks: Fine-tuned models for specific domains
  • Learning: Perfect for students and developers learning AI

Quick Start

1. Get Your API Token

  1. Visit https://huggingface.co
  2. Create a free account (no credit card required)
  3. Go to Settings → Access Tokens
  4. Click "New token"
  5. Give it a name (e.g., "NeuroLink")
  6. Select "Read" permissions
  7. Copy the token (starts with hf_...)

2. Configure Your Environment

Add the token to your .env file:

HUGGINGFACE_API_KEY=hf_your_token_here

Security Best Practice

Never commit your API token to version control. Always use environment variables and add .env to your .gitignore file.

3. Test the Setup

# CLI - Test with default model
npx @juspay/neurolink generate "Hello from Hugging Face!" --provider huggingface

# CLI - Use specific model
npx @juspay/neurolink generate "Write a poem" --provider huggingface --model "Qwen/Qwen2.5-72B-Instruct"

# SDK
node -e "
const { NeuroLink } = require('@juspay/neurolink');
(async () => {
  const ai = new NeuroLink();
  const result = await ai.generate({
    input: { text: 'Hello from Hugging Face!' },
    provider: 'huggingface'
  });
  console.log(result.content);
})();
"

Model Selection Guide

1. General Text Generation

| Model | Size | Description | Best For |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-72B-Instruct | 72B | Qwen 2.5 instruction-tuned (default) | General tasks, high quality |
| Qwen/Qwen3-235B-A22B | 235B | Latest Qwen 3 MoE flagship | Complex reasoning, multilingual |
| Qwen/Qwen3-32B | 32B | Qwen 3 dense model | Balanced quality and speed |
| Qwen/Qwen3-8B | 8B | Qwen 3 efficient model | Fast responses, low cost |
| meta-llama/Llama-3.3-70B-Instruct | 70B | Meta Llama 3.3 instruction-tuned | Conversational AI, reasoning |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 17B MoE | Meta Llama 4 Scout | Efficient multimodal tasks |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | 17B MoE | Meta Llama 4 Maverick | Advanced multimodal reasoning |
| deepseek-ai/DeepSeek-R1 | 671B | DeepSeek reasoning model | Math, logic, step-by-step |
| deepseek-ai/DeepSeek-V3-0324 | 671B | DeepSeek V3 general-purpose | General tasks, coding |
| mistralai/Mistral-Large-2501 | 123B | Mistral Large 3 | Enterprise, multilingual |
| mistralai/Mistral-Small-3.1-24B-Instruct | 24B | Mistral Small 3.1 | Fast, cost-effective |
| google/gemma-3-27b-it | 27B | Google Gemma 3 instruction-tuned | General tasks, research |
| google/gemma-3-12b-it | 12B | Google Gemma 3 mid-size | Balanced performance |
| google/gemma-3-4b-it | 4B | Google Gemma 3 lightweight | Edge deployment, fast |
| microsoft/phi-4 | 14B | Microsoft Phi-4 | Reasoning, STEM tasks |
| microsoft/Phi-4-mini-instruct | 3.8B | Microsoft Phi-4-mini | Lightweight, on-device |

2. Code Generation

| Model | Description | Best For |
| --- | --- | --- |
| mistralai/Devstral-Small-2507 | Mistral Devstral 2 code model | Code generation, IDE |
| Qwen/Qwen2.5-Coder-32B-Instruct | Qwen 2.5 code specialist | Complex coding tasks |
| deepseek-ai/DeepSeek-V3-0324 | DeepSeek V3 with strong code performance | Full-stack development |
| meta-llama/Llama-3.3-70B-Instruct | Llama 3.3 with code capabilities | Code review, refactoring |

3. Summarization

| Model | Description | Best For |
| --- | --- | --- |
| facebook/bart-large-cnn | News summarization | Articles, news |
| Qwen/Qwen3-8B | Qwen 3 with summarization | General summaries |
| google/pegasus-xsum | Extreme summarization | Very brief summaries |

4. Translation

| Model | Languages | Best For |
| --- | --- | --- |
| facebook/mbart-large-50-many-to-many-mmt | 50 languages | Multi-language translation |
| Helsinki-NLP/opus-mt-* | Language pairs | Specific language pairs |

5. Question Answering

| Model | Description | Best For |
| --- | --- | --- |
| deepset/roberta-base-squad2 | SQuAD-trained | Factual Q&A |
| Qwen/Qwen2.5-72B-Instruct | General QA via chat | Open-ended Q&A |

Model Selection by Use Case

// General conversation
const general = await ai.generate({
  input: { text: "Explain quantum computing" },
  provider: "huggingface",
  model: "Qwen/Qwen2.5-72B-Instruct",
});

// Code generation
const code = await ai.generate({
  input: { text: "Write a Python function to sort a list" },
  provider: "huggingface",
  model: "mistralai/Devstral-Small-2507",
});

// Summarization
const summary = await ai.generate({
  input: { text: "Summarize: [long article text]" },
  provider: "huggingface",
  model: "facebook/bart-large-cnn",
});

// Translation
const translation = await ai.generate({
  input: { text: "Translate to French: Hello, how are you?" },
  provider: "huggingface",
  model: "facebook/mbart-large-50-many-to-many-mmt",
});

Free Tier Details

What's Included

  • Free access to public models - no API costs
  • No credit card required
  • Rate limits: ~1,000 requests/day per model (generous)
  • Access to 100,000+ public models

Rate Limits

  • Per Model: ~1,000 requests/day
  • Strategy: Use different models to scale
  • Best Practice: Combine with other providers for production

// Rate limit friendly approach
const ai = new NeuroLink({
  providers: [
    { name: "huggingface", priority: 1 }, // Free tier first
    { name: "google-ai", priority: 2 }, // Fallback to Google AI
  ],
});

Limitations

⚠️ Free Tier Constraints:

  • Models load on-demand (first request may be slow)
  • Rate limits per model (use multiple models to scale)
  • No guaranteed uptime (community infrastructure)
  • Some popular models may have queues

💡 For Production:

  • Use Hugging Face for experimentation
  • Consider paid inference for critical workloads
  • Combine with other providers for reliability

SDK Integration

Basic Usage

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink();

// Simple generation
const result = await ai.generate({
  input: { text: "Write a haiku about coding" },
  provider: "huggingface",
});

console.log(result.content);

With Specific Model

// Use Qwen 2.5 for instruction following
const qwen = await ai.generate({
  input: { text: "Explain Docker in simple terms" },
  provider: "huggingface",
  model: "Qwen/Qwen2.5-72B-Instruct",
});

// Use Qwen Coder for code generation
const coder = await ai.generate({
  input: { text: "Create a REST API endpoint in Express.js" },
  provider: "huggingface",
  model: "Qwen/Qwen2.5-Coder-32B-Instruct",
});

Multi-Model Strategy

// Try multiple models for best results
const models = [
  "Qwen/Qwen2.5-72B-Instruct",
  "meta-llama/Llama-3.3-70B-Instruct",
  "google/gemma-3-27b-it",
];

for (const model of models) {
  try {
    const result = await ai.generate({
      input: { text: "Your prompt here" },
      provider: "huggingface",
      model,
    });
    console.log(`${model}: ${result.content}`);
  } catch (error) {
    console.log(`${model} failed, trying next...`);
  }
}
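
The loop above compares outputs from every model; when you just need the first model that succeeds (for example, to ride out a rate limit or cold start), returning early is simpler. A sketch, where generateWithFallback is a hypothetical helper rather than a NeuroLink API:

```javascript
// Return the first successful result across candidate models (illustrative sketch).
async function generateWithFallback(ai, prompt, candidateModels) {
  for (const model of candidateModels) {
    try {
      return await ai.generate({
        input: { text: prompt },
        provider: "huggingface",
        model,
      });
    } catch (error) {
      // Ignore the failure (e.g. rate limit, model loading) and try the next model
    }
  }
  throw new Error("All candidate models failed");
}
```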

With Streaming

// Stream responses for better UX
const result = await ai.stream({
  input: { text: "Write a long story about space exploration" },
  provider: "huggingface",
  model: "Qwen/Qwen2.5-72B-Instruct",
});

for await (const chunk of result.stream) {
  process.stdout.write(chunk.content);
}

With Error Handling

try {
  const result = await ai.generate({
    input: { text: "Your prompt" },
    provider: "huggingface",
    maxTokens: 500,
    temperature: 0.7,
  });
  console.log(result.content);
} catch (error) {
  if (error.message.includes("rate limit")) {
    console.log("Rate limited - try another model or wait");
  } else if (error.message.includes("loading")) {
    console.log("Model is loading - try again in a moment");
  } else {
    console.error("Error:", error.message);
  }
}

CLI Usage

Basic Commands

# Generate with default model
npx @juspay/neurolink generate "Hello world" --provider huggingface

# Use specific model
npx @juspay/neurolink gen "Write code" --provider huggingface --model "Qwen/Qwen2.5-Coder-32B-Instruct"

# Stream response
npx @juspay/neurolink stream "Tell a story" --provider huggingface

# Check available models
npx @juspay/neurolink models --provider huggingface

Advanced Usage

# With temperature control
npx @juspay/neurolink gen "Creative story" \
  --provider huggingface \
  --model "Qwen/Qwen2.5-72B-Instruct" \
  --temperature 0.9 \
  --max-tokens 1000

# Save output to file
npx @juspay/neurolink gen "Technical documentation" \
  --provider huggingface \
  --model "google/gemma-3-27b-it" \
  > output.txt

# Interactive mode
npx @juspay/neurolink loop --provider huggingface

Model Comparison

# Compare different models
for model in "Qwen/Qwen2.5-72B-Instruct" \
  "meta-llama/Llama-3.3-70B-Instruct" \
  "google/gemma-3-27b-it"; do
  echo "Testing $model:"
  npx @juspay/neurolink gen "What is AI?" \
    --provider huggingface \
    --model "$model"
  echo "---"
done

Configuration Options

Environment Variables

# Required
HUGGINGFACE_API_KEY=hf_your_token_here

# Optional
HUGGINGFACE_BASE_URL=https://api-inference.huggingface.co # Custom endpoint
HUGGINGFACE_MODEL=Qwen/Qwen2.5-72B-Instruct # Default model
HUGGINGFACE_TIMEOUT=60000 # Request timeout (ms)

Programmatic Configuration

const ai = new NeuroLink({
  providers: [
    {
      name: "huggingface",
      config: {
        apiKey: process.env.HUGGINGFACE_API_KEY,
        defaultModel: "Qwen/Qwen2.5-72B-Instruct",
        timeout: 60000,
      },
    },
  ],
});

Troubleshooting

Common Issues

1. "Model is currently loading"

Problem: Model hasn't been used recently and needs to load.

Solution:

# Wait 20-30 seconds and retry
# Or use a popular model that's always loaded
npx @juspay/neurolink gen "test" \
  --provider huggingface \
  --model "Qwen/Qwen2.5-72B-Instruct"

2. "Rate limit exceeded"

Problem: Hit the ~1,000 requests/day limit for a model.

Solution:

// Switch to a different model
const alternativeModels = [
  "Qwen/Qwen2.5-72B-Instruct",
  "google/gemma-3-27b-it",
  "meta-llama/Llama-3.3-70B-Instruct",
];

// Or use multi-provider fallback
const ai = new NeuroLink({
  providers: [
    { name: "huggingface", priority: 1 },
    { name: "google-ai", priority: 2 }, // Fallback
  ],
});

3. "Invalid API token"

Problem: Token is incorrect or expired.

Solution:

  1. Verify token at https://huggingface.co/settings/tokens
  2. Ensure token has "Read" permissions
  3. Check for typos in .env file
  4. Token should start with hf_
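
A quick shape check catches the most common typos before any request is made. This is format-only validation (a hypothetical helper; it does not verify the token against the API):

```javascript
// Format-only sanity check for a Hugging Face access token (illustrative).
// A passing check does not prove the token is valid server-side.
function looksLikeHfToken(token) {
  return typeof token === "string" && /^hf_[A-Za-z0-9]+$/.test(token.trim());
}
```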

4. "Model not found"

Problem: Model name is incorrect or private.

Solution:

# Verify model exists at huggingface.co
# Use exact model ID: username/model-name
npx @juspay/neurolink gen "test" \
  --provider huggingface \
  --model "Qwen/Qwen2.5-72B-Instruct" # ✅ Correct format

5. Slow Response Times

Problem: Model is loading or under high load.

Solution:

  • Use popular models (always loaded)
  • Add timeout handling
  • Consider caching results
  • Use streaming for long responses

const result = await ai.generate({
  input: { text: "Your prompt" },
  provider: "huggingface",
  timeout: 120000, // 2-minute timeout
});

Best Practices

1. Model Selection

// ✅ Good: Use appropriate model for task
const code = await ai.generate({
  input: { text: "Write a function" },
  model: "Qwen/Qwen2.5-Coder-32B-Instruct", // Code specialist
});

// ❌ Avoid: Using a task-specific model for unrelated tasks
const badCode = await ai.generate({
  input: { text: "Write a function" },
  model: "facebook/bart-large-cnn", // Summarization model, not for code
});

2. Rate Limit Management

// ✅ Good: Rotate between models to spread load across per-model limits
const models = [
  "Qwen/Qwen2.5-72B-Instruct",
  "google/gemma-3-27b-it",
  "meta-llama/Llama-3.3-70B-Instruct",
];

let requestCount = 0;

// Round-robin: each request uses the next model in the list
const result = await ai.generate({
  input: { text: prompt },
  provider: "huggingface",
  model: models[requestCount++ % models.length],
});

3. Error Handling

// ✅ Good: Handle model loading gracefully
async function generateWithRetry(prompt, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await ai.generate({
        input: { text: prompt },
        provider: "huggingface",
      });
    } catch (error) {
      if (error.message.includes("loading") && i < maxRetries - 1) {
        console.log("Model loading, waiting 30s...");
        await new Promise((resolve) => setTimeout(resolve, 30000));
      } else {
        throw error;
      }
    }
  }
}

4. Production Deployment

// ✅ Good: Use Hugging Face with fallback
const ai = new NeuroLink({
  providers: [
    {
      name: "huggingface",
      priority: 1,
      config: {
        defaultModel: "Qwen/Qwen2.5-72B-Instruct",
      },
    },
    {
      name: "google-ai", // Free tier fallback
      priority: 2,
    },
    {
      name: "anthropic", // Paid fallback for critical workloads
      priority: 3,
    },
  ],
});

Performance Optimization

1. Model Warm-Up

// Keep popular models warm with periodic requests
setInterval(async () => {
  await ai.generate({
    input: { text: "ping" },
    provider: "huggingface",
    model: "Qwen/Qwen2.5-72B-Instruct",
    maxTokens: 1,
  });
}, 300000); // Every 5 minutes

2. Caching

// Cache responses for repeated queries
const cache = new Map();

async function cachedGenerate(prompt) {
  if (cache.has(prompt)) {
    return cache.get(prompt);
  }

  const result = await ai.generate({
    input: { text: prompt },
    provider: "huggingface",
  });

  cache.set(prompt, result);
  return result;
}
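
One caveat: the Map above never evicts entries, so a long-running process will grow it without bound and serve stale responses forever. A small TTL wrapper (an illustrative sketch, not a NeuroLink API) keeps entries fresh and memory bounded:

```javascript
// Cache with per-entry time-to-live (illustrative sketch).
function createTtlCache(ttlMs) {
  const store = new Map();
  return {
    get(key) {
      const entry = store.get(key);
      if (!entry) return undefined;
      if (Date.now() - entry.at > ttlMs) {
        store.delete(key); // entry expired: evict and report a miss
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      store.set(key, { value, at: Date.now() });
    },
  };
}
```

Swap the bare Map in cachedGenerate for `createTtlCache(60 * 60 * 1000)` to cap entries at one hour.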

3. Parallel Requests

// Use different models in parallel to avoid rate limits
const prompts = ["prompt1", "prompt2", "prompt3"];
const models = [
  "Qwen/Qwen2.5-72B-Instruct",
  "google/gemma-3-27b-it",
  "meta-llama/Llama-3.3-70B-Instruct",
];

const results = await Promise.all(
  prompts.map((prompt, i) =>
    ai.generate({
      input: { text: prompt },
      provider: "huggingface",
      model: models[i],
    }),
  ),
);
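
Note that Promise.all rejects as soon as any single request fails, discarding the others. If partial results are acceptable, Promise.allSettled lets every request finish; settleAll below is a hypothetical wrapper you would feed the ai.generate calls from the example above:

```javascript
// Run tasks concurrently and keep only the successful results (illustrative sketch).
async function settleAll(tasks) {
  const settled = await Promise.allSettled(tasks);
  return {
    // Results from requests that resolved
    succeeded: settled
      .filter((r) => r.status === "fulfilled")
      .map((r) => r.value),
    // Errors from requests that rejected (e.g. a rate-limited model)
    failed: settled.filter((r) => r.status === "rejected").map((r) => r.reason),
  };
}
```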



Need Help? Join our GitHub Discussions or open an issue.