warc

package module
v0.8.96 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 15, 2025 License: CC0-1.0 Imports: 38 Imported by: 0

README

warc

GoDoc Go Report Card

A Go library for reading and writing WARC files, with advanced features for web archiving.

Features

  • Read and write WARC files with support for multiple compression formats (GZIP, ZSTD)
  • HTTP client with built-in WARC recording capabilities
  • Content deduplication (local URL-agnostic and CDX-based)
  • Configurable file rotation and size limits
  • DNS caching and custom DNS resolution (with DNS archiving)
  • Support for socks5 proxies and custom TLS configurations
  • Random local IP assignment for distributed crawling (including Linux kernel AnyIP feature)
  • Smart memory management with disk spooling options
  • IPv4/IPv6 support with configurable preferences

Installation

go get github.com/internetarchive/gowarc

Usage

This library's biggest feature is to provide a standard HTTP client through which you can execute requests that will be recorded automatically to WARC files. It's the basis of Zeno.

HTTP Client with WARC Recording
package main

import (
	"context"
	"fmt"
	"github.com/internetarchive/gowarc"
	"io"
	"net/http"
	"time"
)

func main() {
	// Configure WARC settings
	rotatorSettings := &warc.RotatorSettings{
		WarcinfoContent: warc.Header{
			"software": "My WARC writing client v1.0",
		},
		Prefix:             "WEB",
		Compression:        "gzip",
		WARCWriterPoolSize: 4, // Records will be written to 4 WARC files in parallel, it helps maximize the disk IO on some hardware. To be noted, even if we have multiple WARC writers, WARCs are ALWAYS written by pair in the same file. (req/resp pair)
	}

	// Configure HTTP client settings
	clientSettings := warc.HTTPClientSettings{
		RotatorSettings: rotatorSettings,
		Proxy:           "socks5://proxy.example.com:1080",
		TempDir:         "./temp",
		DNSServers:      []string{"8.8.8.8", "8.8.4.4"},
		DedupeOptions: warc.DedupeOptions{
			LocalDedupe:   true,
			CDXDedupe:     false,
			SizeThreshold: 2048, // Only payloads above that threshold will be deduped
		},
		DialTimeout:           10 * time.Second,
		ResponseHeaderTimeout: 30 * time.Second,
		DNSResolutionTimeout:  5 * time.Second,
		DNSRecordsTTL:         5 * time.Minute,
		DNSCacheSize:          10000,
		MaxReadBeforeTruncate: 1000000000,
		DecompressBody:        true,
		FollowRedirects:       true,
		VerifyCerts:           true,
		RandomLocalIP:         true,
	}

	// Create HTTP client
	client, err := warc.NewWARCWritingHTTPClient(clientSettings)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// The error channel NEED to be consumed, else it will block the
	// execution of the WARC module
	go func() {
		for err := range client.ErrChan {
			fmt.Errorf("WARC writer error: %s", err.Err.Error())
		}
	}()

	// This is optional but the module give a feedback on a channel passed as context value to the
	// request, this helps knowing when the record has been written to disk. If this is not used, the WARC
	// writing is asynchronous
	req, err := http.NewRequest("GET", "https://archive.org", nil)
	if err != nil {
		panic(err)
	}

	feedbackChan := make(chan struct{}, 1)
	req = req.WithContext(warc.WithFeedbackChannel(req.Context(), feedbackChan))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Process response
	// Note: the body NEED to be consumed to be written to the WARC file.
	io.Copy(io.Discard, resp.Body)

	// Will block until records are actually written to the WARC file
	<-feedbackChan
}
DNS Resolution and Proxy Behavior

The library handles DNS resolution differently depending on the connection type:

Direct Connections (No Proxy)
  • DNS is resolved locally using configured DNS servers
  • DNS queries and responses are archived in WARC files as resource records
  • Resolved IP addresses are cached with configurable TTL
Local DNS Proxies (socks5://, socks4://)
  • DNS is resolved locally by gowarc
  • DNS records are archived to WARC files
  • Resolved IP addresses are sent to the proxy
  • Only one DNS query is made (no duplicate resolution)
Remote DNS Proxies (socks5h://, socks4a://, http://, https://)
  • DNS archiving is skipped to prevent privacy leaks
  • Hostnames are sent directly to the proxy
  • The proxy handles DNS resolution on its end
  • Trade-offs:
    • Privacy: No local DNS queries that could expose browsing activity
    • Accuracy: WARC reflects the actual connection (no potential DNS mismatch)
    • ⚠️ No DNS WARC records: DNS information is not archived for these connections

Important for Privacy: When using socks5h:// or other remote DNS proxies, your local DNS servers will not see any queries for the target domains, maintaining better privacy and anonymity.

CLI Tools

In addition to the Go library, gowarc provides several command-line utilities for working with WARC files:

Installation

Pre-built releases are available on the GitHub releases page.

# Install from source
go install github.com/internetarchive/gowarc/cmd/warc@latest

# Or build locally
cd cmd/warc/
go build -o warc
Available Commands
warc extract

Extract files and content from WARC archives with filtering options.

# Extract all files from WARC archives
warc extract file1.warc.gz file2.warc.gz

# Extract only specific content types
warc extract --content-type "text/html" --content-type "image/jpeg" archive.warc.gz

# Extract to specific directory with multiple threads  
warc extract --output ./extracted --threads 4 *.warc.gz

# Sort extracted files by host
warc extract --host-sort archive.warc.gz
warc mend

Repair and close incomplete gzip-compressed WARC files that were left with .open suffix during crawling.

# Dry run to see what would be fixed
warc mend --dry-run *.warc.gz.open

# Fix files with confirmation prompts  
warc mend corrupted.warc.gz.open

# Auto-fix without prompts
warc mend --yes *.warc.gz.open

# Force verification of any gzip WARC files (not just .open)
warc mend --force --dry-run archive.warc.gz

Features:

  • By default, only processes .open files; use --force to verify any gzip WARC files
  • Verifies gzip format using magic bytes, not just file extension
  • Detects and removes trailing garbage bytes
  • Truncates at corruption points while preserving maximum valid data
  • Removes .open suffix to "close" files when present
  • Provides comprehensive statistics on repairs performed
  • Memory-efficient streaming for large files

See cmd/warc/mend/README.md for detailed documentation.

warc verify

Validate the integrity and structure of WARC files.

# Verify single file
warc verify archive.warc.gz

# Verify multiple files with progress
warc verify -v *.warc.gz

# JSON output for automation
warc verify --json archive.warc.gz
warc completion

Generate shell completion scripts for bash, zsh, fish, or PowerShell.

# Bash completion
warc completion bash > /etc/bash_completion.d/warc

# Zsh completion
warc completion zsh > ~/.zsh/completions/_warc

# Fish completion
warc completion fish > ~/.config/fish/completions/warc.fish

# PowerShell completion
warc completion powershell > warc.ps1
Global Flags

All commands support these global options:

  • -v, --verbose - Enable verbose/debug logging
  • --json - Output logs in JSON format for structured processing
  • -h, --help - Show help for any command

Build tags

  • standard_gzip: Use the standard library gzip implementation instead of the faster one from klauspost
  • klauspost_gzip: Use the faster gzip implementation from klauspost (default, you don't need to specify it)

License

This module is released under CC0 license. You can find a copy of the CC0 License in the LICENSE file.

Documentation

Index

Constants

View Source
const (
	// ContextKeyFeedback is the context key for the feedback channel.
	// When provided, the channel will receive a signal once the WARC record
	// has been written to disk, making WARC writing synchronous.
	// Use WithFeedbackChannel() helper function for convenience.
	ContextKeyFeedback contextKey = "feedback"

	// ContextKeyWrappedConn is the context key for the wrapped connection channel.
	// This is used internally to retrieve the wrapped connection for advanced use cases.
	// Use WithWrappedConnection() helper function for convenience.
	ContextKeyWrappedConn contextKey = "wrappedConn"
)

Variables

View Source
var (
	IPv6 *availableIPs
	IPv4 *availableIPs
)
View Source
var (
	// Create a couple of counters for tracking various stats
	DataTotal atomic.Int64

	CDXDedupeTotalBytes          atomic.Int64
	DoppelgangerDedupeTotalBytes atomic.Int64
	LocalDedupeTotalBytes        atomic.Int64

	CDXDedupeTotal          atomic.Int64
	DoppelgangerDedupeTotal atomic.Int64
	LocalDedupeTotal        atomic.Int64
)
View Source
var DedupeHTTPClient = http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		Dial: (&net.Dialer{
			Timeout: 5 * time.Second,
		}).Dial,
		TLSHandshakeTimeout: 5 * time.Second,
	},
}

TODO: Add stats on how long dedupe HTTP requests take

View Source
var ErrUnknownDigestAlgorithm = errors.New("unknown digest algorithm")

Functions

func GetDigest added in v0.8.86

func GetDigest(r io.Reader, digestAlgorithm DigestAlgorithm) (string, error)

func GetNextIP

func GetNextIP(availableIPs *availableIPs) net.IP

func IsDigestSupported added in v0.8.86

func IsDigestSupported(algorithm string) bool

func WithFeedbackChannel added in v0.8.92

func WithFeedbackChannel(ctx context.Context, feedbackChan chan struct{}) context.Context

WithFeedbackChannel adds a feedback channel to the request context. When provided, the channel will receive a signal once the WARC record has been written to disk, making WARC writing synchronous. Without this, WARC writing is asynchronous.

Example:

feedbackChan := make(chan struct{}, 1)
req = req.WithContext(warc.WithFeedbackChannel(req.Context(), feedbackChan))
// ... perform request ...
<-feedbackChan // blocks until WARC is written

func WithWrappedConnection added in v0.8.92

func WithWrappedConnection(ctx context.Context, wrappedConnChan chan *CustomConnection) context.Context

WithWrappedConnection adds a wrapped connection channel to the request context. This is used for advanced use cases where direct access to the connection is needed.

Types

type CustomConnection added in v0.8.82

type CustomConnection struct {
	net.Conn
	io.Reader
	io.Writer

	sync.WaitGroup
	// contains filtered or unexported fields
}

func (*CustomConnection) Close added in v0.8.82

func (cc *CustomConnection) Close() error

func (*CustomConnection) CloseWithError added in v0.8.82

func (cc *CustomConnection) CloseWithError(err error) error

func (*CustomConnection) Read added in v0.8.82

func (cc *CustomConnection) Read(b []byte) (int, error)

func (*CustomConnection) Write added in v0.8.82

func (cc *CustomConnection) Write(b []byte) (int, error)

type CustomHTTPClient

type CustomHTTPClient struct {
	WaitGroup *WaitGroupWithCount

	ErrChan    chan *Error
	WARCWriter chan *RecordBatch

	http.Client
	TempDir                string
	WARCWriterDoneChannels []chan bool
	DiscardHook            DiscardHook

	TLSHandshakeTimeout   time.Duration
	ConnReadDeadline      time.Duration
	MaxReadBeforeTruncate int

	FullOnDisk      bool
	DigestAlgorithm DigestAlgorithm

	// MaxRAMUsageFraction is the fraction of system RAM above which we'll force spooling to disk. For example, 0.5 = 50%.
	// If set to <= 0, the default value is DefaultMaxRAMUsageFraction.
	MaxRAMUsageFraction float64

	DataTotal *atomic.Int64

	CDXDedupeTotalBytes          *atomic.Int64
	DoppelgangerDedupeTotalBytes *atomic.Int64
	LocalDedupeTotalBytes        *atomic.Int64

	CDXDedupeTotal          *atomic.Int64
	DoppelgangerDedupeTotal *atomic.Int64
	LocalDedupeTotal        *atomic.Int64
	// contains filtered or unexported fields
}

func NewWARCWritingHTTPClient

func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)

func (*CustomHTTPClient) Close

func (c *CustomHTTPClient) Close() error

func (*CustomHTTPClient) WriteRecord

func (c *CustomHTTPClient) WriteRecord(WARCTargetURI, WARCType, contentType, payloadString string, payloadReader io.Reader)

type DedupeOptions

type DedupeOptions struct {
	CDXURL             string
	DoppelgangerHost   string
	CDXCookie          string
	SizeThreshold      int
	LocalDedupe        bool
	CDXDedupe          bool
	DoppelgangerDedupe bool
}

type DigestAlgorithm added in v0.8.86

type DigestAlgorithm int
const (
	SHA1 DigestAlgorithm = iota
	// According to IIPC, lowercase base 16 is the "popular" encoding for SHA256
	SHA256Base16
	SHA256Base32
	BLAKE3
)

func GetDigestFromPrefix added in v0.8.86

func GetDigestFromPrefix(prefix string) DigestAlgorithm

type DiscardHook

type DiscardHook func(resp *http.Response) (bool, string)

DiscardHook is a hook function that is called for each response. (if set) It can be used to determine if the response should be discarded. Returns:

  • bool: should the response be discarded
  • string: (optional) why the response was discarded or not

type DiscardHookError

type DiscardHookError struct {
	URL    string
	Reason string // reason for discarding
	Err    error  // nil: discarded successfully
}

func (*DiscardHookError) Error

func (e *DiscardHookError) Error() string

func (*DiscardHookError) Unwrap

func (e *DiscardHookError) Unwrap() error

type Error

type Error struct {
	Err  error
	Func string
}

type GzipReaderInterface added in v0.8.90

type GzipReaderInterface interface {
	io.ReadCloser
	Multistream(enable bool)
	Reset(r io.Reader) error
}

GzipReaderInterface defines the interface for gzip readers This allows us to switch between standard gzip and klauspost gzip based on build tags

type GzipWriterInterface added in v0.8.90

type GzipWriterInterface interface {
	io.WriteCloser
	Flush() error
}

GzipWriterInterface defines the interface for gzip writers This allows us to switch between standard gzip and klauspost gzip based on build tags

type HTTPClientSettings

type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	Proxy                 string
	TempDir               string
	DiscardHook           DiscardHook
	DNSServers            []string
	DedupeOptions         DedupeOptions
	DialTimeout           time.Duration
	ResponseHeaderTimeout time.Duration
	DNSResolutionTimeout  time.Duration
	DNSRecordsTTL         time.Duration
	DNSCacheSize          int
	DNSConcurrency        int
	TLSHandshakeTimeout   time.Duration
	ConnReadDeadline      time.Duration
	MaxReadBeforeTruncate int
	DecompressBody        bool
	FollowRedirects       bool
	FullOnDisk            bool
	MaxRAMUsageFraction   float64
	VerifyCerts           bool
	RandomLocalIP         bool
	DisableIPv4           bool
	DisableIPv6           bool
	IPv6AnyIP             bool
	DigestAlgorithm       DigestAlgorithm
}
type Header map[string]string

Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.

func NewHeader

func NewHeader() Header

NewHeader creates a new WARC header.

func (Header) Del

func (h Header) Del(key string)

Del deletes the value associated with key. The key is compared in a case-insensitive manner.

func (Header) Get

func (h Header) Get(key string) string

Get returns the value associated with the given key. The key is compared in a case-insensitive manner.

func (Header) Set

func (h Header) Set(key, value string)

Set sets the header field associated with key to value. The key is stored as-is, preserving its original case.

type ReadOpts added in v0.8.86

type ReadOpts int

ReadOpts are options for ReadRecord

const (
	// ReadOptsNoContentOutput means that the content of the record should not be returned.
	// This is useful for reading only the headers or metadata of the record.
	ReadOptsNoContentOutput ReadOpts = iota
)

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader stores the bufio.Reader and gzip.Reader for a WARC file

func NewReader

func NewReader(reader io.Reader) (*Reader, error)

NewReader returns a new WARC reader

func (*Reader) Close added in v0.8.88

func (r *Reader) Close() error

Close closes the WARC reader src and dec readers if they are open.

func (*Reader) ReadRecord

func (r *Reader) ReadRecord(opts ...ReadOpts) (*Record, error)

ReadRecord reads the next record from the opened WARC file.

Returns:

  • *Record: Record guaranteed to be non-nil if no errors occurred.
  • error: any parsing/IO error encountered (io.EOF for clean EOF).

type Record

type Record struct {
	Header  Header
	Content spooledtempfile.ReadWriteSeekCloser
	Version string // WARC/1.0, WARC/1.1 ...
	Offset  int64  // Offset of the record start (-1 if WARC file type is not supported yet)
	Size    int64  // COMPRESSED size of the record (gzip member): header + deflate data + trailer. (-1 if WARC file type is not supported yet)
}

Record represents a WARC record.

func NewRecord

func NewRecord(tempDir string, fullOnDisk bool) *Record

NewRecord creates a new WARC record.

type RecordBatch

type RecordBatch struct {
	FeedbackChan chan struct{}
	CaptureTime  string
	Records      []*Record
}

RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp. FeedbackChan is used to signal when the records have been written.

func NewRecordBatch

func NewRecordBatch(feedbackChan chan struct{}) *RecordBatch

NewRecordBatch creates a record batch, it also initialize the capture time.

type RotatorSettings

type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string

	// Path to a ZSTD compression dictionary to embed (and use) in .warc.zst files
	CompressionDictionary string
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WARCSize is in Megabytes
	WARCSize float64
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
	// contains filtered or unexported fields
}

RotatorSettings is used to store the settings needed by recordWriter to write WARC files

func NewRotatorSettings

func NewRotatorSettings() *RotatorSettings

NewRotatorSettings creates a RotatorSettings structure and initialize it with default values

func (*RotatorSettings) NewWARCRotator

func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)

NewWARCRotator creates and return a channel that can be used to communicate records to be written to WARC files to the recordWriter function running in a goroutine

type WaitGroupWithCount

type WaitGroupWithCount struct {
	sync.WaitGroup
	// contains filtered or unexported fields
}

func (*WaitGroupWithCount) Add

func (wg *WaitGroupWithCount) Add(delta int)

func (*WaitGroupWithCount) Done

func (wg *WaitGroupWithCount) Done()

func (*WaitGroupWithCount) Size

func (wg *WaitGroupWithCount) Size() int

type Writer

type Writer struct {
	GZIPWriter      GzipWriterInterface
	ZSTDWriter      *zstd.Encoder
	FileWriter      *bufio.Writer
	FileName        string
	Compression     string
	DigestAlgorithm DigestAlgorithm
	ParallelGZIP    bool
}

Writer writes WARC records to WARC files.

func NewWriter

func NewWriter(writer io.Writer, fileName string, digestAlgorithm DigestAlgorithm, compression string, contentLengthHeader string, newFileCreation bool, dictionary []byte) (*Writer, error)

NewWriter creates a new WARC writer.

func (*Writer) CloseCompressedWriter

func (w *Writer) CloseCompressedWriter() (err error)

func (*Writer) WriteInfoRecord

func (w *Writer) WriteInfoRecord(payload map[string]string) (recordID string, err error)

WriteInfoRecord method can be used to write informations record to the WARC file

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(r *Record) (recordID string, err error)

WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header followed by a record content block and two newlines:

Version CLRF
Header-Key: Header-Value CLRF
CLRF
Content
CLRF
CLRF

Directories

Path Synopsis
cmd
warc command
pkg

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL