0 dependencies · 3.44s to load 7.8M rows · SharedArrayBuffer (SAB) powered · MIT License

How it works

octopus-data reads your entire dataset into a SharedArrayBuffer — one contiguous block of RAM. Worker Threads receive a reference to that buffer, not a copy. Each worker processes its own partition in parallel, writing results back to shared memory.

Numeric columns are stored as Float64Array views over the shared buffer. This means aggregations, filters, and feature engineering operate directly on raw memory — no object allocation, no garbage collection pressure.
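The zero-copy sharing both paragraphs describe can be demonstrated with plain JavaScript builtins; this sketch is independent of the library itself:

```javascript
// One contiguous block of shared memory, sized for 4 doubles.
const sab = new SharedArrayBuffer(4 * Float64Array.BYTES_PER_ELEMENT);

// Two Float64Array views over the SAME memory. Constructing a view copies
// nothing; this is what each Worker Thread receives: a reference, not the data.
const mainView = new Float64Array(sab);
const workerView = new Float64Array(sab);

mainView[0] = 42.5;          // a write through one view...
console.log(workerView[0]);  // ...is immediately visible through the other: 42.5
```

Because both views are raw doubles over the same bytes, aggregating over them allocates no objects and creates no garbage-collection pressure.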

octopus-data uses only Node.js core modules: fs, worker_threads, os, path. No npm install required beyond the package itself.
Core classes

Octopus — Static entry point. Use Octopus.read() to load CSV or JSON files into a DataFrame.

DataFrame — The main data structure. Holds columnar data and exposes all transformation, aggregation, and export methods.

Column — A linked reference to a single column that enables chained arithmetic ops.

Series — The internal columnar primitive. Each column inside a DataFrame is a Series backed by a TypedArray.

Install
npm i octopus-data
Import
ESM (recommended)
import { Octopus }   from 'octopus-data'
import { DataFrame } from 'octopus-data'
import { Series, Column } from 'octopus-data'
octopus-data is ESM only. Make sure your package.json has "type": "module" or use .mjs extensions.
Node.js version

octopus-data targets Node.js 16.4+. On the server side, SharedArrayBuffer needs no extra configuration; only browser contexts require cross-origin isolation headers (COOP/COEP) before SharedArrayBuffer is available.

complete example
import { Octopus } from 'octopus-data'

// 1. Load CSV — uses SharedArrayBuffer + Worker Threads
const df = await Octopus.read('trips.csv')
console.log(df.info())
// { rowCount: 7832546, columnCount: 19, memoryUsage: '...' }

// 2. Select only the columns you need (frees RAM)
const slim = df.select([
  'fare_amount', 'tip_amount', 'trip_distance',
  'passenger_count', 'tpep_pickup_datetime'
])

// 3. Clean — remove invalid rows
const clean = slim.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)

// 4. Feature engineering — vectorized over TypedArrays
clean.with_columns([
  { name: 'tip_pct', inputs: ['tip_amount', 'fare_amount'], formula: (tip, fare) => (tip / fare) * 100 },
  { name: 'revenue_per_mile', inputs: ['fare_amount', 'trip_distance'], formula: (fare, dist) => fare / dist }
])

// 5. Extract hour from timestamp
clean.col('tpep_pickup_datetime').to_datetime().extract_hour(0)

// 6. GroupBy — best hour by tip percentage
const byHour = clean.groupBy('tpep_pickup_datetime_hour', { tip_pct: 'mean' }).sort('tip_pct', false)
byHour.show(5)

// 7. Export
await clean.toCSV('output/clean_trips.csv')
static async Octopus.read(filePath: string, options?: ReadOptions) → Promise<DataFrame>
static

Detects file format by extension and routes to the appropriate engine. .csv uses the Nitro engine (SharedArrayBuffer + Workers). .json uses the JSON engine with auto flattening.

Parameter               | Type            | Default          | Description
filePath                | string          | —                | Path to the file
options.workers         | number          | os.cpus().length | Number of Worker Threads
options.indexerCapacity | number          | 10_000_000       | Max rows for column buffers
options.useOffsets      | boolean         | true             | Store byte offsets for string columns
options.type            | 'csv' or 'json' | auto             | Force a specific format
examples
const df = await Octopus.read('data.csv')
const df = await Octopus.read('data.json')
const df = await Octopus.read('data.csv', { workers: 4 })
The CSV engine reads the entire file into a single SharedArrayBuffer, then each Worker Thread receives a reference — not a copy. No data is duplicated in RAM.
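The extension-based routing can be sketched in a few lines. pickEngine and the returned engine names are illustrative stand-ins, not the library's internals:

```javascript
// Hypothetical helper mirroring the documented routing rules:
// .csv goes to the Nitro engine, .json to the JSON engine, and
// options.type forces a format regardless of extension.
function pickEngine(filePath, forcedType) {
  const ext = filePath.slice(filePath.lastIndexOf('.') + 1).toLowerCase();
  const type = forcedType ?? ext;
  if (type === 'csv') return 'nitro';
  if (type === 'json') return 'json';
  throw new Error(`Unsupported format: .${type}`);
}

pickEngine('trips.csv');        // 'nitro'
pickEngine('nobel.json');       // 'json'
pickEngine('data.txt', 'csv');  // 'nitro' (forced via options.type)
```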
JSON Engine
static async Octopus._readJSON(filePath: string) → Promise<DataFrame>
static

Automatically detects the root array (e.g. "prizes" in Nobel dataset). Recursively flattens nested objects into columns. Expands nested arrays into multiple rows, inheriting parent fields.

nested json example
// Input: { prizes: [{ year: "2023", laureates: [{id,name}] }] }
const df = await Octopus.read('nobel.json')
// Each laureate becomes its own row, with year inherited
// Columns: year, laureates_id, laureates_firstname, ...
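The flattening rules can be sketched with a small recursive function. This is a simplified illustration (it handles one array field per object), not the engine's actual code:

```javascript
// Nested objects become prefixed columns; nested arrays expand into one
// row per element, inheriting the parent's scalar fields.
function flatten(rows) {
  const out = [];
  for (const row of rows) {
    const scalars = {};
    let arrayField = null;
    for (const [key, val] of Object.entries(row)) {
      if (Array.isArray(val)) arrayField = [key, val];
      else if (val && typeof val === 'object') {
        for (const [k, v] of Object.entries(val)) scalars[`${key}_${k}`] = v;
      } else scalars[key] = val;
    }
    if (arrayField) {
      const [key, items] = arrayField;
      for (const item of flatten(items)) {
        const child = {};
        for (const [k, v] of Object.entries(item)) child[`${key}_${k}`] = v;
        out.push({ ...scalars, ...child });
      }
    } else out.push(scalars);
  }
  return out;
}

const rows = flatten([{ year: '2023', laureates: [{ id: 1 }, { id: 2 }] }]);
// Each laureate becomes its own row, inheriting year:
// [{ year: '2023', laureates_id: 1 }, { year: '2023', laureates_id: 2 }]
```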
new DataFrame(config: DataFrameConfig)
instance

Creates a new DataFrame. You typically get DataFrames from Octopus.read(), but you can construct one manually.

Property | Type                       | Description
columns  | Record<string, TypedArray> | Column data — use Float64Array for numerics
rowCount | number                     | Total number of rows
headers  | string[]                   | Column names in order
static DataFrame.fromObjects(data: object[]) → DataFrame
static

Converts a plain JS array of objects into a DataFrame. Numeric values are stored as Float64Array automatically.

example
const df = DataFrame.fromObjects([
  { name: 'Alice', score: 95 },
  { name: 'Bob',   score: 82 },
])
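The automatic numeric storage can be sketched like this; toColumns is an illustrative name, not the library's internal function:

```javascript
// Columns whose values are all numbers become Float64Array;
// everything else stays a plain JS array.
function toColumns(rows) {
  const columns = {};
  for (const name of Object.keys(rows[0])) {
    const values = rows.map(r => r[name]);
    columns[name] = values.every(v => typeof v === 'number')
      ? Float64Array.from(values)
      : values;
  }
  return columns;
}

const cols = toColumns([
  { name: 'Alice', score: 95 },
  { name: 'Bob',   score: 82 },
]);
// cols.score is a Float64Array; cols.name stays a plain array
```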
show(n?: number = 5) → void
instance

Prints the first n rows as a console.table. String values longer than 20 characters are truncated with an ellipsis (...).

info() → DataFrameInfo
instance

Returns a summary with row count, column count, column names, and estimated memory usage.

example
const info = df.info()
// { rowCount: 7832546, columnCount: 19, columns: [...], memoryUsage: '1139.45 MB' }
describe() → void
instance

Prints descriptive statistics for all numeric columns: count, mean, min, 25%, 50%, 75%, max. Non-numeric columns are skipped.

with_columns(specs: ColSpec[]) → DataFrame
instance

Vectorized feature engineering. Applies formulas row-by-row using direct TypedArray access. Optimized fast paths for 1, 2, and 4 inputs. Returns this for chaining.

ColSpec property | Type     | Description
name             | string   | Name of the new column to create
inputs           | string[] | Column names fed into the formula
formula          | Function | (...values: number[]) => number
example
df.with_columns([
  { name: 'revenue_per_mile', inputs: ['total_amount', 'trip_distance'], formula: (amount, dist) => dist > 0 ? amount / dist : 0 },
  { name: 'speed_mph', inputs: ['trip_distance', 'duration_hours'], formula: (dist, dur) => dur > 0 ? dist / dur : 0 }
])
with_columns mutates in-place and returns this. New columns are stored as Float64Array.
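What the two-input fast path amounts to can be sketched as a tight loop over raw TypedArrays; the names here are illustrative:

```javascript
// Read both input TypedArrays directly, write the result into a fresh
// Float64Array: no per-row objects, no intermediate arrays.
function applyFormula2(a, b, formula) {
  const out = new Float64Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = formula(a[i], b[i]);
  return out;
}

const tip  = Float64Array.of(2, 3);
const fare = Float64Array.of(10, 20);
const tipPct = applyFormula2(tip, fare, (t, f) => (t / f) * 100);
// tipPct ≈ [20, 15]
```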
select(columnNames: string[]) → DataFrame
instance

Returns a new DataFrame with only the specified columns. Essential for freeing RAM — drop unused columns as early as possible.

const slim = df.select(['fare_amount', 'tip_amount', 'trip_distance'])
rename(mapping: Record<string, string>) → DataFrame
instance

Renames columns without copying data. Returns a new DataFrame with updated headers.

const df2 = df.rename({ tpep_pickup_datetime: 'pickup', PULocationID: 'zone' })
cast(columnName: string, type: 'float' | 'int' | 'string') → DataFrame
instance

Forces a type conversion on a column. 'float' and 'int' both produce Float64Array. 'string' produces a regular JS Array. Mutates in-place.

cumsum(columnName: string) → DataFrame
instance

Computes a running cumulative sum over a column. Creates a new column named {columnName}_cumsum. Mutates in-place.
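The running sum itself is a one-pass loop over the TypedArray; a sketch of the technique (not the library's source):

```javascript
// Each output element is the sum of all input elements up to that index.
function cumsum(col) {
  const out = new Float64Array(col.length);
  let running = 0;
  for (let i = 0; i < col.length; i++) out[i] = (running += col[i]);
  return out;
}

const result = cumsum(Float64Array.of(1, 2, 3, 4));
// result → [1, 3, 6, 10]
```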

with_label(specs: LabelSpec[]) → DataFrame
instance

Applies a StringIndexer to encode a string column as numeric IDs. Creates a new column named {input}_indexed and stores the indexer in metadata.indexers for later decoding.
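A sketch of the string-indexing idea behind StringIndexer (illustrative names, not the library's source):

```javascript
// Each distinct string gets the next numeric ID; the labels table
// allows decoding the IDs back to strings later.
function indexColumn(values) {
  const ids = new Float64Array(values.length);
  const lookup = new Map();
  const labels = [];
  values.forEach((v, i) => {
    if (!lookup.has(v)) { lookup.set(v, labels.length); labels.push(v); }
    ids[i] = lookup.get(v);
  });
  return { ids, labels }; // labels[ids[i]] decodes back to the string
}

const { ids, labels } = indexColumn(['cash', 'card', 'cash']);
// ids → [0, 1, 0], labels → ['cash', 'card']
```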

filter(inputs: string[], predicate: Function) → DataFrame
instance

Returns a new DataFrame containing only rows where the predicate returns true. The predicate receives the values of the listed columns for each row.

example
const valid = df.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)
head(n?: number = 5) → DataFrame
instance

Returns a new DataFrame with the first n rows.

tail(n?: number = 5) → DataFrame
instance

Returns a new DataFrame with the last n rows.

dropNA() → DataFrame
instance

Removes all rows containing null, undefined, or NaN in any column.
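A sketch of the underlying mask filtering, assuming columnar storage as described above (illustrative, not the library's source):

```javascript
// Collect the indices of rows with no null/undefined/NaN in any column,
// then rebuild each column by gathering only those indices.
function dropNA(columns, rowCount) {
  const keep = [];
  for (let i = 0; i < rowCount; i++) {
    let ok = true;
    for (const col of Object.values(columns)) {
      const v = col[i];
      if (v == null || (typeof v === 'number' && Number.isNaN(v))) { ok = false; break; }
    }
    if (ok) keep.push(i);
  }
  const out = {};
  for (const [name, col] of Object.entries(columns)) {
    out[name] = ArrayBuffer.isView(col)
      ? col.constructor.from(keep, i => col[i])  // typed column
      : keep.map(i => col[i]);                   // string column
  }
  return out;
}

const cleaned = dropNA({ a: Float64Array.of(1, NaN, 3), b: ['x', 'y', 'z'] }, 3);
// cleaned.a → [1, 3], cleaned.b → ['x', 'z']
```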

fillNA(value: number | string) → DataFrame
instance

Replaces all null, undefined, or NaN values in all columns with the specified value. Mutates in-place.

str_contains(columnName: string, pattern: string) → DataFrame
instance

Filters rows where the string column matches the regex pattern. Case-insensitive.

example
const result = df.str_contains('product_name', 'wireless')
groupBy(groupCol: string, aggs: AggSpec) → DataFrame
instance

Groups rows by a column and applies aggregation functions. Supports sum, mean, count, max, and min. For each target column, pass a single op as a string or several ops as an array — when multiple ops are given, output columns are named {col}_{op}.

single op
const byHour = df.groupBy('hour', { tip_pct: 'mean', fare_amount: 'sum' })
multiple ops
const byZone = df.groupBy('zone_id', {
  fare_amount: ['sum', 'mean', 'count'],
  tip_amount:  ['sum', 'max']
})
// Columns: fare_amount_sum, fare_amount_mean, fare_amount_count, ...
groupByRange(colName: string, targetCol: string, maxRange: number) → Result[]
instance

O(n) groupBy for bounded integer keys. Uses Uint32Array as a direct lookup — no Map, no hashing, no allocations. Returns array of { group, avg } sorted descending.

This is the fastest aggregation in octopus-data. Use it when group keys are bounded integers (e.g. zone IDs, hour 0–23).
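A sketch of the bounded-key technique: sums and counts live in flat typed arrays indexed directly by the group key, so each row costs two array writes and no hashing (illustrative, not the library's source):

```javascript
// keys must be integers in [0, maxRange]; values are the target column.
function groupByRangeSketch(keys, values, maxRange) {
  const sums = new Float64Array(maxRange + 1);
  const counts = new Uint32Array(maxRange + 1);
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i];   // the key IS the bucket index
    counts[keys[i]]++;
  }
  const out = [];
  for (let g = 0; g <= maxRange; g++) {
    if (counts[g]) out.push({ group: g, avg: sums[g] / counts[g] });
  }
  return out.sort((a, b) => b.avg - a.avg); // descending by average
}

const byHour = groupByRangeSketch([2, 2, 5], [10, 20, 7], 23);
// [{ group: 2, avg: 15 }, { group: 5, avg: 7 }]
```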
groupByID(colName: string, targetCol: string) → Result[]
instance

Alias for groupByRange(colName, targetCol, 300). Preset for NYC Taxi zone IDs.

sort(columnName: string, ascending?: boolean = true) → DataFrame
instance

Index sort — builds an index array, sorts by target column values, then reorders all columns in one pass. Returns a new DataFrame.

example — top 10 most profitable trips
const top10 = df.sort('revenue_per_mile', false).head(10)
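A sketch of the index-sort technique: sort an array of row indices by the target column, then gather every column through that index in one pass (illustrative, not the library's source):

```javascript
function indexSort(columns, by, ascending = true) {
  const target = columns[by];
  const n = target.length;
  // Build the index [0, 1, ..., n-1] and sort it by the target column.
  const idx = Uint32Array.from({ length: n }, (_, i) => i);
  idx.sort((a, b) => ascending ? target[a] - target[b] : target[b] - target[a]);
  // Reorder every column through the sorted index once.
  const out = {};
  for (const [name, col] of Object.entries(columns)) {
    const reordered = new col.constructor(n);
    for (let i = 0; i < n; i++) reordered[i] = col[idx[i]];
    out[name] = reordered;
  }
  return out;
}

const sorted = indexSort(
  { fare: Float64Array.of(3, 1, 2), id: Float64Array.of(0, 1, 2) },
  'fare', false
);
// sorted.fare → [3, 2, 1]; sorted.id → [0, 2, 1] (rows stay aligned)
```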
Scalar Aggregations
Method    | Returns | Description
sum(col)  | number  | Sum of all values in a column
mean(col) | number  | Arithmetic mean — returns 0 if rowCount is 0
max(col)  | number  | Maximum value
min(col)  | number  | Minimum value
example
df.sum('fare_amount')   // 48291043.21
df.mean('tip_pct')     // 14.82
df.max('trip_distance') // 189.4
unique(columnName: string) → any[]
instance

Returns an array of unique values in a column using a Set.

nunique(columnName: string) → number
instance

Returns the count of unique values. Faster than unique().length.

value_counts(columnName: string) → { value, count }[]
instance

Returns a frequency table sorted from most to least common.

example
const freq = df.value_counts('payment_type')
// [{ value: 1, count: 5821034 }, { value: 2, count: 1823456 }]
join(other: DataFrame, on: string, how?: 'inner' | 'left' = 'inner') → DataFrame
instance

Hash join on a common column. inner returns only matching rows. left keeps all left rows, filling unmatched right columns with null.

example
const enriched = trips.join(zones, 'zone_id', 'left')
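A sketch of the hash-join strategy over plain row objects; the library operates on columns, but the idea is the same (illustrative, not the library's source):

```javascript
// Build a Map over the right side's key column, then probe it once
// per left row: O(left + right) instead of O(left * right).
function hashJoin(left, right, on, how = 'inner') {
  const index = new Map(right.map(r => [r[on], r]));
  const out = [];
  for (const row of left) {
    const match = index.get(row[on]);
    if (match !== undefined) out.push({ ...match, ...row });
    else if (how === 'left') out.push({ ...row }); // real API fills right cols with null
  }
  return out;
}

const trips = [{ zone_id: 1, fare: 10 }, { zone_id: 9, fare: 5 }];
const zones = [{ zone_id: 1, name: 'JFK' }];
const inner = hashJoin(trips, zones, 'zone_id');          // 1 matching row
const leftJ = hashJoin(trips, zones, 'zone_id', 'left');  // all 2 left rows kept
```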
toCSV(outputPath: string, options?: object) → Promise<void>
instance

Exports the DataFrame to a CSV file via streaming write. Floats are written with 4 decimal places. Validates the .csv extension.

example
await df.toCSV('output/results.csv')
toJSON(outputPath: string) → Promise<void>
instance

Exports to a JSON file. Validates .json extension.

toTXT(outputPath: string) → Promise<void>
instance

Exports to a plain text file. Validates .txt extension.

toArray() → object[]
instance

Converts the DataFrame back to a plain JS array of row objects. Useful for interoperability.

All export methods validate the file extension and throw if it doesn't match. Pass the path with the extension explicitly.
col(name: string) → Column
instance

Returns a Column instance linked to the underlying TypedArray. Enables chained arithmetic that mutates the column in-place.

example — timestamp parsing
df.col('tpep_pickup_datetime')
  .to_datetime()
  .extract_hour(0)
// Creates new column: tpep_pickup_datetime_hour
example — arithmetic between columns
df.col('total_amount')
  .sub(df.col('tip_amount'))
  .div(1.08)
Column methods
Method                      | Accepts          | Description
add(value)                  | number or Column | Addition in-place
sub(value)                  | number or Column | Subtraction in-place
mul(value)                  | number or Column | Multiplication in-place
div(value)                  | number or Column | Division in-place — guards against divide by zero
to_datetime()               | —                | Converts string timestamps to ms since epoch (Date.getTime)
extract_hour(offsetSeconds) | number           | Extracts hour 0–23 from ms timestamp. Creates {name}_hour column
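A sketch of how chained in-place arithmetic works; Col here is an illustrative stand-in for the library's Column class, not its actual implementation:

```javascript
class Col {
  constructor(data) { this.data = data; }
  // Mutates the underlying Float64Array and returns this for chaining.
  sub(other) {
    const o = other instanceof Col ? other.data : null;
    for (let i = 0; i < this.data.length; i++) this.data[i] -= o ? o[i] : other;
    return this;
  }
  div(divisor) {
    if (divisor === 0) return this; // guard against divide by zero
    for (let i = 0; i < this.data.length; i++) this.data[i] /= divisor;
    return this;
  }
}

const total = new Col(Float64Array.of(11.08, 22.16));
total.sub(new Col(Float64Array.of(0.28, 0.56))).div(1.08);
// total.data ≈ [10, 20] — both ops wrote into the same buffer
```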
Series
new Series(name, data: TypedArray, type: string, indexer?, mask?)
instance

The internal columnar primitive. Each column inside a DataFrame is backed by a Series. Use Series.fromRawBuffer() to reconstruct from raw buffer data. The optional indexer enables transparent numeric-ID to string translation via .get(index).

Series is the internal primitive — most users work with DataFrame and Column methods directly.
Method                 | Returns | Description
get(index)             | any     | Value at index — decodes via indexer if present
slice(start, end)      | Series  | Returns a slice preserving the indexer reference
Series.fromRawBuffer() | Series  | Static factory from raw buffer + metadata
Series.formatResults() | object  | Formats aggregation results with a .show() method