0 dependencies · 3.44s to load 7.8M rows · SharedArrayBuffer (SAB) powered · MIT License

How it works

octopus-data reads your entire dataset into a SharedArrayBuffer — one contiguous block of RAM. Worker Threads receive a reference to that buffer, not a copy. Each worker processes its own partition in parallel, writing results back to shared memory.

Numeric columns are stored as Float64Array views over the shared buffer. This means aggregations, filters, and feature engineering operate directly on raw memory — no object allocation, no garbage collection pressure.
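The zero-copy sharing both paragraphs describe can be demonstrated with plain JavaScript builtins; this sketch is independent of the library itself:

```javascript
// One contiguous block of shared memory, sized for 4 doubles.
const sab = new SharedArrayBuffer(4 * Float64Array.BYTES_PER_ELEMENT);

// Two Float64Array views over the SAME memory. Constructing a view copies
// nothing; this is what each Worker Thread receives: a reference, not the data.
const mainView = new Float64Array(sab);
const workerView = new Float64Array(sab);

mainView[0] = 42.5;          // a write through one view...
console.log(workerView[0]);  // ...is immediately visible through the other: 42.5
```

Because both views are raw doubles over the same bytes, aggregating over them allocates no objects and creates no garbage-collection pressure.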

octopus-data uses only Node.js core modules: fs, worker_threads, os, path. No npm install required beyond the package itself.
Core classes

Octopus — Static entry point. Use Octopus.read() to load CSV or JSON files into a DataFrame.

DataFrame — The main data structure. Holds columnar data and exposes all transformation, aggregation, and export methods.

Column — A linked reference to a single column that enables chained arithmetic ops.

Series — The internal columnar primitive. Each column inside a DataFrame is a Series backed by a TypedArray.

Install
npm i octopus-data
Import
ESM (recommended)
import { Octopus }   from 'octopus-data'
import { DataFrame } from 'octopus-data'
import { Series, Column } from 'octopus-data'
octopus-data is ESM only. Make sure your package.json has "type": "module" or use .mjs extensions.
Node.js version

octopus-data targets Node.js 16.4+. On the server side, SharedArrayBuffer needs no extra configuration; only browser contexts require cross-origin isolation headers (COOP/COEP) before SharedArrayBuffer is available.

complete example
import { Octopus } from 'octopus-data'

// 1. Load CSV — uses SharedArrayBuffer + Worker Threads
const df = await Octopus.read('trips.csv')
console.log(df.info())
// { rowCount: 7832546, columnCount: 19, memoryUsage: '...' }

// 2. Select only the columns you need (frees RAM)
const slim = df.select([
  'fare_amount', 'tip_amount', 'trip_distance',
  'passenger_count', 'tpep_pickup_datetime'
])

// 3. Clean — remove invalid rows
const clean = slim.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)

// 4. Feature engineering — vectorized over TypedArrays
clean.with_columns([
  { name: 'tip_pct', inputs: ['tip_amount', 'fare_amount'], formula: (tip, fare) => (tip / fare) * 100 },
  { name: 'revenue_per_mile', inputs: ['fare_amount', 'trip_distance'], formula: (fare, dist) => fare / dist }
])

// 5. Extract hour from timestamp
clean.col('tpep_pickup_datetime').to_datetime().extract_hour(0)

// 6. GroupBy — best hour by tip percentage
const byHour = clean.groupBy('tpep_pickup_datetime_hour', { tip_pct: 'mean' }).sort('tip_pct', false)
byHour.show(5)

// 7. Export
await clean.toCSV('output/clean_trips.csv')
static async Octopus.read(filePath: string, options?: ReadOptions) → Promise<DataFrame>
static

Detects file format by extension and routes to the appropriate engine. .csv uses the Nitro engine (SharedArrayBuffer + Workers). .json uses the JSON engine with auto flattening.

Parameter               | Type            | Default          | Description
filePath                | string          | —                | Path to the file
options.workers         | number          | os.cpus().length | Number of Worker Threads
options.indexerCapacity | number          | 10_000_000       | Max rows for column buffers
options.useOffsets      | boolean         | true             | Store byte offsets for string columns
options.type            | 'csv' or 'json' | auto             | Force a specific format
examples
const df = await Octopus.read('data.csv')
const df = await Octopus.read('data.json')
const df = await Octopus.read('data.csv', { workers: 4 })
The CSV engine reads the entire file into a single SharedArrayBuffer, then each Worker Thread receives a reference — not a copy. No data is duplicated in RAM.
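The extension-based routing can be sketched in a few lines. pickEngine and the returned engine names are illustrative stand-ins, not the library's internals:

```javascript
// Hypothetical helper mirroring the documented routing rules:
// .csv goes to the Nitro engine, .json to the JSON engine, and
// options.type forces a format regardless of extension.
function pickEngine(filePath, forcedType) {
  const ext = filePath.slice(filePath.lastIndexOf('.') + 1).toLowerCase();
  const type = forcedType ?? ext;
  if (type === 'csv') return 'nitro';
  if (type === 'json') return 'json';
  throw new Error(`Unsupported format: .${type}`);
}

pickEngine('trips.csv');        // 'nitro'
pickEngine('nobel.json');       // 'json'
pickEngine('data.txt', 'csv');  // 'nitro' (forced via options.type)
```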
JSON Engine
static async Octopus._readJSON(filePath: string) → Promise<DataFrame>
static

Automatically detects the root array (e.g. "prizes" in Nobel dataset). Recursively flattens nested objects into columns. Expands nested arrays into multiple rows, inheriting parent fields.

nested json example
// Input: { prizes: [{ year: "2023", laureates: [{id,name}] }] }
const df = await Octopus.read('nobel.json')
// Each laureate becomes its own row, with year inherited
// Columns: year, laureates_id, laureates_firstname, ...
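The flattening rules can be sketched with a small recursive function. This is a simplified illustration (it handles one array field per object), not the engine's actual code:

```javascript
// Nested objects become prefixed columns; nested arrays expand into one
// row per element, inheriting the parent's scalar fields.
function flatten(rows) {
  const out = [];
  for (const row of rows) {
    const scalars = {};
    let arrayField = null;
    for (const [key, val] of Object.entries(row)) {
      if (Array.isArray(val)) arrayField = [key, val];
      else if (val && typeof val === 'object') {
        for (const [k, v] of Object.entries(val)) scalars[`${key}_${k}`] = v;
      } else scalars[key] = val;
    }
    if (arrayField) {
      const [key, items] = arrayField;
      for (const item of flatten(items)) {
        const child = {};
        for (const [k, v] of Object.entries(item)) child[`${key}_${k}`] = v;
        out.push({ ...scalars, ...child });
      }
    } else out.push(scalars);
  }
  return out;
}

const rows = flatten([{ year: '2023', laureates: [{ id: 1 }, { id: 2 }] }]);
// Each laureate becomes its own row, inheriting year:
// [{ year: '2023', laureates_id: 1 }, { year: '2023', laureates_id: 2 }]
```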
new DataFrame(config: DataFrameConfig)
instance

Creates a new DataFrame. You typically get DataFrames from Octopus.read(), but you can construct one manually.

Property | Type                       | Description
columns  | Record<string, TypedArray> | Column data — use Float64Array for numerics
rowCount | number                     | Total number of rows
headers  | string[]                   | Column names in order
static DataFrame.fromObjects(data: object[]) → DataFrame
static

Converts a plain JS array of objects into a DataFrame. Numeric values are stored as Float64Array automatically.

example
const df = DataFrame.fromObjects([
  { name: 'Alice', score: 95 },
  { name: 'Bob',   score: 82 },
])
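The automatic numeric storage can be sketched like this; toColumns is an illustrative name, not the library's internal function:

```javascript
// Columns whose values are all numbers become Float64Array;
// everything else stays a plain JS array.
function toColumns(rows) {
  const columns = {};
  for (const name of Object.keys(rows[0])) {
    const values = rows.map(r => r[name]);
    columns[name] = values.every(v => typeof v === 'number')
      ? Float64Array.from(values)
      : values;
  }
  return columns;
}

const cols = toColumns([
  { name: 'Alice', score: 95 },
  { name: 'Bob',   score: 82 },
]);
// cols.score is a Float64Array; cols.name stays a plain array
```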
show(n?: number = 5) → void
instance

Prints the first n rows as a console.table. String values longer than 20 characters are truncated with an ellipsis (...).

info() → DataFrameInfo
instance

Returns a summary with row count, column count, column names, and estimated memory usage.

example
const info = df.info()
// { rowCount: 7832546, columnCount: 19, columns: [...], memoryUsage: '1139.45 MB' }
describe() → void
instance

Prints descriptive statistics for all numeric columns: count, mean, min, 25%, 50%, 75%, max. Non-numeric columns are skipped.

with_columns(specs: ColSpec[]) → DataFrame
instance

Vectorized feature engineering. Applies formulas row-by-row using direct TypedArray access. Optimized fast paths for 1, 2, and 4 inputs. Returns this for chaining.

ColSpec property | Type     | Description
name             | string   | Name of the new column to create
inputs           | string[] | Column names fed into the formula
formula          | Function | (...values: number[]) => number
example
df.with_columns([
  { name: 'revenue_per_mile', inputs: ['total_amount', 'trip_distance'], formula: (amount, dist) => dist > 0 ? amount / dist : 0 },
  { name: 'speed_mph', inputs: ['trip_distance', 'duration_hours'], formula: (dist, dur) => dur > 0 ? dist / dur : 0 }
])
with_columns mutates in-place and returns this. New columns are stored as Float64Array.
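What the two-input fast path amounts to can be sketched as a tight loop over raw TypedArrays; the names here are illustrative:

```javascript
// Read both input TypedArrays directly, write the result into a fresh
// Float64Array: no per-row objects, no intermediate arrays.
function applyFormula2(a, b, formula) {
  const out = new Float64Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = formula(a[i], b[i]);
  return out;
}

const tip  = Float64Array.of(2, 3);
const fare = Float64Array.of(10, 20);
const tipPct = applyFormula2(tip, fare, (t, f) => (t / f) * 100);
// tipPct ≈ [20, 15]
```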
select(columnNames: string[]) → DataFrame
instance

Returns a new DataFrame with only the specified columns. Essential for freeing RAM — drop unused columns as early as possible.

const slim = df.select(['fare_amount', 'tip_amount', 'trip_distance'])
rename(mapping: Record<string, string>) → DataFrame
instance

Renames columns without copying data. Returns a new DataFrame with updated headers.

const df2 = df.rename({ tpep_pickup_datetime: 'pickup', PULocationID: 'zone' })
cast(columnName: string, type: 'float' | 'int' | 'string') → DataFrame
instance

Forces a type conversion on a column. 'float' and 'int' both produce Float64Array. 'string' produces a regular JS Array. Mutates in-place.

cumsum(columnName: string) → DataFrame
instance

Computes a running cumulative sum over a column. Creates a new column named {columnName}_cumsum. Mutates in-place.
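The running sum itself is a one-pass loop over the TypedArray; a sketch of the technique (not the library's source):

```javascript
// Each output element is the sum of all input elements up to that index.
function cumsum(col) {
  const out = new Float64Array(col.length);
  let running = 0;
  for (let i = 0; i < col.length; i++) out[i] = (running += col[i]);
  return out;
}

const result = cumsum(Float64Array.of(1, 2, 3, 4));
// result → [1, 3, 6, 10]
```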

with_label(specs: LabelSpec[]) → DataFrame
instance

Applies a StringIndexer to encode a string column as numeric IDs. Creates a new column named {input}_indexed and stores the indexer in metadata.indexers for later decoding.
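A sketch of the string-indexing idea behind StringIndexer (illustrative names, not the library's source):

```javascript
// Each distinct string gets the next numeric ID; the labels table
// allows decoding the IDs back to strings later.
function indexColumn(values) {
  const ids = new Float64Array(values.length);
  const lookup = new Map();
  const labels = [];
  values.forEach((v, i) => {
    if (!lookup.has(v)) { lookup.set(v, labels.length); labels.push(v); }
    ids[i] = lookup.get(v);
  });
  return { ids, labels }; // labels[ids[i]] decodes back to the string
}

const { ids, labels } = indexColumn(['cash', 'card', 'cash']);
// ids → [0, 1, 0], labels → ['cash', 'card']
```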

filter(inputs: string[], predicate: Function) → DataFrame
instance

Returns a new DataFrame containing only rows where the predicate returns true. The predicate receives the values of the listed columns for each row.

example
const valid = df.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)
head(n?: number = 5) → DataFrame
instance

Returns a new DataFrame with the first n rows.

tail(n?: number = 5) → DataFrame
instance

Returns a new DataFrame with the last n rows.

dropNA() → DataFrame
instance

Removes all rows containing null, undefined, or NaN in any column.
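A sketch of the underlying mask filtering, assuming columnar storage as described above (illustrative, not the library's source):

```javascript
// Collect the indices of rows with no null/undefined/NaN in any column,
// then rebuild each column by gathering only those indices.
function dropNA(columns, rowCount) {
  const keep = [];
  for (let i = 0; i < rowCount; i++) {
    let ok = true;
    for (const col of Object.values(columns)) {
      const v = col[i];
      if (v == null || (typeof v === 'number' && Number.isNaN(v))) { ok = false; break; }
    }
    if (ok) keep.push(i);
  }
  const out = {};
  for (const [name, col] of Object.entries(columns)) {
    out[name] = ArrayBuffer.isView(col)
      ? col.constructor.from(keep, i => col[i])  // typed column
      : keep.map(i => col[i]);                   // string column
  }
  return out;
}

const cleaned = dropNA({ a: Float64Array.of(1, NaN, 3), b: ['x', 'y', 'z'] }, 3);
// cleaned.a → [1, 3], cleaned.b → ['x', 'z']
```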

fillNA(value: number | string) → DataFrame
instance

Replaces all null, undefined, or NaN values in all columns with the specified value. Mutates in-place.

str_contains(columnName: string, pattern: string) → DataFrame
instance

Filters rows where the string column matches the regex pattern. Case-insensitive.

example
const result = df.str_contains('product_name', 'wireless')
groupBy(groupCol: string, aggs: AggSpec) → DataFrame
instance

Groups rows by a column and applies aggregation functions. Supports sum, mean, count, max, and min. For each target column, pass a single op as a string or several ops as an array — when multiple ops are given, output columns are named {col}_{op}.

single op
const byHour = df.groupBy('hour', { tip_pct: 'mean', fare_amount: 'sum' })
multiple ops
const byZone = df.groupBy('zone_id', {
  fare_amount: ['sum', 'mean', 'count'],
  tip_amount:  ['sum', 'max']
})
// Columns: fare_amount_sum, fare_amount_mean, fare_amount_count, ...
groupByRange(colName: string, targetCol: string, maxRange: number) → Result[]
instance

O(n) groupBy for bounded integer keys. Uses Uint32Array as a direct lookup — no Map, no hashing, no allocations. Returns array of { group, avg } sorted descending.

This is the fastest aggregation in octopus-data. Use it when group keys are bounded integers (e.g. zone IDs, hour 0–23).
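A sketch of the bounded-key technique: sums and counts live in flat typed arrays indexed directly by the group key, so each row costs two array writes and no hashing (illustrative, not the library's source):

```javascript
// keys must be integers in [0, maxRange]; values are the target column.
function groupByRangeSketch(keys, values, maxRange) {
  const sums = new Float64Array(maxRange + 1);
  const counts = new Uint32Array(maxRange + 1);
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i];   // the key IS the bucket index
    counts[keys[i]]++;
  }
  const out = [];
  for (let g = 0; g <= maxRange; g++) {
    if (counts[g]) out.push({ group: g, avg: sums[g] / counts[g] });
  }
  return out.sort((a, b) => b.avg - a.avg); // descending by average
}

const byHour = groupByRangeSketch([2, 2, 5], [10, 20, 7], 23);
// [{ group: 2, avg: 15 }, { group: 5, avg: 7 }]
```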
groupByID(colName: string, targetCol: string) → Result[]
instance

Alias for groupByRange(colName, targetCol, 300). Preset for NYC Taxi zone IDs.

sort(columnName: string, ascending?: boolean = true) → DataFrame
instance

Index sort — builds an index array, sorts by target column values, then reorders all columns in one pass. Returns a new DataFrame.

example — top 10 most profitable trips
const top10 = df.sort('revenue_per_mile', false).head(10)
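A sketch of the index-sort technique: sort an array of row indices by the target column, then gather every column through that index in one pass (illustrative, not the library's source):

```javascript
function indexSort(columns, by, ascending = true) {
  const target = columns[by];
  const n = target.length;
  // Build the index [0, 1, ..., n-1] and sort it by the target column.
  const idx = Uint32Array.from({ length: n }, (_, i) => i);
  idx.sort((a, b) => ascending ? target[a] - target[b] : target[b] - target[a]);
  // Reorder every column through the sorted index once.
  const out = {};
  for (const [name, col] of Object.entries(columns)) {
    const reordered = new col.constructor(n);
    for (let i = 0; i < n; i++) reordered[i] = col[idx[i]];
    out[name] = reordered;
  }
  return out;
}

const sorted = indexSort(
  { fare: Float64Array.of(3, 1, 2), id: Float64Array.of(0, 1, 2) },
  'fare', false
);
// sorted.fare → [3, 2, 1]; sorted.id → [0, 2, 1] (rows stay aligned)
```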
Scalar Aggregations
Method    | Returns | Description
sum(col)  | number  | Sum of all values in a column
mean(col) | number  | Arithmetic mean — returns 0 if rowCount is 0
max(col)  | number  | Maximum value
min(col)  | number  | Minimum value
example
df.sum('fare_amount')   // 48291043.21
df.mean('tip_pct')     // 14.82
df.max('trip_distance') // 189.4
unique(columnName: string) → any[]
instance

Returns an array of unique values in a column using a Set.

nunique(columnName: string) → number
instance

Returns the count of unique values. Faster than unique().length.

value_counts(columnName: string) → { value, count }[]
instance

Returns a frequency table sorted from most to least common.

example
const freq = df.value_counts('payment_type')
// [{ value: 1, count: 5821034 }, { value: 2, count: 1823456 }]
join(other: DataFrame, on: string, how?: 'inner' | 'left' = 'inner') → DataFrame
instance

Hash join on a common column. inner returns only matching rows. left keeps all left rows, filling unmatched right columns with null.

example
const enriched = trips.join(zones, 'zone_id', 'left')
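A sketch of the hash-join strategy over plain row objects; the library operates on columns, but the idea is the same (illustrative, not the library's source):

```javascript
// Build a Map over the right side's key column, then probe it once
// per left row: O(left + right) instead of O(left * right).
function hashJoin(left, right, on, how = 'inner') {
  const index = new Map(right.map(r => [r[on], r]));
  const out = [];
  for (const row of left) {
    const match = index.get(row[on]);
    if (match !== undefined) out.push({ ...match, ...row });
    else if (how === 'left') out.push({ ...row }); // real API fills right cols with null
  }
  return out;
}

const trips = [{ zone_id: 1, fare: 10 }, { zone_id: 9, fare: 5 }];
const zones = [{ zone_id: 1, name: 'JFK' }];
const inner = hashJoin(trips, zones, 'zone_id');          // 1 matching row
const leftJ = hashJoin(trips, zones, 'zone_id', 'left');  // all 2 left rows kept
```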
toCSV(outputPath: string, options?: object) → Promise<void>
instance

Exports the DataFrame to a CSV file via streaming write. Floats are written with 4 decimal places. Validates the .csv extension.

example
await df.toCSV('output/results.csv')
toJSON(outputPath: string) → Promise<void>
instance

Exports to a JSON file. Validates .json extension.

toTXT(outputPath: string) → Promise<void>
instance

Exports to a plain text file. Validates .txt extension.

toArray() → object[]
instance

Converts the DataFrame back to a plain JS array of row objects. Useful for interoperability.

All export methods validate the file extension and throw if it doesn't match. Pass the path with the extension explicitly.
col(name: string) → Column
instance

Returns a Column instance linked to the underlying TypedArray. Enables chained arithmetic that mutates the column in-place.

example — timestamp parsing
df.col('tpep_pickup_datetime')
  .to_datetime()
  .extract_hour(0)
// Creates new column: tpep_pickup_datetime_hour
example — arithmetic between columns
df.col('total_amount')
  .sub(df.col('tip_amount'))
  .div(1.08)
Column methods
Method                      | Accepts          | Description
add(value)                  | number or Column | Addition in-place
sub(value)                  | number or Column | Subtraction in-place
mul(value)                  | number or Column | Multiplication in-place
div(value)                  | number or Column | Division in-place — guards against divide by zero
to_datetime()               | —                | Converts string timestamps to ms since epoch (Date.getTime)
extract_hour(offsetSeconds) | number           | Extracts hour 0–23 from ms timestamp. Creates {name}_hour column
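A sketch of how chained in-place arithmetic works; Col here is an illustrative stand-in for the library's Column class, not its actual implementation:

```javascript
class Col {
  constructor(data) { this.data = data; }
  // Mutates the underlying Float64Array and returns this for chaining.
  sub(other) {
    const o = other instanceof Col ? other.data : null;
    for (let i = 0; i < this.data.length; i++) this.data[i] -= o ? o[i] : other;
    return this;
  }
  div(divisor) {
    if (divisor === 0) return this; // guard against divide by zero
    for (let i = 0; i < this.data.length; i++) this.data[i] /= divisor;
    return this;
  }
}

const total = new Col(Float64Array.of(11.08, 22.16));
total.sub(new Col(Float64Array.of(0.28, 0.56))).div(1.08);
// total.data ≈ [10, 20] — both ops wrote into the same buffer
```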
Series
new Series(name, data: TypedArray, type: string, indexer?, mask?)
instance

The internal columnar primitive. Each column inside a DataFrame is backed by a Series. Use Series.fromRawBuffer() to reconstruct from raw buffer data. The optional indexer enables transparent numeric-ID to string translation via .get(index).

Series is the internal primitive — most users work with DataFrame and Column methods directly.
Method                 | Returns | Description
get(index)             | any     | Value at index — decodes via indexer if present
slice(start, end)      | Series  | Returns a slice preserving the indexer reference
Series.fromRawBuffer() | Series  | Static factory from raw buffer + metadata
Series.formatResults() | object  | Formats aggregation results with a .show() method