INTRODUCTION
A columnar data engine for Node.js. Zero dependencies. Built on SharedArrayBuffer, Worker Threads, and TypedArrays.
octopus-data reads your entire dataset into a SharedArrayBuffer — one contiguous block of RAM. Worker Threads receive a reference to that buffer, not a copy. Each worker processes its own partition in parallel, writing results back to shared memory.
Numeric columns are stored as Float64Array views over the shared buffer. This means aggregations, filters, and feature engineering operate directly on raw memory — no object allocation, no garbage collection pressure.
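The zero-copy mechanism described here can be sketched in plain Node.js. This illustrates how SharedArrayBuffer and Float64Array views interact, not octopus-data's actual buffer layout:

```javascript
// A SharedArrayBuffer holds the raw column bytes; a Float64Array is
// only a view over it, so sharing it involves no copying.
const ROWS = 4
const sab = new SharedArrayBuffer(ROWS * Float64Array.BYTES_PER_ELEMENT)
const fares = new Float64Array(sab)

// A worker receiving `sab` via postMessage sees these same bytes.
fares.set([12.5, 7.0, 30.25, 9.5])

// Any other view over the same buffer observes the writes immediately.
const sameMemory = new Float64Array(sab)
console.log(sameMemory[2]) // 30.25
```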
Uses only Node.js built-ins: fs, worker_threads, os, path. No npm install required beyond the package itself.

Octopus — Static entry point. Use Octopus.read() to load CSV or JSON files into a DataFrame.
DataFrame — The main data structure. Holds columnar data and exposes all transformation, aggregation, and export methods.
Column — A linked reference to a single column that enables chained arithmetic ops.
Series — The internal columnar primitive. Each column inside a DataFrame is a Series backed by a TypedArray.
INSTALLATION
octopus-data requires Node.js 18+ for SharedArrayBuffer and Worker Threads support.
npm i octopus-data
import { Octopus } from 'octopus-data'
import { DataFrame } from 'octopus-data'
import { Series, Column } from 'octopus-data'
Make sure your package.json has "type": "module", or use .mjs extensions. SharedArrayBuffer needs cross-origin isolation headers only in a browser context; in Node.js (server-side), no extra configuration is needed.
QUICKSTART
A complete example loading, cleaning, transforming, and aggregating 7.8M rows.
import { Octopus } from 'octopus-data'

// 1. Load CSV — uses SharedArrayBuffer + Worker Threads
const df = await Octopus.read('trips.csv')
console.log(df.info())
// { rowCount: 7832546, columnCount: 19, memoryUsage: '...' }

// 2. Select only the columns you need (frees RAM)
const slim = df.select([
  'fare_amount', 'tip_amount', 'trip_distance',
  'passenger_count', 'tpep_pickup_datetime'
])

// 3. Clean — remove invalid rows
const clean = slim.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)

// 4. Feature engineering — vectorized over TypedArrays
clean.with_columns([
  {
    name: 'tip_pct',
    inputs: ['tip_amount', 'fare_amount'],
    formula: (tip, fare) => (tip / fare) * 100
  },
  {
    name: 'revenue_per_mile',
    inputs: ['fare_amount', 'trip_distance'],
    formula: (fare, dist) => fare / dist
  }
])

// 5. Extract hour from timestamp
clean.col('tpep_pickup_datetime').to_datetime().extract_hour(0)

// 6. GroupBy — best hour by tip percentage
const byHour = clean
  .groupBy('tpep_pickup_datetime_hour', { tip_pct: 'mean' })
  .sort('tip_pct', false)
byHour.show(5)

// 7. Export
await clean.toCSV('output/clean_trips.csv')
OCTOPUS.READ()
Universal entry point for loading CSV and JSON files into a DataFrame.
Detects file format by extension and routes to the appropriate engine. .csv uses the Nitro engine (SharedArrayBuffer + Workers). .json uses the JSON engine with auto flattening.
| Parameter | Type | Default | Description |
|---|---|---|---|
| filePath | string | — | Path to the file |
| options.workers | number | os.cpus().length | Number of Worker Threads |
| options.indexerCapacity | number | 10_000_000 | Max rows for column buffers |
| options.useOffsets | boolean | true | Store byte offsets for string columns |
| options.type | 'csv' \| 'json' | auto | Force a specific format |
const df = await Octopus.read('data.csv')
const df = await Octopus.read('data.json')
const df = await Octopus.read('data.csv', { workers: 4 })
The JSON engine automatically detects the root array (e.g. "prizes" in the Nobel dataset), recursively flattens nested objects into columns, and expands nested arrays into multiple rows, inheriting parent fields.
// Input: { prizes: [{ year: "2023", laureates: [{id,name}] }] }
const df = await Octopus.read('nobel.json')
// Each laureate becomes its own row, with year inherited
// Columns: year, laureates_id, laureates_firstname, ...
CORE & DISPLAY
DataFrame constructor, static builders, and display methods.
Creates a new DataFrame. You typically get DataFrames from Octopus.read(), but you can construct one manually.
| Property | Type | Description |
|---|---|---|
| columns | Record<string, TypedArray> | Column data — use Float64Array for numerics |
| rowCount | number | Total number of rows |
| headers | string[] | Column names in order |
Converts a plain JS array of objects into a DataFrame. Numeric values are stored as Float64Array automatically.
const df = DataFrame.fromObjects([
  { name: 'Alice', score: 95 },
  { name: 'Bob', score: 82 },
])
Prints the first n rows as a console.table. String values longer than 20 characters are truncated with ....
Returns a summary with row count, column count, column names, and estimated memory usage.
example

const info = df.info()
// { rowCount: 7832546, columnCount: 19, columns: [...], memoryUsage: '1139.45 MB' }
Prints descriptive statistics for all numeric columns: count, mean, min, 25%, 50%, 75%, max. Non-numeric columns are skipped.
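The statistics above can be computed over a sorted copy of each column. A sketch using nearest-rank percentiles (octopus-data's exact interpolation rule is not documented here, so that part is an assumption):

```javascript
// Descriptive stats for a single numeric column (nearest-rank percentiles)
function describeColumn(values) {
  const sorted = Float64Array.from(values).sort() // TypedArray sort is numeric
  const n = sorted.length
  const pct = (q) => sorted[Math.min(n - 1, Math.floor(q * n))]
  let sum = 0
  for (const v of sorted) sum += v
  return {
    count: n,
    mean: sum / n,
    min: sorted[0],
    p25: pct(0.25),
    p50: pct(0.5),
    p75: pct(0.75),
    max: sorted[n - 1]
  }
}

const stats = describeColumn([4, 1, 3, 2])
console.log(stats.min, stats.max) // 1 4
```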
TRANSFORMATION
Methods for creating new columns, selecting, renaming, casting, and reshaping data.
Vectorized feature engineering. Applies formulas row-by-row using direct TypedArray access. Optimized fast paths for 1, 2, and 4 inputs. Returns this for chaining.
| ColSpec property | Type | Description |
|---|---|---|
| name | string | Name of the new column to create |
| inputs | string[] | Column names fed into the formula |
| formula | Function | (...values: number[]) => number |
df.with_columns([
  {
    name: 'revenue_per_mile',
    inputs: ['total_amount', 'trip_distance'],
    formula: (amount, dist) => dist > 0 ? amount / dist : 0
  },
  {
    name: 'speed_mph',
    inputs: ['trip_distance', 'duration_hours'],
    formula: (dist, dur) => dur > 0 ? dist / dur : 0
  }
])
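The two-input fast path described above amounts to a single tight loop over raw Float64Arrays. A sketch of the mechanism, not the library's actual code:

```javascript
// 2-input fast path: one loop, direct TypedArray access, no per-row objects
function withColumn2(rowCount, a, b, formula) {
  const out = new Float64Array(rowCount)
  for (let i = 0; i < rowCount; i++) {
    out[i] = formula(a[i], b[i])
  }
  return out
}

const fare = Float64Array.of(10, 20, 0)
const dist = Float64Array.of(2, 4, 1)
const revenuePerMile = withColumn2(3, fare, dist, (f, d) => d > 0 ? f / d : 0)
console.log(Array.from(revenuePerMile)) // [ 5, 5, 0 ]
```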
Returns this; new columns are stored as Float64Array.

Returns a new DataFrame with only the specified columns. Essential for freeing RAM — drop unused columns as early as possible.
const slim = df.select(['fare_amount', 'tip_amount', 'trip_distance'])
Renames columns without copying data. Returns a new DataFrame with updated headers.
const df2 = df.rename({ tpep_pickup_datetime: 'pickup', PULocationID: 'zone' })
Forces a type conversion on a column. 'float' and 'int' both produce Float64Array. 'string' produces a regular JS Array. Mutates in-place.
Computes a running cumulative sum over a column. Creates a new column named {columnName}_cumsum. Mutates in-place.
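The cumulative sum is a single accumulator pass. A sketch of the mechanism (the real method writes the result into the DataFrame as {columnName}_cumsum):

```javascript
// Running cumulative sum over a numeric column
function cumsum(values) {
  const out = new Float64Array(values.length)
  let acc = 0
  for (let i = 0; i < values.length; i++) {
    acc += values[i]
    out[i] = acc
  }
  return out
}

console.log(Array.from(cumsum(Float64Array.of(1, 2, 3, 4)))) // [ 1, 3, 6, 10 ]
```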
Applies a StringIndexer to encode a string column as numeric IDs. Creates a new column named {input}_indexed and stores the indexer in metadata.indexers for later decoding.
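A minimal indexer sketch. The encode/decode method names are assumptions; the text only specifies that strings become numeric IDs and that the indexer is kept for later decoding:

```javascript
// First-seen order assigns numeric IDs; labels allow decoding later
class StringIndexer {
  constructor() {
    this.ids = new Map()  // string -> id
    this.labels = []      // id -> string
  }
  encode(value) {
    if (!this.ids.has(value)) {
      this.ids.set(value, this.labels.length)
      this.labels.push(value)
    }
    return this.ids.get(value)
  }
  decode(id) {
    return this.labels[id]
  }
}

const idx = new StringIndexer()
const encoded = ['cash', 'card', 'cash'].map((v) => idx.encode(v))
// encoded is [0, 1, 0]; idx.decode(1) is 'card'
```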
FILTERING
Methods for selecting subsets of rows based on conditions, position, or null values.
Returns a new DataFrame containing only rows where the predicate returns true. The predicate receives the values of the listed columns for each row.
const valid = df.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)
Returns a new DataFrame with the first n rows.
Returns a new DataFrame with the last n rows.
Removes all rows containing null, undefined, or NaN in any column.
Replaces all null, undefined, or NaN values in all columns with the specified value. Mutates in-place.
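For numeric columns, null and undefined coerce to NaN when written into a Float64Array, so a NaN scan covers all three cases. A sketch of the in-place fill:

```javascript
// In-place fill: NaN is the only value not equal to itself
function fillna(column, fillValue) {
  for (let i = 0; i < column.length; i++) {
    if (Number.isNaN(column[i])) column[i] = fillValue
  }
  return column
}

const col = Float64Array.of(1, NaN, 3)
fillna(col, 0)
console.log(Array.from(col)) // [ 1, 0, 3 ]
```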
Filters rows where the string column matches the regex pattern. Case-insensitive.
example

const result = df.str_contains('product_name', 'wireless')
AGGREGATION
GroupBy, sorting, and scalar aggregation methods.
Groups rows by a column and applies aggregation functions. Supports sum, mean, count, max, min. Pass a string for a single op or an array for multiple — output columns will be named {col}_{op}.
const byHour = df.groupBy('hour', { tip_pct: 'mean', fare_amount: 'sum' })

example — multiple ops

const byZone = df.groupBy('zone_id', {
  fare_amount: ['sum', 'mean', 'count'],
  tip_amount: ['sum', 'max']
})
// Columns: fare_amount_sum, fare_amount_mean, fare_amount_count, ...
O(n) groupBy for bounded integer keys. Uses Uint32Array as a direct lookup — no Map, no hashing, no allocations. Returns array of { group, avg } sorted descending.
Alias for groupByRange(colName, targetCol, 300). Preset for NYC Taxi zone IDs.
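The direct-lookup trick works because the keys are bounded integers: two flat arrays replace a hash map entirely. A sketch of a mean-per-group pass, assuming the { group, avg } output shape described above:

```javascript
// O(n) mean-by-group for integer keys in [0, maxKey)
function groupByRange(keys, values, maxKey) {
  const sums = new Float64Array(maxKey)
  const counts = new Uint32Array(maxKey)
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i]
    counts[keys[i]] += 1
  }
  const out = []
  for (let g = 0; g < maxKey; g++) {
    if (counts[g] > 0) out.push({ group: g, avg: sums[g] / counts[g] })
  }
  return out.sort((a, b) => b.avg - a.avg) // descending, as documented
}

const zones = Uint32Array.of(1, 2, 1, 2)
const fares = Float64Array.of(10, 5, 20, 15)
const result = groupByRange(zones, fares, 3)
// result: group 1 has avg 15, group 2 has avg 10
```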
Index sort — builds an index array, sorts by target column values, then reorders all columns in one pass. Returns a new DataFrame.
example — top 10 most profitable trips

const top10 = df.sort('revenue_per_mile', false).head(10)
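The index-sort approach can be sketched as follows: sort a Uint32Array of row positions by the target column, then gather every column through that index. This sketch assumes all-numeric columns for brevity:

```javascript
// Index sort: sort positions, not data, then gather once per column
function indexSort(columns, byColumn, ascending) {
  const target = columns[byColumn]
  const n = target.length
  const index = Uint32Array.from({ length: n }, (_, i) => i)
  index.sort((a, b) => ascending ? target[a] - target[b] : target[b] - target[a])

  const out = {}
  for (const name of Object.keys(columns)) {
    const src = columns[name]
    const dst = new Float64Array(n)
    for (let i = 0; i < n; i++) dst[i] = src[index[i]]
    out[name] = dst
  }
  return out
}

const sorted = indexSort(
  { fare: Float64Array.of(5, 20, 10), tip: Float64Array.of(1, 4, 2) },
  'fare',
  false
)
// sorted.fare is [20, 10, 5]; sorted.tip follows along as [4, 2, 1]
```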
| Method | Returns | Description |
|---|---|---|
| sum(col) | number | Sum of all values in a column |
| mean(col) | number | Arithmetic mean — returns 0 if rowCount is 0 |
| max(col) | number | Maximum value |
| min(col) | number | Minimum value |
df.sum('fare_amount')   // 48291043.21
df.mean('tip_pct')      // 14.82
df.max('trip_distance') // 189.4
INSPECTION
Methods for exploring unique values, frequencies, and joining DataFrames.
Returns an array of unique values in a column using a Set.
Returns the count of unique values. Faster than unique().length.
Returns a frequency table sorted from most to least common.
example

const freq = df.value_counts('payment_type')
// [{ value: 1, count: 5821034 }, { value: 2, count: 1823456 }]
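The frequency table amounts to one Map pass plus a sort. A sketch of the mechanism:

```javascript
// Count occurrences, then sort most-common first
function valueCounts(values) {
  const counts = new Map()
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1)
  return [...counts]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
}

const freq = valueCounts([1, 2, 1, 1, 2, 3])
// [{ value: 1, count: 3 }, { value: 2, count: 2 }, { value: 3, count: 1 }]
```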
Hash join on a common column. inner returns only matching rows. left keeps all left rows, filling unmatched right columns with null.
const enriched = trips.join(zones, 'zone_id', 'left')
EXPORT
Write DataFrames to disk in CSV, JSON, TXT, or plain JS arrays.
Exports the DataFrame to a CSV file via streaming write. Floats are written with 4 decimal places. Validates the .csv extension.
await df.toCSV('output/results.csv')
Exports to a JSON file. Validates .json extension.
Exports to a plain text file. Validates .txt extension.
Converts the DataFrame back to a plain JS array of row objects. Useful for interoperability.
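The conversion simply transposes columns back into row objects. A sketch:

```javascript
// Columnar storage -> row objects (one object allocation per row)
function toObjects(columns, rowCount) {
  const names = Object.keys(columns)
  const rows = []
  for (let i = 0; i < rowCount; i++) {
    const row = {}
    for (const name of names) row[name] = columns[name][i]
    rows.push(row)
  }
  return rows
}

const rows = toObjects({ fare: Float64Array.of(10, 20), tip: Float64Array.of(1, 2) }, 2)
// [{ fare: 10, tip: 1 }, { fare: 20, tip: 2 }]
```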
COLUMN & SERIES
Low-level column operations and the internal Series primitive.
Returns a Column instance linked to the underlying TypedArray. Enables chained arithmetic that mutates the column in-place.
df.col('tpep_pickup_datetime')
  .to_datetime()
  .extract_hour(0)
// Creates new column: tpep_pickup_datetime_hour

example — arithmetic between columns

df.col('total_amount')
  .sub(df.col('tip_amount'))
  .div(1.08)
| Method | Accepts | Description |
|---|---|---|
| add(value) | number \| Column | Addition in-place |
| sub(value) | number \| Column | Subtraction in-place |
| mul(value) | number \| Column | Multiplication in-place |
| div(value) | number \| Column | Division in-place — guards against divide by zero |
| to_datetime() | — | Converts string timestamps to ms since epoch (Date.getTime) |
| extract_hour(offsetSeconds) | number | Extracts hour 0–23 from ms timestamp. Creates {name}_hour column |
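The datetime pair can be sketched as two passes over a numeric column. This assumes UTC interpretation shifted by offsetSeconds; how the library handles time zones is not specified here:

```javascript
// Parse timestamps to ms since epoch (Date.getTime), then derive the hour
function toDatetime(strings) {
  return Float64Array.from(strings, (s) => new Date(s).getTime())
}

function extractHour(msColumn, offsetSeconds) {
  const MS_PER_HOUR = 3600000
  return Float64Array.from(msColumn, (ms) => {
    const shifted = ms + offsetSeconds * 1000
    return Math.floor(shifted / MS_PER_HOUR) % 24 // 0-23, UTC-based
  })
}

const ms = toDatetime(['2024-01-15T08:30:00Z', '2024-01-15T23:10:00Z'])
const hours = extractHour(ms, 0)
// hours is [8, 23]
```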
The internal columnar primitive. Each column inside a DataFrame is backed by a Series. Use Series.fromRawBuffer() to reconstruct from raw buffer data. The optional indexer enables transparent numeric-ID to string translation via .get(index).
| Method | Returns | Description |
|---|---|---|
| get(index) | any | Value at index — decodes via indexer if present |
| slice(start, end) | Series | Returns a slice preserving the indexer reference |
| Series.fromRawBuffer() | Series | Static factory from raw buffer + metadata |
| Series.formatResults() | object | Formats aggregation results with a .show() method |