Knowi enables analysis, visualization, warehousing, and reporting automation from file-based data across the following file formats and channels:
Note that data from these sources will be created as widgets directly.
Use our agent inside your network to connect and extract data from a variety of file-based data sources securely. See Cloud9Agent to download your agent along with instructions to run it. Under the examples folder, the agent contains datasource_example_csv.json and query_example_csv.json to get you started.
- Edit those to point to your directory and modify the queries to pull your data.
- Move them into the config directory (datasource_XXX.json files first, if the Agent is running).
CSV/Delimited files
Highlights:
- Execute queries against the file when it is modified, one time, or on a schedule.
- Extract/cleanse and query file-based data using Cloud9QL.
Datasource Configuration:
Parameter | Comments |
---|---|
name | Unique Datasource Name. |
datasource | Set value to csv |
url | File path. Example: /usr/local/examples/demoFile.txt |
quoteChar | Quote character for String values, if applicable. Defaults to double quotes if not specified. |
delimChar | Delimiter character. Defaults to comma if not specified. |
eolSymbols | (Optional) End-of-line symbol(s) used in the file. Example: \n |
nullFindString | (Optional) String to search for and replace with nullReplaceWith. Useful for replacing values such as 'NA' or 'None' in a number field. |
nullReplaceWith | (Optional) Replaces the string specified in nullFindString with this value. |
columnTypes | Nested JSON that defines datatypes for fields. Defaults to String if a type is not specified. Datatypes supported: String, Date(), Number, double. See the example and the Date Formats section below. |
Date Formats for Date Fields in columnTypes:
Letter | Date/Time Component |
---|---|
y | Year |
M | Month |
w | Week of year |
W | Week of month |
D | Day of year |
d | Day of month |
F | Day of week in month |
a | AM/PM marker |
H | Hour in day (0-23) |
h | Hour in am/pm (1-12) |
m | Minute in hour |
s | Second in minute |
S | Millisecond |
z | Time zone (general, e.g. PST) |
Z | Time zone (RFC 822, e.g. -0800) |
Example:
... "columnTypes":{ "Week":"Date(MM/dd/yy)", "Date":"Date(MM/dd/yy HH:mm)" } ...
Query Configuration:
Query Config Params | Comments |
---|---|
entityName | Dataset Name Identifier |
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified. |
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required. |
queryStr | Cloud9QL query used to extract and manipulate the data. Use select * to select all. See the Cloud9QL docs. |
runOnSourceChange | If set to true, the query runs whenever the file is modified. Possible values: true or false. |
frequencyType | Alternative to runOnSourceChange; picks up the file on a schedule. One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as one-time, executed upon Cloud9Agent startup (or when the query is first saved). A scheduled sketch follows this table. |
frequency | Indicates the frequency if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes. |
startTime | (Optional) Specifies when the query should run for the first time. If set, the frequency is determined from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run happens on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm |
overrideVals | Enables data storage strategies to be specified. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present. |
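For instance, a sketch of a scheduled query against the demoCSV datasource defined below — it picks the file up daily at 03:00 instead of watching for modifications:
[
  {
    "entityName":"Scheduled CSV Demo",
    "dsName":"demoCSV",
    "queryStr":"select *",
    "frequencyType":"days",
    "frequency":1,
    "startTime":"03:00",
    "overrideVals":{
      "replaceAll":true
    }
  }
]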
Datasource Example:
[
{
"name":"demoCSV",
"url":"/usr/local/examples/demo-data.csv",
"datasource":"csv",
"quoteChar":"\"",
"delimChar":",",
"eolSymbols":"\n",
"nullFindString":"None",
"nullReplaceWith":"0",
"columnTypes":{
"Week":"Date(MM/dd/yy)",
"Date":"Date(MM/dd/yy HH:mm)"
}
}
]
Query Example:
[
{
"entityName":"CSV Demo",
"dsName":"demoCSVFile",
"queryStr":"select count(*) where someField like abc",
"runOnSourceChange":true,
"overrideVals":{
"replaceAll":true
}
}
]
JSON
Highlights:
- Extract data from JSON documents
- Extract/Cleanse Query data using Cloud9QL, including nested documents.
- Trigger query execution once, on a schedule or when a file is modified.
Datasource Configuration:
Parameter | Comments |
---|---|
name | Unique Datasource Name. |
datasource | Set value to json |
url | File path. Example: /usr/local/examples/demoFile.json |
Query Configuration:
Query Config Params | Comments |
---|---|
entityName | Dataset Name Identifier |
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified. |
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required. |
queryStr | Cloud9QL query used to extract and manipulate the data. Use select * to select all. See the Cloud9QL docs. Nested queries and objects are supported. |
runOnSourceChange | Optional. If set to true, the query runs whenever the file is modified. Possible values: true or false. |
frequencyType | Alternative to runOnSourceChange; picks up the file on a schedule. One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as one-time, executed upon Cloud9Agent startup (or when the query is first saved). |
frequency | Indicates the frequency if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes. |
startTime | (Optional) Specifies when the query should run for the first time. If set, the frequency is determined from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run happens on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm |
overrideVals | Enables data storage strategies to be specified. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present (see the upsert sketch after this table). |
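A sketch of the upsert strategy against the demoJSON datasource defined below (customer is assumed to be a field in the data — adjust the key to match yours):
[
  {
    "entityName":"JSON Upsert Demo",
    "dsName":"demoJSON",
    "queryStr":"select *",
    "runOnSourceChange":true,
    "overrideVals":{
      "replaceValuesForKey":["customer"]
    }
  }
]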
Datasource Example:
[
{
"name":"demoJSON",
"url":"examples/demo-data.json",
"datasource":"json"
},
{
"name":"demoJSON-Nested",
"url":"examples/demo-nested.json",
"datasource":"json"
}
]
Query Example:
[
{
"entityName":"JSON Demo - 1",
"dsName":"demoJSON",
"queryStr":"select sum(sent) as Sent, customer as Customer group by Customer",
"runOnSourceChange":true,
"overrideVals":{
"replaceAll":true
}
},
{
"entityName":"JSON Nested Demo",
"dsName":"demoJSON-Nested",
"queryStr":"select nestedObj.a, nestedArr[0], sent where nestedArr[0]=150",
"runOnSourceChange":true,
"overrideVals":{
"replaceAll":true
}
}
]
Log Files
Highlights:
- Extract data from Logs using a regex based parser
- Extract/Cleanse Query data using Cloud9QL
- Trigger query execution once, on a schedule or when a file is modified.
Note: The examples folder of the agent contains datasource_example_regex.json and query_example_regex.json sample configurations.
Datasource Configuration:
Parameter | Comments |
---|---|
name | Unique Datasource Name. |
datasource | Set value to regex |
url | File path. Example: /usr/local/examples/demoFile.log |
logFormat | Format of the log entries to process. Example: ${date:DATE(MMM d HH:mm:ss)}${host:WORD()}\\S+:${msg} |
Given the following log entries:
Jan 25 04:02:02 abhf sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/dk
Jan 25 04:02:02 ldklf sshd[27099]: pam_unix(sshd:session): session closed for user admin
the result will be three fields:
date: parsed using the specified date format
host: the host field, parsed after the date
msg: the rest of the log entry
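For instance, the first entry above parses into a record along these lines (an illustrative sketch; in practice the date field is parsed into an actual date value):
[
  {
    "date":"Jan 25 04:02:02",
    "host":"abhf",
    "msg":"admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/dk"
  }
]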
Query Configuration:
Query Config Params | Comments |
---|---|
entityName | Dataset Name Identifier |
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified. |
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required. |
queryStr | Cloud9QL query used to extract and manipulate the data. Use select * to select all. See the Cloud9QL docs. Nested queries and objects are supported. |
runOnSourceChange | Optional. If set to true, the query runs whenever the file is modified. Possible values: true or false. |
frequencyType | Alternative to runOnSourceChange; picks up the file on a schedule. One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as one-time, executed upon Cloud9Agent startup (or when the query is first saved). |
frequency | Indicates the frequency if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes. |
startTime | (Optional) Specifies when the query should run for the first time. If set, the frequency is determined from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run happens on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm |
overrideVals | Enables data storage strategies to be specified. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present. |
Datasource Example:
[
{
"name":"demoRegex",
"url":"examples/demo-regex.log",
"datasource":"regex",
"logFormat":"${date:DATE(MMM d HH:mm:ss)}${host:WORD()}\\S+:${msg}"
}
]
Query Example:
[
{
"entityName":"Sample Regex Query",
"dsName":"demoRegex",
"queryStr":"select date(date) as date, count(*) as count group by date(date)",
"runOnSourceChange":true,
"overrideVals":{
"replaceAll":true
}
}
]
XML Files
Highlights:
- Extract data from XML files
- Traverse/Extract/Cleanse/Aggregate data using Cloud9QL
- Trigger query execution once, on a schedule or when a file is modified.
Note: The examples folder of the agent contains datasource_example_xml.json and query_example_xml.json sample configurations.
Datasource Configuration:
Parameter | Comments |
---|---|
name | Unique Datasource Name. |
datasource | Set value to xml |
url | File path. Example: /usr/local/examples/demoFile.xml |
dataNode | The base node to traverse from |
columnTypes | Nested JSON that defines datatypes for fields. For nested elements, use dot notation. Example: Count.Sent. Defaults to String if a type is not specified. Datatypes supported: String, Date(), Integer, Number, double. See the example and the Date Formats section below. |
Date Formats for Date Fields in columnTypes:
Letter | Date/Time Component |
---|---|
y | Year |
M | Month |
w | Week of year |
W | Week of month |
D | Day of year |
d | Day of month |
F | Day of week in month |
a | AM/PM marker |
H | Hour in day (0-23) |
h | Hour in am/pm (1-12) |
m | Minute in hour |
s | Second in minute |
S | Millisecond |
z | Time zone (general, e.g. PST) |
Z | Time zone (RFC 822, e.g. -0800) |
Example:
... "columnTypes":{ "Week":"Date(MM/dd/yy)", "Date":"Date(MM/dd/yy HH:mm)", "Count.Sent":"Integer" } ...
Query Configuration:
Query Config Params | Comments |
---|---|
entityName | Dataset Name Identifier |
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified. |
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required. |
queryStr | Cloud9QL query used to extract and manipulate the data. Use select * to select all. See the Cloud9QL docs. |
runOnSourceChange | If set to true, the query runs whenever the file is modified. Possible values: true or false. |
frequencyType | Alternative to runOnSourceChange; picks up the file on a schedule. One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as one-time, executed upon Cloud9Agent startup (or when the query is first saved). |
frequency | Indicates the frequency if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes. |
startTime | (Optional) Specifies when the query should run for the first time. If set, the frequency is determined from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run happens on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm |
overrideVals | Enables data storage strategies to be specified. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present. |
Datasource Example:
[
{
"name":"demoXML",
"url":"examples/demo-data.xml",
"datasource":"xml",
"dataNode":"data.item",
"columnTypes":{
"id":"Integer",
"Week":"Date(MM/dd/yy)",
"Date":"Date(MM/dd/yy HH:mm)",
"Count.Sent":"Integer"
}
}
]
Query Example:
[
{
"entityName":"XML Demo - 1",
"dsName":"demoXML",
"queryStr":"select *",
"runOnSourceChange":true,
"overrideVals":{
"replaceAll":true
}
},
{
"entityName":"XML Demo - 2",
"dsName":"demoXML",
"queryStr":"select sum(Count.Sent) as Sent, Customer where recipients.recipient[0] = c001@cloud9charts.com",
"runOnSourceChange":true,
"overrideVals":{
"replaceAll":true
}
}
]
The first query returns the entire document as is with nested elements intact. The second traverses nested relationships to build a simple aggregation, based on specific conditions.
FTP
Highlights:
- Extends the CSV/delimited functionality to connect to and pull files from an FTP site.
- Can be set to process the file only if it differs from the last pulled file.
Datasource and Query configurations are the same as the CSV/JSON processing, except for a customProcessors attribute that enables FTP.
Example CSV Datasource using FTP to pull file:
[
{
"name":"pageViewsLogs",
"url":"/export/home/akjd/file.csv",
"datasource":"csv",
"quoteChar":"\"",
"delimChar":"|",
"eolSymbols":"\n",
"nullFindString":"None",
"nullReplaceWith":"0",
"columnTypes":{
"startCount":"int",
"endCount":"int"
},
"customProcessors":[
{
"className":"com.cloud9.connector.processors.FTPProcessor",
"processOrder":"timed",
"frequencyType":"days",
"frequency":1,
"startTime":"03:15",
"params":{
"host":"ftp.box.com",
"user":"box@cloud9charts.com",
"password":"somePass",
"remoteFile":"/BoxDir/\${date:yyyyMMdd}/page_views_\${date:yyyyMMdd-1d}000000_\${date:yyyyMMdd}000000.csv",
"localFile":"/export/home/akjd/file.csv",
"overwriteOnlyWhenChanged":true
}
}
]
}
]
FTP Processor params configuration:
FTP Processor Parameter | Comments |
---|---|
host | FTP host to connect to |
port | (Optional) Port number |
user | User to connect to the FTP site with. Connects as anonymous if not specified |
password | Password to connect with. Connects as anonymous if a user is not specified |
mode | (Optional) Security mode. Default is empty. Values: ftps or ftpes |
remoteFile | Remote file name. For files with dates in the name, use ${date:dateFormat} for the current date. You can also add or subtract days or weeks. Example: ${date:yyyyMMdd-1d} |
localFile | Where to put the file pulled from the FTP site. Typically, this is set to the same value as the url parameter in the main datasource configuration |
overwriteOnlyWhenChanged | If set to true, overwrites the local copy only if the remote file differs from the local copy. |
Email Attachments
This can be used to turn existing text reports sent via email into widgets and dashboards, with storage and trends.
Highlights:
- Visualize, transform and track your existing CSV/delimited email reports.
- CC your existing email report to us.
Contact us for the mailbox to CC your report to, and we'll take care of the rest.
Excel Files
Highlights:
- Process Excel files when the file is modified, on a schedule, or one-time.
- Extract/Cleanse data from a Worksheet using Cloud9QL.
Datasource Configuration:
Parameter | Comments |
---|---|
name | Unique Datasource Name. |
datasource | Set value to excel |
url | File path. Example: /usr/local/examples/demoFile.xlsx |
Query Configuration:
Query Config Params | Comments |
---|---|
entityName | Dataset Name Identifier |
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified. |
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required. |
queryStr | Cloud9QL query used to extract and manipulate data from the worksheet. Use select * to select everything from the worksheet. See the Cloud9QL docs. |
excelWorksheetName | Worksheet name to process data from |
runOnSourceChange | If set to true, the query runs whenever the file is modified. Possible values: true or false. |
frequencyType | Alternative to runOnSourceChange; picks up the file on a schedule. One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as one-time, executed upon Cloud9Agent startup (or when the query is first saved). |
frequency | Indicates the frequency if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes. |
startTime | (Optional) Specifies when the query should run for the first time. If set, the frequency is determined from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run happens on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm |
overrideVals | Enables data storage strategies to be specified. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present. |
Datasource Example:
[
{
"name":"demoExcel",
"url":"/usr/local/abc/files/demo-data.xlsx",
"datasource":"excel"
}
]
Query Example:
[
{
"entityName":"Excel Data - Demo Sent",
"dsName":"demoExcel",
"queryStr":"select sum(sent) as Sent, Customer group by customer",
"runOnSourceChange":true,
"excelWorksheetName":"demo-data",
"overrideVals":{
"replaceAll":true
}
}
]
HTTP File Downloads
Highlights:
- Download and process CSV/JSON files via HTTP.
- Extract/manipulate data using Cloud9QL.
- Pull on a schedule, once, or only when the file has been modified since the last run (a scheduled sketch follows the query example below).
Datasource Example:
[
{
"name":"demoHttpPublic",
"url":"https://s3.amazonaws.com/datatesting/demo-data-public.csv",
"datasource":"http",
"dataType":"csv",
"localFile":"./downloadedFile.csv"
}
]
Query Example:
[
{
"entityName":"HTTP File Demo",
"dsName":"demoHttpPublic",
"queryStr":"select date(date) as Date, sum(sent) as Total Opened, sum(opened) as Opened group by date(date)",
"overrideVals":{
"replaceAll":true
}
}
]
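To pull the file on a schedule instead of just once, the scheduling attributes from the query configuration tables above apply here as well. A sketch:
[
  {
    "entityName":"HTTP File Demo - Daily",
    "dsName":"demoHttpPublic",
    "queryStr":"select *",
    "frequencyType":"days",
    "frequency":1,
    "overrideVals":{
      "replaceAll":true
    }
  }
]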
Amazon S3
Highlights:
- Process CSV/JSON files stored in S3.
- Extract/manipulate data using Cloud9QL.
- Pull on a schedule, once, or process if the file has been modified after last run.
- Process file names with dates dynamically, based on the current date.
Authorization
For files that require authentication/authorization, create credentials from AWS:
- Log in to the AWS IAM console.
- Click 'Create a New Group of Users' and enter a group name. Select 'Read Only Access'.
- Enable "s3:Get*" in the Policy Document and remove all others (see the sketch after this list).
- Click 'Create New Users'. Create a user name with 'Generate an access key for each user' checked.
- Once created, use 'Show Credentials' to note down the credentials for the user.
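A minimal sketch of what that policy document looks like (standard IAM policy syntax; the wildcard Resource is an assumption — scope it to your bucket if preferred):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:Get*",
      "Resource": "*"
    }
  ]
}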
Datasource Configuration:
Parameter | Comments |
---|---|
name | Unique Datasource Name. |
datasource | Set value to s3 |
s3RegionName | Region associated with the bucket. Example: us-east-1 |
s3BucketName | S3 Bucket Name/Identifier |
s3FileKey | S3 file name (key) |
dataType | File format. Supported: json/csv. Defaults to csv if not specified. |
localFile | (Optional) Location and name of the local file upon download. If not specified, the file is downloaded under the same name as the S3 key into the base agent directory. |
authAccessToken | S3 credentials key obtained for the user during the authorization step |
authAccessTokenSecret | S3 credentials secret obtained for the user during the authorization step |
Query Configuration:
Query Config Params | Comments |
---|---|
entityName | Dataset Name Identifier |
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified. |
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required. |
queryStr | Cloud9QL query used to extract and manipulate the data. Use select * to select all. See the Cloud9QL docs. |
runOnSourceChange | If set to true, the query runs whenever the file is modified. Possible values: true or false. |
frequencyType | Alternative to runOnSourceChange; picks up the file on a schedule. One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as one-time, executed upon Cloud9Agent startup (or when the query is first saved). |
frequency | Indicates the frequency if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes. |
startTime | (Optional) Specifies when the query should run for the first time. If set, the frequency is determined from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run happens on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm |
overrideVals | Enables data storage strategies to be specified. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present. |
Datasource Example:
[
{
"name":"demoS3CSV",
"datasource":"s3",
"s3RegionName":"us-east-1",
"s3BucketName":"datatesting",
"s3FileKey":"demo-data.csv",
"dataType":"csv",
"authAccessToken":"someToken",
"authAccessTokenSecret":"someSecret",
"localFile":"./downloadedFile.csv"
},
{
"name":"demoS3JSON",
"datasource":"s3",
"s3RegionName":"us-east-1",
"s3BucketName":"datatesting",
"s3FileKey":"demo-data.json",
"dataType":"json",
"authAccessToken":"someToken",
"authAccessTokenSecret":"someSecret"
}
]
For an s3BucketName or s3FileKey with a date-based name, use ${date:<dateFormat>} to resolve the name against the current date.
Examples:
i. If the current date is 08/01/2014 and today's data is identified by 20140801-file.csv in S3, define it as ${date:yyyyMMdd}-file.csv.
ii. If the current date is 08/01/2014 and the file contains multiple days' worth of data, identified by 20140801-20140727-file.csv, define it as ${date:yyyyMMdd}-${date:yyyyMMdd-5d}-file.csv.
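A sketch of a datasource entry using the first pattern (region, bucket, and credentials are placeholders):
[
  {
    "name":"demoS3Daily",
    "datasource":"s3",
    "s3RegionName":"us-east-1",
    "s3BucketName":"datatesting",
    "s3FileKey":"${date:yyyyMMdd}-file.csv",
    "dataType":"csv",
    "authAccessToken":"someToken",
    "authAccessTokenSecret":"someSecret"
  }
]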
Query Example:
[
{
"entityName":"S3 CSV",
"dsName":"demoS3CSV",
"queryStr":"select * limit 10",
"overrideVals":{
"replaceAll":true
}
},
{
"entityName":"S3 JSON",
"dsName":"demoS3JSON",
"queryStr":"select * where sent > 10000",
"overrideVals":{
"replaceAll":true
}
}
]
Public Files
For public files on S3 that do not require S3 credentials, see our HTTP File Downloads section.