Tabular Display of Project Result Data
Goncalo asked me (Paul) to look into a means of displaying GWAS trait association study data in such a way to make it easier to view, extract, and download.
Here's my notes about this effort so far.
If this matures further, we can use this page as a How-To so researchers can get help posting large result datasets.
Technology[edit]
Javascript in HTML content allows web pages to be dynamic, and response to specific events such as a button being pushed.
To view a table, one can tie events, such as button pushes, together with a CGI (common gateway interface) script to get more data, such that you can scroll through tables, sort them, search them and so forth.
There are numerous table viewers that go further and use fancier techniques that effectively dynamically re-write the HTML used to show a page. The code itself is a javascript library that uses jQuery to compose an AJAX query to a backend CGI. The backend CGI can be a PHP file that does the actual database query, which is returned as a JSON encoded result (JSON is javascript object notation, and is sort of like a dict in python).
The advantage of this approach is that scrolling, searching, sorting, exporting and so forth all look more like a desktop application, and work more smoothly than doing a full page reload every time.
For our purposes, all table viewers would be tied to a PHP CGI that queries either a mysql or a sqlite3 database.
DataTables Examples[edit]
I looked at one specific table viewer called http://www.datatables.net. I did so based on the recommendation of a person who does this type of display for a living. There are no doubt many other viewers and ways of accomplishing this type of display.
The DataTables web site is pretty well laid out. There are three example pages of particular interest to us.
The links below are reached from http://www.datatables.net/examples/.
- http://www.datatables.net/examples/data_sources/server_side.html shows a simple table with search and browse. The javascript and PHP code necessary are shown on that page (you also have to install the DataTables library javascript).
- http://www.datatables.net/release-datatables/examples/basic_init/scroll_y.html shows vertical scrolling.
- http://www.datatables.net/extras/tabletools/ shows a toolbar that can copy/export/pdf and print.
Local Testing[edit]
I did several experiments with database backends. Initially, I converted their PHP example to use sqlite3, hoping it might be fast enough. I chose to try this because the administration of data tables in sqlite3 is far easier than mysql.
After trying sqlite3 with a contrived dataset, I found it to be so slow that it was effectively non-interactive.
So I tried a few more things, and ended up with these two local examples:
- https://statgen.sph.umich.edu/traitView/traitView.html
- This example is with contrived data
- My goal here was performance analysis to see how TableView would scale
- Generally, more than around 1M lines became "too slow" - around 1 second per million lines
- The SQL query ORDER BY clause is the limiting factor - not memory or disk
- https://statgen.sph.umich.edu/espldl/espldl.html
- This is real data from Youna
- My goal is to show live data to see something of possibly real use.
- 256611 records.
- "Quick" paging and searching response. Good enough to be interactive.
Data Models for Local Tests[edit]
The contrived data schema looks like this:
create table traits (
markerName CHAR(16),
effect REAL,
pValue REAL,
trait CHAR(16),
PRIMARY KEY (markerName, trait),
FOREIGN KEY (markerName) REFERENCES snpAnnotations(markerName)
);
Youna's real data schema looks like this:
create table espldl (
chromosome INTEGER,
gene CHAR(16),
rvFreq REAL,
coeff REAL,
stats REAL,
p REAL,
annotation CHAR(16),
cutoff REAL,
model CHAR(8)
);
I did create an index for each column. Here's an example:
create index chromosomeIndex on espldl (chromosome);
Test Queries for SQLite3 and mysql[edit]
I need to do more testing. I used mysql on a real machine (gdbs0) but SQLite3 on a VM (2 core, 4GigRAM). This plus some query variations that had bigger impact suggest I need to test SQLite3 more.
Contrived data with 10 rows starting at offset 199 using mysql (10 seconds):
SELECT SQL_CALC_FOUND_ROWS markerName, effect, pValue, trait from traits order by markerName asc limit 199,10;
Contrived data same as above, but using sqlite3 (oops, this is now nearly instantaneous - I see I had a misoptimization in the SQL for the SQLite3 test case):
SELECT markerName, effect, pValue, trait from traits order by markerName asc limit 199,10;
Here was one booboo - the two following queries return the same result, but are very different in performance on SQLite3:
SELECT COUNT(*) FROM traits;
SELECT COUNT(markerName) FROM traits;