Databases and large data-sets

I started my career on the software side of IT in an exclusively Microsoft shop. Tools such as Perl and MySQL were looked down upon as ‘unsupportable’. What I have found is quite the contrary, but more than anything these days the specific tool for a basic task is less relevant, as there are a lot of ways to get things done.

I have been using MySQL and Perl for large data manipulation for the past two years and have come to love both. SQL is more or less the same across platforms; where the focus has to lie is in the configuration and the thought you put into how you architect your code. What you need to accomplish affects this strategy the most. If you just need to load large sets of data, that’s easy. But if you need to know whether the data is already in the database or needs to be updated, that is where things get tricky.
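If you do need that insert-or-update behavior, MySQL’s INSERT … ON DUPLICATE KEY UPDATE handles both cases in one statement, as long as the table has a unique key to collide on. A minimal sketch (the table, columns and values here are illustrative, not from my actual schema):

```perl
use strict;
use warnings;

# Hypothetical table: eod_data(symbol, date, close)
# with a unique key on (symbol, date).
my $sql = q{
    INSERT INTO eod_data (symbol, date, close)
    VALUES (?, ?, ?)
    ON DUPLICATE KEY UPDATE close = VALUES(close)
};

# With a live DBI handle you would run it like so; the first execute
# inserts the row, a re-run of the same file just refreshes it:
#   my $sth = $dbh->prepare($sql);
#   $sth->execute('MSFT', '2012-01-03', 26.77);
print $sql;
```

The nice part is that a duplicate file stops being an error condition; re-loading is idempotent.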

So, the reason for this post. I decided to see what I could do with basic financial data I could buy through a financial service. I run an app that they provide that pulls the latest end-of-day stock information for every ticker on every financial exchange in the world. I can also grab fundamentals. Since I don’t care what changed each day, doing a mass database load of each file is really easy: I can load upwards of a million records in a second or two. Here is how I did it using Perl and MySQL. Nothing revolutionary here; I just read the MySQL documentation and some Perl CPAN and perlmonks.com posts. I chose to use the MySQL bulk load command in this case, instead of building a single insert for each file with an array of values, mostly out of laziness and a desire for simplicity. In other scripts I use for forensic purposes I wouldn’t do this, as I need to evaluate, scrub and probably transform the raw data into some usable format first.
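For the record, the approach I skipped (building a single insert per file from an array of values) would look roughly like this; the table, columns and sample rows are illustrative:

```perl
use strict;
use warnings;

# Rows as parsed from one EOD CSV file (values are made up).
my @rows = (
    ['MSFT', '2012-01-03', 26.55, 26.96, 26.39, 26.77, 64731500],
    ['IBM',  '2012-01-03', 186.73, 188.71, 186.00, 186.30, 5646000],
);

# One placeholder group per row: (?,?,?,?,?,?,?),(?,?,?,?,?,?,?)
my $placeholders = join(',', ('(' . join(',', ('?') x 7) . ')') x @rows);
my $sql = "INSERT IGNORE INTO eod_data "
        . "(Symbol,Date,Open,High,Low,Close,Volume) VALUES $placeholders";

# Flatten the rows into one bind list so DBI handles all the quoting.
my @binds = map { @$_ } @rows;

# With a live handle: $dbh->do($sql, undef, @binds);
print "$sql\n";
```

One round trip per file, and the placeholders sidestep quoting and escaping problems you would otherwise have to scrub for. For million-row files, though, LOAD DATA is still the faster path.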

So you know what you are looking at, there are basically two steps. First, load the end-of-day information for the exchanges; this includes volume and such. The second step is to loop through the list of end-of-day files that have the basics for each stock: open, close, high, low, volume, etc. The next phase, which I have not yet included, is in development and will take in the data for stock splits, adjustments and fundamentals like dividends, P/E and such (my favorite), but that is more difficult to automate since it’s not in a nice and neat CSV format. If I stay motivated enough I will start building a list of stocks I care about, parse the prospectuses, 10-Ks and such, and build an analytics database, but that sounds exhausting and I’d rather just borrow Watson! If someone offered to pay me I’d do all this, but heck, all the high-frequency traders already have this stuff and more. “The world isn’t run by people anymore, Marty. It’s all little 1’s and 0’s.” Kudos for posting the movie and actor that line is from.

Now you will see why a nice simple Perl script trumps a big bloated .NET app. Plus I can compile this, making it portable to any Win32 box without dealing with dependencies or version problems. As written, of course, it will only run on the computer hosting the MySQL db.

The raw code:

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use POSIX qw(strftime);
use DBI;

#Setup not shown in the original excerpt: the directory to scan and the DBI
#handle the subs below rely on. Adjust the path and credentials to your box.
my $workitems_path = 'C:\workitems';
my $dbh = DBI->connect('DBI:mysql:database=financial', 'root', '<password>')
    or die "connect failed: " . DBI->errstr() . "\n";

#lets do stuff. This is the guts
find(\&Wanted, $workitems_path);

sub Wanted {
#print "$_\n"; next;

my $dt = &GetDateTime();

my $path = "$File::Find::dir\\$_";
my ($exchange,$dl_date,$ext) = split(/[_.]/,$_,3);
$path =~ s/\//\\/g; #File::Find reports forward slashes; flip them to backslashes for Windows
$path =~ s/\\\\/\\/g; #collapse any doubled backslashes

###START## ExchangeList logic only
if ($_ eq "ExchangeList.xml") {

###For checking if we've already loaded the file for the day.
my $select1 = "select count(*) from financial.exchangelist_log where process_date >= '" . &GetDate() . "' ";

#print "$select1\n"; exit;
my @rowcount = $dbh->selectrow_array($select1) or die "#2.a count check select failed: " . $dbh->errstr() . "\n";
my $rc = $rowcount[0];
###

if ($rc == 0) {
qx(mysql -u root --password=<password> -D financial -e "LOAD XML LOCAL INFILE '$_' IGNORE INTO TABLE financial.ExchangeList ROWS IDENTIFIED BY '<EXCHANGE>' ");

#checks for errors
if ($? == 0) {
my @lt1 = localtime( (stat($path))[9] );
my $fdate = strftime("%Y-%m-%d", @lt1);
#if no errors add a log entry. (ghetto, i know, but $? actually works to catch qx errors)
$dbh->do("insert into financial.exchangelist_log (id,date,path,process_date) values (null,'$fdate','$path','$dt')");

print "Loading: latest ExchangeList.xml data...\n";
}
} else {
print "ExchangeList already loaded\n";
}
#exit;
}
###END## ExchangeList logic only

###START##This is for End of Day Data only
if ($_ =~ m/\.csv$/)
{
###SECTION: Some logic to see if the file has been run already!!
my $select1 = "select count(*) from financial.eod_data_log where filename = '$_' ";

#print "$select1\n"; exit;
my @rowcount = $dbh->selectrow_array($select1) or die "#2.b count check select failed: " . $dbh->errstr() . "\n";
my $rc = $rowcount[0];

### END dupe check

if ($rc == 0) {
print "Loading: $_\n";

#Load EOD file. ***make sure you have the mysql install bin dir in your PATH
#(\@Date and \\r\\n are backslash-escaped so Perl passes them to mysql literally)
qx(mysql -u root --password=<password> -D financial -e "LOAD DATA INFILE '$path' IGNORE INTO TABLE financial.eod_data FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\r\\n' IGNORE 1 LINES (Symbol,\@Date,Open,High,Low,Close,Volume)  SET Exch = '$exchange', Date = str_to_date(\@Date, '%d-%b-%Y')   ");

#checks for errors
if ($? == 0) {
#if no errors add a log entry. (ghetto, i know, but $? actually works to catch qx errors)
$dbh->do("insert into financial.eod_data_log (id,filename,date,path,process_date) values (null,'$_','$dl_date','$path','$dt')");
}

} #else { print "Exists: $_\n"; }

}
###END##This is for End of Day Data only
}

sub GetDateTime {
my @lt = localtime();
return strftime("%Y-%m-%d %T", @lt);
}

sub GetDate {
my @lt = localtime();
return strftime("%Y-%m-%d", @lt);
}