Converting from Unix epoch to SAS datetime

I recently had a need to convert from a long value stored as the offset from the Unix epoch (milliseconds) to a SAS datetime value. This is occurring more and more when dealing with NoSQL and other datasets. This wasn’t a very easy thing to find, but it was easy enough to do. I’m sharing here in case anybody else might find it useful.

The code below:

  1. Reads a dataset with the numeric milliseconds value,
  2. Divides by 1000 (SAS is seconds, timestamps are milliseconds),
  3. Uses the INTNX function to add 10 years’ worth of days (SAS datetimes are based on 1960, not 1970). Between 1960 and 1970 there are 3 leap days, so the increment is 365*10+3 = 3,653 days; using “dtyear” with an increment of 10 would only account for 2 of those leap days,
  4. And finally, uses the GMTOFF function to localize the SAS datetime value for the current system (Unix epoch values are always computed from GMT, while SAS datetime values are based on the local offset).

data converted(drop=millis);
set source;
format sas_dt datetime19.;
/* seconds since 1970, shifted 3,653 days back to the 1960 epoch, then localized */
sas_dt = intnx('dtday', millis/1000, 365*10+3, 's') + gmtoff();
run;

And for those of you who are generating content in SAS for other systems to consume, here is the conversion performed in reverse (generating millisecond values from the SAS datetime):

data output(drop=sas_dt);
set source;
format millis best15.;
/* undo the localization, shift forward to the 1970 epoch, then convert to milliseconds */
millis = intnx('dtday', sas_dt - gmtoff(), -(365*10+3), 's') * 1000;
run;

Automatically Clean Old GMail

I really hate the way the Internet accumulates information about us. 30 years from now, I don’t want my kids to be able to pull up some tirade from a comment thread or a very compromising photo. Despite my efforts, I bet they will. This is why I try to automate filtering as much as I can into my local (or private cloud service) storage. Then, every once in a while, you can go through the old stuff and prune it down to only what is useful, or go flat-out scorched earth on entire service families.

One thing I have found tedious is remembering to occasionally prune the old Gmail history. I remove absolutely everything older than two years. I did a couple of quick searches for a script or something, but didn’t get any hits. I finally wrote a script that will do it for me and scheduled it in cron. It took me enough work jiggering around with setting labels, removing labels, and flat-out deleting to finally come up with the working “just copy it to the folder” method that I thought it warranted sharing.

from imapclient import IMAPClient
from datetime import timedelta, date

# Compute the cutoff date (two years ago today)
now =
d = timedelta(days=(-365 * 2))
two_years = now + d

# Gmail IMAP delete
CUTOFF = 'before:' + two_years.isoformat()
HOST = ''   # Gmail's IMAP host
USERNAME = ''        # fill in your account
PASSWORD = ''        # fill in your password

server = IMAPClient(HOST, use_uid=True, ssl=True)
server.login(USERNAME, PASSWORD)
server.select_folder('[Gmail]/All Mail')
messages =['X-GM-RAW ' + CUTOFF])
if len(messages) > 0:
    server.copy(messages, u'[Gmail]/Trash')

print '%d messages deleted' % (len(messages),)

I don’t save e-mail indefinitely outside of work, but the script could probably be altered a little to save out the contents if you wanted to archive them somewhere other than the Google farm. It also plays nice in that it just puts the messages into the Trash folder (to be automatically cleaned by the service after another 30 days), but it could be further changed to really delete things for good without the Trash step.

Default PostgreSQL String Sort Order Bites Me in the SAS

During the development of an internal mixed PHP and SAS application, I’ve gone through some interesting transitions. Notably:

  • MySQL -> PostgreSQL
  • LATIN1 -> UTF-8
  • SAS 9.1.3 -> SAS 9.2

Most of these transitions went pretty smoothly. However, one bug got introduced somewhere along the way and I could never seem to figure out what caused it.

For some reason, when downloading a list of features to get the real ID numbers and then matching by name, this wouldn’t work:

PROC SORT data=PG.features(RENAME=(id=feature_id name=feature_name)) out=features;
LABEL feature_id="feature_id";
BY feature_name;
WHERE release_id=&release_id;
RUN;

DATA folders(KEEP=feature_id name);
MERGE folders(IN=in1) features(IN=in2);
BY feature_name;
IF in1;
IF in2;
RUN;

I’d get bizarre errors out of SAS that the list wasn’t sorted. Whenever it occurred, I’d inspect the resultant (and intermediate) datasets, and everything seemed sorted just fine. Instead, I had to have something like this:

/* Downloading the features for this release */
PROC SORT data=PG.features(RENAME=(id=feature_id name=feature_name)) out=features;
LABEL feature_id="feature_id";
BY feature_name;
WHERE release_id=&release_id;
RUN;

/* 9.2 workaround? For some reason if I sort on a RENAME or don’t, then try to */
/* MERGE after RENAME on a sorted field, it won’t work. */
PROC SORT data=features;
BY feature_name;
RUN;

PROC SQL;
CREATE INDEX feature_name ON features(feature_name);
QUIT;

DATA folders(KEEP=feature_id name);
MERGE folders(IN=in1) features(IN=in2);
BY feature_name;
IF in1;
IF in2;
RUN;

You can clearly see my frustration (and the blaming of my employer’s own software over Postgres) in the comments. In addition, I probably overkilled the solution by re-sorting and then also creating an index, but it did make the problem go away.

Eventually, I got another bug report from a user with the phrasing, "Incorrect sorting within letter group in Features table". That led me to an entry in the Postgres wiki about locale and collation.

I discovered that SAS sorts strings based on the rules of the “C” locale collation. Even though I had read some documentation attributing the default LC_COLLATE setting as “C”, in fact, for my database, it was set to “en_US.UTF-8”. What this basically means is that when sorting the following list:

  • GLMMOD : Tests
  • GLM : ODS Graphics
  • GLM : Checklist
  • GLMMOD : Checklist

You’ll get:

  1. GLM : Checklist
  2. GLMMOD : Checklist
  3. GLMMOD : Tests
  4. GLM : ODS Graphics

This seems incorrect at first, until you realize that the en_US locale sorts while disregarding whitespace and special characters. SAS and “C” locale collation, however, sort like this:

  1. GLM : Checklist
  2. GLM : ODS Graphics
  3. GLMMOD : Checklist
  4. GLMMOD : Tests
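
If you want to see both orderings for yourself, something along these lines works from psql. This is just a minimal sketch: it assumes PostgreSQL 9.1 or later for the per-expression COLLATE clause, a database whose default collation is en_US.UTF-8, and a throwaway sort_demo table holding the sample values above.

-- Throwaway table with the sample values from above
CREATE TEMP TABLE sort_demo (name text);
INSERT INTO sort_demo VALUES
  ('GLMMOD : Tests'), ('GLM : ODS Graphics'),
  ('GLM : Checklist'), ('GLMMOD : Checklist');

-- Default (en_US.UTF-8) ordering: whitespace and ':' are largely disregarded
SELECT name FROM sort_demo ORDER BY name;

-- "C" locale ordering: plain byte-wise comparison, which is what SAS expects
SELECT name FROM sort_demo ORDER BY name COLLATE "C";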

Because PROC SORT and other SAS mechanisms issue and rely on native database commands for some operations, this behavior can produce orderings that are undesirable for SAS. SAS actually performed very admirably here: it delegated the sorting to the database, set the appropriate flags on the dataset, but then still caught the match-merge problem at runtime!

Long story short, when using PostgreSQL with SAS, it’s probably a good idea to make sure the database is created with the correct setting for LC_COLLATE. If it is not, you may end up with crazy gyrations like mine in your code. Luckily for me, it’s a fixable scenario: the database only needs to be dumped, then restored after it has been recreated with the desired collation.
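
Checking, and then fixing, the setting might look something like this. The database name below is just an example, and the CREATE DATABASE options assume PostgreSQL 8.4 or later, where collation can be set per database; dump the data first and restore it afterwards.

-- What collation is the current database actually using?
SHOW lc_collate;

-- Recreate the database with "C" collation (after dumping the old one)
CREATE DATABASE myapp
  WITH TEMPLATE = template0
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'
       LC_CTYPE = 'C';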

Burning a Blu-ray in Linux

I’ve had a blu-ray burner in my Linux system for quite a while. Since about Fedora 8, I’ve been using commands to burn backups onto single-layer BD-RE media. I gleaned those commands out of a posting about dvd+/-rw tools (google cache). Here are the basics…

Ad-Hoc Burning

growisofs -Z /dev/dvd -R -J /path/to/files

– later –

growisofs -M /dev/dvd -R -J /more/files

– to finalize –

growisofs -M /dev/dvd=/dev/zero

Writing an ISO

growisofs -dvd-compat -Z /dev/dvd=/path/image.iso

Erase the Disk

growisofs -Z /dev/dvd=/dev/zero


dvd+rw-format -ssa=1G /dev/dvd

I’ve recently been trying to do some new things, and I thought I would post that, as of Fedora 11, Brasero can recognize and write files to my BD-RE media and can also erase the disk to use it again. K3b is still at 1.0.5 (not a KDE 4 compatible version) on Fedora; it does not recognize the disk correctly for type and capacity, nor does it allow burning.

My recent searches pull up the same results from 2007 and 2008 as before, where people were unsure how well any of this worked. This is the current state of my world, though.

*UPDATE 2010-06-25*

It’s been awhile since I’ve tried this, but now using Fedora 13, I am able to use K3B to burn single-layer BD-RE media at 2.0x speed. I have a Sony BDRW BWU-200S.

[cuppett@home ~]$ k3b --version
Qt: 4.6.2
KDE Development Platform: 4.4.4 (KDE 4.4.4)
K3b: 1.92.0
[cuppett@home ~]$ rpm -q dvd+rw-tools

Providing SOAP (non-REST) web services with CakePHP

I recently had a need to support a complex SOAP web service from CakePHP. Cake provides some built-in support for REST-based web services; however, this situation required more. This post should show how to set this up in your own projects and still utilize all your normal controller and model goodness without too much screwing around.

Please see this attachment for the source code described in this article.

The method I will outline here requires the php-soap module.

First, the WSDL. For my project, I started with a WSDL created in another tool. My WSDL specifies a slightly different object set than my CakePHP application. I’m sure with PHP5 and some finessing of the Model classes, you could probably use the same set; however, it was easy enough to just create some really vanilla objects to house the transport objects and use those to communicate with the webservice. Both the WSDL and the receiving controller are present in the attachment.

What you will notice is that the *DTO objects defined in the controller file mirror the structure of the objects in the WSDL, and the methods are also represented in the controller. I put them in the controller file because it wasn’t really obvious to me where in Cake’s structure “outside code” should really go. I have a separate file I pull in further up the class hierarchy, but that’s about as non-conventional as I want to get. Also, this controller is dedicated to just handling webservice requests and I only need these *DTO objects in that case, so locality wins and they live here. No real engineering genius here; their structure mimics what is defined in the WSDL file.

The real magic is in the controller. The controller’s remote() method is what handles the POST from the web via the port binding in the WSDL file. The remote() function sets up some of the basic stuff for SoapServer and is easily identified in the PHP manual. It’s even pretty easy to deduce that we’re going to need to use SoapServer->setClass() somewhere and plug in the name of our controller. However, there was one tidbit in the comments section of the manual regarding SoapServer->setObject(). It wasn’t documented (at the time), but after experimenting and looking at the PHP source, it does exactly what we need here: it sets the handler to an instantiated (aka existing) object instead of trying to spawn a new one. Because we are already inside the CakePHP framework and running the remote() function, we already have the variables we want from beforeFilter(), we have our models loaded up, and we may even have a user context from mod_auth_something. Perfect!!! So we tell SoapServer to use our instantiated controller. Once the *DTO classes are mapped and SoapServer is configured, it’s as simple as having it handle STDIN to tickle the rest of the methods in your controller with the parameters populated. Two more tricks/problems remain: debug level & autoRender.

First, debug level. There’s bound to be a way around it; however, since I test with a web service client, when I do have a problem I have to debug with lots of $this->log() calls. Turning up debugging to 1 or 2 is problematic because then CakePHP doesn’t spit back properly formed XML to the web service client, and usually the client takes a SoapFault when that happens. I stick to a debug level of 0 during development and deployment with respect to the web service stuff.

Second, autoRender. Because SoapServer does the actual outputting of the XML response to the client, I set the layout in the controller to Ajax and also explicitly call exit() at the end of the remote() method. This ensures that CakePHP doesn’t send back a “Missing View”, a half-rendered $layout, or any other kind of automatic goodies.

I hope this article is helpful for anybody who might want/need to integrate a more elegant/esoteric webservice into their CakePHP architectures. I’m sure there are probably cleaner ways to put this into custom View classes, utilize Components, etc… however, this was a straightforward approach I found has been working really well for one of my applications.

HowTo: PostgreSQL – Adding more values to an ENUM type

I recently had trouble manipulating an ENUM field I had created in PostgreSQL.  I couldn’t find any suggestions or samples easily on Google or in the manual, but I was able to get it to work, so I’m posting it here.  The basic premise is that there is an ENUM field type already created, I need more possible values, and I have to preserve the existing values to keep code working.

Initial creation of the type and table:

CREATE TYPE var_type AS ENUM('text', 'number', 'date', 'boolean');

CREATE TABLE custom_fields (
id bigserial PRIMARY KEY,
name varchar(50) NOT NULL,
pdf_type var_type NOT NULL
);
Running with this table for some time, invariably new requirements show up (in this case, another possible value for the type) and there’s now a migration consideration.  As long as you are not using the table column as a reference in a foreign key, the following should work to preserve the data while dropping and re-creating the type.

The following creates a new column to hold the original text value:

ALTER TABLE custom_fields ADD COLUMN type_text varchar(15);
UPDATE custom_fields SET type_text = pdf_type::text;

We then need to drop the existing type and re-create it with the new values we want.  CASCADE automatically drops the columns that depend on the type (which is why we stashed the text copy first):

DROP TYPE var_type CASCADE;

CREATE TYPE var_type AS ENUM('text', 'number', 'date', 'boolean', 'list');

This last part was what I couldn’t figure out without thinking a little more.  When you add the column back, you have to cast the varchar column back into the ENUM type.  I had tried a variety of concoctions here before getting this to work:

ALTER TABLE custom_fields ADD COLUMN pdf_type var_type;
UPDATE custom_fields SET pdf_type = type_text::var_type;
ALTER TABLE custom_fields ALTER pdf_type SET NOT NULL;
ALTER TABLE custom_fields DROP COLUMN type_text;
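
A quick sanity check after the migration might look something like this; the ‘Regions’ row is just a made-up example to prove the new value is accepted:

-- Existing rows should have kept their values through the round trip
SELECT pdf_type, count(*) FROM custom_fields GROUP BY pdf_type;

-- And the new value should now be usable (made-up example row)
INSERT INTO custom_fields (name, pdf_type) VALUES ('Regions', 'list');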

Web log anonymizer

I recently had a need to anonymize the IP addresses in an Apache access log.  It seemed like a simple task; however, there weren’t any really good code samples out there for it.  It’s a pretty simple exercise, but given there wasn’t anything readily available, I figured I’d post it here so others might make use of it.  The only real requirements were to process large logs reasonably fast and to maintain the same IP address mappings across multiple entries in the logs, in order to preserve the actual traffic data as it relates to sessions.  With a little more work, I’m sure it could select random IP addresses in the same geo as the original one, whereas this will probably evenly distribute the IPs across the globe (skewed by actual ownership of the ranges).

So here are the few lines of Perl that got the job done:

if ($#ARGV + 1 < 1) {
        print "\n\tUsage:\n";
        print "\t------\n\n";
        print "\tperl $0 file1 [file2 [file3 [...]]]\n\n";
        die "Please specify at least one file to use this script.\n\n";
}

my %forward = ();
my %reverse = ();

foreach (@ARGV) {
        open(ORIG, $_)
          or die "Failed to open input file for reading.";
        open(ANON, "+>", $_ . ".anon")
          or die "Failed to open destination file for writing.";
        while (<ORIG>) {
                # Replace the first IP address found on the line
                if (/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/) {
                        if (!$forward{$1}) {
                                # Find an unused random IP for this address
                                my $newIp = getNewIp();
                                while ($reverse{$newIp}) {
                                        $newIp = getNewIp();
                                }
                                print "New mapping created: $1 -> $newIp\n";
                                $forward{$1} = $newIp;
                                $reverse{$newIp} = $1;
                        }
                        my $repl = $forward{$1};
                        s/\Q$1\E/$repl/;
                }
                print ANON $_;
        }
        close(ORIG);
        close(ANON);
}

exit 0;

# Returns a random dotted-quad IP address
sub getNewIp {
        return int(rand(256)) . "." . int(rand(256)) . "." . int(rand(256)) . "." . int(rand(256));
}
It is fairly straightforward.  You invoke the Perl script with one or more arguments, and every argument should be a path to an access log.  For each file, a new file with the same name and “.anon” appended gets created.  Across all those files, the script maintains an internal hash of the IPs it has mapped to new, random IP addresses and will re-use those mappings as they are encountered.  It spits out a little message whenever a mapping is created, so you could do some counts using ‘wc’ or something similar to see how many you had… or you could make it output a count at the end; it’s pretty simple to do either.

So that’s it, easy web log anonymizing via random IP address remapping.