The boring rants of a lazy nerd

Wednesday, May 24, 2006

Work - Phone

All officers (commissioned, non-commissioned and academic) in my unit have been issued a cellular slavery device, which we must carry and answer at all times. Stood half the day in the sun on a godforsaken base waiting in line. Turns out their records indicate I'm not an NCO. Flashed the shiny new NCO card I got especially for this occasion: no go. Faxed(!) a hard copy(!!) of my record over: all OK(!!!). World-class authentication, that is. Seems our systems don't integrate too well with the civilian cell service provider's. At least free calls with colleagues should cut my phone bill in half.

Sunday, May 07, 2006

Me - Haircut

Today I went to a barber and had him, besides the haircut, trim my beard, which had been growing untouched since December (when I became an NCO and was allowed to grow a beard). I was unhappy with the unkempt look (think al-Qaeda), but I was afraid of change. I think my fear was justified. I hate the result. My consolation is that it seems to have a healthy growth rate, so I'll experiment more. People's reactions are the strangest thing.

Saturday, May 06, 2006

FFDB - Quality of Writing

Have you ever read something that was obviously spell-checked but not actually proofread? The way to tell is that all the remaining mistakes are the kind that would fool a dictionary-based spelling checker: either they're actual words, just not the right ones, or they're words that would not appear in a dictionary even when spelled correctly (like proper names), so a person could click "ignore" and not notice the typo.

Well, fan fiction that regularly misspells the Hogwarts houses and uses non-words like "comon" (as in, "comon dude, let's go") is painful to read, even if it has actual original plot twists (in this fandom?). So I want the FFDB to have a Quality of Writing score for stories. I just need to figure out the scale and how to make it difficult to abuse.

So far I have thought of these possible scores:

  • Ready for the printing press
  • Multiple editors, multiple passes
  • Proofread by a literate human being
  • Spell-checked
  • Recognizable English
  • AOL chatroom transcript

Please contribute!

Wednesday, May 03, 2006

FFDB/code - design

So, after sitting there for two¹ years(!) waiting for Big Design Up Front, I've decided to just code something, and now I have screen-scraped three (well, two and a half) sites' indexes. This proves instant gratification is king in combating procrastination. Problem is, the code is right horrible. That was to be expected from a first project in a programming language so different from what I'm used to (a C, C++, C# person doing Python), but still, I thought I was a better programmer than that. And it's not my perfectionism talking; the code really is very bad.

I didn't use "new style" classes because the tutorial I used was too old. Also, I listened to bad advice about getters and setters from one of the "Java is Evil" people, which proved to be a mistake. Good design should have a bit of Bondage and Discipline, because I do want to force the Story object's properties to be recalculated when you add a Chapter object (e.g. word count, rating and even the story's list of authors).
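To make that concrete, here is a minimal sketch of what I have in mind, assuming a modern Python where every class is "new style". The attribute names, and the idea that a higher rating number means a more restrictive rating, are made up for illustration. The derived values are computed on access rather than cached, which is the lazy way of guaranteeing they are never stale.

```python
class Chapter(object):
    """One chapter; all attribute names here are illustrative."""

    def __init__(self, title, text, rating, authors):
        self.title = title
        self.text = text
        self.rating = rating          # assumed: higher number = more restrictive
        self.authors = list(authors)


class Story(object):
    """Keeps chapters private so derived values can never go stale."""

    def __init__(self, title):
        self.title = title
        self._chapters = []

    def add_chapter(self, chapter):
        # The only way in: callers cannot append behind the object's back.
        self._chapters.append(chapter)

    @property
    def chapters(self):
        # Hand out a read-only copy rather than the internal list.
        return tuple(self._chapters)

    @property
    def word_count(self):
        return sum(len(c.text.split()) for c in self._chapters)

    @property
    def rating(self):
        # A story is rated as high as its most restrictive chapter.
        return max((c.rating for c in self._chapters), default=None)

    @property
    def authors(self):
        # Union of chapter authors, in order of first appearance.
        seen = []
        for c in self._chapters:
            for a in c.authors:
                if a not in seen:
                    seen.append(a)
        return seen
```

If recomputing on every access turns out to be too slow, the same interface would still allow caching the values inside add_chapter without touching any calling code.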

I have not used any tools or frameworks other than BeautifulSoup (which is a bit slow but very easy to use), so everything is very much ad hoc. I have not used my spec, because it was too feature-heavy. I didn't even use httplib2 (in part because I'm unhappy with its caching, and in part because of Not Invented Here syndrome), which was good, because I learned something about how wildly underutilized HTTP is by cheap PHP CMSs ("Last-Modified header? What's that? Gzip compression? We don't need no stinking compression, we like paying for five² times more bandwidth than we need!").
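For the record, being a polite client takes very little. This is only a sketch using the standard library's urllib (not my actual scraper code); the function names are invented for the example. It asks for a gzipped body and, if we saved a Last-Modified value from an earlier visit, asks the server to skip the body when nothing has changed.

```python
import gzip
import urllib.request


def build_polite_request(url, last_modified=None):
    """Ask for a compressed body, and only if it changed since last time."""
    request = urllib.request.Request(url)
    request.add_header("Accept-Encoding", "gzip")
    if last_modified:
        # Value is the Last-Modified header saved from the previous response.
        request.add_header("If-Modified-Since", last_modified)
    return request


def decode_body(raw_bytes, content_encoding):
    """Undo gzip if the server actually used it."""
    if content_encoding == "gzip":
        return gzip.decompress(raw_bytes)
    return raw_bytes
```

A server that plays along answers 304 Not Modified with no body, which urllib.request.urlopen reports as an HTTPError with code 304, so the caller has to catch that and reuse its saved copy. A server that ignores both headers just sends the full uncompressed page, and nothing breaks.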

Fiction archive systems in the wild have vastly varying notions of "useful metadata". SQ, for example, is particularly bad, criminally bad even, considering the content's quality. Not only does the site fail to publish any useful metadata (ship, category and word count I suppose I can live without, and even chapter post dates, never mind per-chapter review links, but don't you agree it would be nice to know whether a story is complete or not?), it also continues to blindly embed whole HTML files in the template, disregarding the encoding mismatch and how wrong it is to nest html elements. It's a testament to the quality of modern tag soup parsers that they can read it with any degree of success, with only the occasional artifact, especially when non-ASCII is used: curly single and double quotation marks, smilies, the copyright sign or anything typed on a Mac.
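Those artifacts are easy to reproduce. The sketch below assumes one particular mismatch, a UTF-8 file dropped into a page that gets read as windows-1252; I have not verified that this is the exact combination any given archive suffers from, but it is the classic one.

```python
# A right single quotation mark, as typed by a word processor.
original = "The Minder\u2019s Visit"

# Assumption: the author's file is UTF-8, but the page is read as windows-1252.
as_served = original.encode("utf-8")
as_displayed = as_served.decode("cp1252")

print(as_displayed)  # The Minderâ€™s Visit

# The damage is reversible as long as nobody "fixed" the bytes in between.
repaired = as_displayed.encode("cp1252").decode("utf-8")
```

One curly apostrophe becomes three junk characters, which is exactly the kind of thing that later makes titles from different archives refuse to match.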

I don't want to repeat myself, but it has been a few years and things haven't changed much, so I'll rant again about Word-produced HTML. I still remember how, at 16, before I knew anything about regular expressions or Perl, I coded a very naive "pattern search and replace engine" in C and used it to discover that overhead of hundreds of percent was not uncommon. Bandwidth cost on both ends, server busy time, and frustrating client wait time (on dialup! Ubiquitous broadband was but a dream back then) were all apparently OK when weighed against the difficulty of using tidy, or any other tool (Yael had something, IIRC).

But I got wildly off topic there. The topic is a complete rewrite (more or less) of my business objects and persistence layer. One side is technical, the other is business analysis.

On the technical side, I currently envision the application as a persistent process in some kind of app server, holding the current data in RAM and persisting saves to disk. Querying must always be done from RAM, while updates must be atomic and resilient to power outages etc. For performance, it may be necessary to push the resulting HTML/XML out to be served as static files by Apache, like a cache. When an object is updated, its static file is deleted. When Apache wants to serve a file and it's not there, a 404 hook asks the app server to produce it (or to confirm it doesn't exist in the DB), so XML is produced on demand, only the first time, and Apache can do what it does best. Data can be represented as HTML, an HTML form for editing, an HTML search result table, a per-story/author Atom feed, or an Atom search result feed.
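The delete-on-update, render-on-404 idea might look roughly like this. It is a sketch, not a design: StaticCache, handle_miss and the render callback are names I made up, the web server wiring is left out entirely, and keys are assumed to be plain names without path separators.

```python
import os
import tempfile


class StaticCache(object):
    """Rendered pages live on disk; a missing file means "not rendered yet"."""

    def __init__(self, root, render):
        self.root = root
        self.render = render      # callable: key -> markup string, or None

    def _path(self, key):
        return os.path.join(self.root, key + ".html")

    def invalidate(self, key):
        # Called whenever the underlying object is updated.
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            pass

    def handle_miss(self, key):
        """What the 404 hook would call; returns the path or None."""
        markup = self.render(key)
        if markup is None:
            return None            # really does not exist: a true 404
        # Write to a temp file, then rename, so the web server never
        # sees a half-written page. (mkstemp creates the file readable
        # by its owner only; a real setup would adjust permissions.)
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(markup)
        os.replace(tmp, self._path(key))
        return self._path(key)
```

The rename trick covers the "never serve a torn file" part; surviving a power outage for the real data would additionally need an fsync before the rename, which this sketch skips.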

For persistence I need:

  • Versioning, with efficient storage of diffs (and occasional extraction of arbitrary versions).
  • Locality, i.e. an object should sit in consecutive pages/blocks, not scattered records in lots of tables. In my experience, this is crucial for getting good performance.
  • Indexing of specified columns of the "current" version of an object, with different indexes depending on data, e.g. bitmap for status and something else for text search.
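As a first stab at the versioning requirement only (locality and indexing are a different story), here is a sketch built on the standard library's difflib. The delta format, a list of "copy" and "insert" operations, and all the names are my own invention for the example, not an existing format.

```python
import difflib


def make_delta(old_lines, new_lines):
    """Describe new_lines as copies from old_lines plus literal inserts."""
    delta = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))          # reuse old_lines[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("insert", new_lines[j1:j2]))
        # "delete" needs no entry: the old lines are simply not copied.
    return delta


def apply_delta(old_lines, delta):
    """Rebuild the newer version from the older one and a delta."""
    result = []
    for op in delta:
        if op[0] == "copy":
            result.extend(old_lines[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


class VersionedText(object):
    """Stores the first version in full and every later one as a delta."""

    def __init__(self, lines):
        self._base = list(lines)
        self._deltas = []
        self._current = list(lines)   # kept whole so reads stay cheap

    def commit(self, lines):
        self._deltas.append(make_delta(self._current, list(lines)))
        self._current = list(lines)

    def version(self, n):
        """Rebuild version n (0 is the original) by replaying deltas."""
        lines = self._base
        for delta in self._deltas[:n]:
            lines = apply_delta(lines, delta)
        return list(lines)
```

Forward deltas like these make old versions cheap and recent ones expensive to rebuild, which is why the current version is also kept whole. Storing reverse deltas from the newest version backwards would be the obvious alternative to explore.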

I have ideas and I'll explore alternatives. I am confident in my ability to solve this.

On the analysis side, things look more gloomy. I need to figure out how to best automatically merge these ToCs:

SQ | FFA | PS
Prologue - In from the lake | (none) | Prologue - In From the Lake
Chapter 1 - Ringing Harry | 1: Ringing Harry | 1: Ringing Harry
Chapter 2 The Minder's Visit | 2: The Minder's Visit | 2: The Minder’s Visit
Chapter 3 Letters and Liberty | 3: Letters and Liberty | 3: Letters and Liberty
Chapter 4 Ride and Remember | 4: Ride and remember | 4: Ride & Remember
Chapter 5 - Dreams, Clouds and Darkness | 5: Dreams, Clouds and Darkness | 5: Dreams, Clouds and Darkness
Chapter 6 - The day before and the evening afterwards | 6: The Day Before & The Evening After | 6: The Day Before & The Evening After
Chapter 7 - The Big One | 7: The Big One | 7: The Big One
Chapter 8 - Meet the Tutor | 8: Meet the Tutor | 8: Meet the Tutor
Chapter 9 - Letters from Elsewhere | 9: Letters from Elsewhere | 9: Letters from Elsewhere
Chapter 10: Lessons | 10: Lessons | 10: Lessons
Chapter 11 - Arrangements | 11: Arrangements | 11: Arrangements
Chapter 12 - Changing of the Guard | 12: The Changing of the Guard | 12: The changing of the guard
Chapter 13 - dinner at Abelard's | 13: Dinner with Abelard | 13: Dinner with Abelard
Chapter 14 - Motoring with Moony | 14: Motoring with Moony | 14: Motoring with Moony
Chapter 15 - Dragons, Dreams, Discussions | 15: Dragons, Dreams and Discussions | 15: Dragons, Dreams and Discussions
Chapter 16 – More letters | 16: More Letters | 16: More letters
Chapter 17 - Dragons Made Small | 17: Dragons Made Small | 17: Dragons Made Small
Chapter 18 - Birthday at the Burrow | 18: Birthday at the Burrow | 18: Birthday at the Burrow
Chapter 19 - Cooking with Harry | 19: Cooking with Harry | 19: Cooking With Harry
Chapter 20 - the Daze of August | 20: Daze of August | 20: Daze of August
(none) | 21: The End of August | 21: The End of August

SQ: Please notice inconsistent use of spaces, colons, hyphens and dashes to separate chapter number from chapter title, making automatic processing difficult.

FFA: Please notice the lack of a prologue. Actually, it's hiding on the same page as the first chapter, but there's no way of knowing that. Also, the numbering is generated by the site template and is not part of the actual data.

PS: Here the numbers are part of the titles.
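The separator mess, at least, is tractable. Below is a rough sketch of a heading splitter that copes with the variants listed above; the regular expression and the function name are mine, and the approach assumes that any entry starting with a number (with or without "Chapter") is a numbered chapter.

```python
import re

# Optional "Chapter", a number, then any mix of spaces, colons, hyphens
# and dashes, then the title.
HEADING = re.compile(
    r"^\s*(?:chapter\s+)?(\d+)[\s:\-\u2013\u2014]*(.*?)\s*$",
    re.IGNORECASE,
)


def split_heading(raw):
    """Return (number, title); number is None for unnumbered entries."""
    match = HEADING.match(raw)
    if match:
        return int(match.group(1)), match.group(2)
    # No leading number (a prologue, say): hand the text back untouched.
    return None, raw.strip()


print(split_heading("Chapter 2 The Minder's Visit"))
print(split_heading("Chapter 16 \u2013 More letters"))
print(split_heading("Prologue - In from the lake"))
```

The obvious weakness is a title that itself begins with a number when the archive supplies no chapter number; that case would be misread, and I don't see how to tell the two apart without knowing which archive the line came from.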

The prologue and first chapter are sometimes glued together, and sometimes not. deleted the fic before the final chapter could be posted. Two intermission chapters appear at and nowhere else (and it is important that they be read before the chapters that appear after them!). The titles, provided they are somehow extracted, don't match between archives, not only due to punctuation (which confuses computers) but also due to editing. So, the question is: what should the FFDB ToC for TLoS look like?
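It doesn't answer the question, but the title matching part can at least be approximated. This sketch folds away case and punctuation and then uses difflib's similarity ratio; the function names and the 0.6 threshold are arbitrary choices of mine that would need tuning against real data.

```python
import difflib
import re


def title_key(title):
    """Fold case, '&' vs 'and', curly quotes and punctuation away."""
    text = title.lower().replace("&", " and ").replace("\u2019", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def same_chapter(title_a, title_b, threshold=0.6):
    """Guess whether two archive titles name the same chapter."""
    ratio = difflib.SequenceMatcher(
        None, title_key(title_a), title_key(title_b)
    ).ratio()
    return ratio >= threshold
```

With this, "Ride and remember" and "Ride & Remember" reduce to the same key, and an edited title like "dinner at Abelard's" still lands close enough to "Dinner with Abelard" to be paired. Glued, missing and archive-exclusive chapters remain a human problem.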

1 Cringe-worthy musings: Experiment, More on Fics, a few clarifications, progress, Tech research, sample data

2 I have the data to prove that savings upward of 80% are not uncommon, and 60% is basically guaranteed. The site I linked to above will test any URL you type in; see for yourself how much faster a fic page would load had it been compressed on the wire. There is many an article on the subject, and yet most of the public remains unaware.

3 You may have deleted it, but Google remembers!

Monday, May 01, 2006

Linguistics - Russians

It appears the Russian language makes no distinction between "security" and "safety". Does that mean they are very paranoid and attribute all safety-related incidents to sabotage, or do they understand the word to mean "protection from harm, caused by negligence or enemy action"? Think of the implications! To me, "safety officer" paints the picture of an anal-retentive civil engineer, and "security officer" that of a counter-intelligence style paranoid ex-field agent. But it could also mean that the KGB's purpose would be more ambiguous…

About Me

GCS d- s-: a-- C++$ UL++ P+++ L+++ E--- W+++ N o? K? w++$ !O !M !V PS-(+) PE Y+ PGP+(-) t--@ 5++(+++) !X R-- tv-- b+>++ DI+++ D+ G e h! r* y--(-)>+++