Welcome to queirozf.com

Recent Posts


The world of code and software has well defined rules and few exceptions.

When, however, we use code to write applications and information systems, we're venturing into the world of humans (who are, after all, the ones who derive value from our applications), which has few, ill-defined rules and a lot of exceptions.

The software world needs fixed rules and few exceptions. The human world, whose interactions define what code needs to be written, is infinitely malleable and any rules have many exceptions.

If you've ever tried to model the day-to-day operations of a modern company into a computer system that's supposed to automate and optimize those processes, you know that this is neither easy nor clean. It's decidedly unpredictable and messy.

Process modeling is a world in and of itself. I'll not dwell too much on its ins and outs; I'll focus on practical advice to make your experience in software development run more smoothly.

Suppose you have modeled everything you need and your system has gone into production. People start using it: creating, reading, updating and deleting data (this is what the CRUD acronym stands for).

But every now and then (or quite frequently, if your system is used by a lot of people), strange errors start popping up: users use your system in ways you hadn't thought of, funny data appears in the database, and a decent amount of your time is spent fixing broken functionality rather than implementing new features.

Every now and then, bad data causes errors at runtime which force you to work reactively (fixing bugs) rather than proactively (thinking about and implementing new features).

Lately I've spotted a recurring pattern when I need to fix something or find out why something that used to work now doesn't. It's generally due to bad data in the database.

And the fix to bad data problems is nearly always implementing more data validation when users enter data into your system.

What is meant here by bad data?

It's data that may be syntactically correct (i.e. there's a date where your system expects a date, or an integer where it expects an integer) but semantically wrong.

In other words, data that is generally correct, but not for your system.

  • Example: Car Inventory System

    Suppose you have written a car inventory system for a car dealership in your neighbourhood to keep track of the cars they have in stock.

    Such a system would have a Car entity, which would probably have attributes like manufacture_date (the date the car was manufactured) and entry_date (the date the car arrived at the dealership to be sold).

    This system would probably have a view called Insert new Car that prompts the user to enter a new car into the system:

    [figure: the simplest version of the car insert form]

    Now you have correctly added validators to ensure that users only enter valid dates in each of the date inputs and valid strings (not numbers) in each of the text inputs:

    [figure: the form with type validation added]

    Type validation helps, but it is nowhere near enough. Consider what could happen when users start using your system:

    [figure: mistakes users could still make with only type validation in place]

    There's much you can do here; your Web Framework probably provides some validators for you out of the box:

    [figure: the improved form, with autocomplete inputs, drop-down lists and semantic validation]

    Note some changes:

  • Autocomplete inputs instead of regular text inputs: this helps prevent spelling mistakes as well as slight variations in names (which cause problems in databases).

  • Drop Down lists instead of free-text inputs: this helps avoid spelling mistakes and forces users to choose from a set of preset (unmistakably valid) values.
  • Semantic Validators that don't just validate whether data is of the right type, but validate that data is semantically correct given your system's context. In the example, both dates are valid but it doesn't make sense to input an entry_date that is prior to the car's manufacture_date! (A car must be manufactured before it arrives at the dealership!) See the sketch below for what such a validator might look like.
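
For illustration, here's what such a semantic validator could look like. This is a minimal sketch assuming a Rails/ActiveRecord model (the Car class and column names follow the example above; the idea itself is framework-agnostic):

class Car < ActiveRecord::Base
  # type/presence validation: necessary, but nowhere near enough
  validates :manufacture_date, :entry_date, :presence => true

  # semantic validation: a car must be manufactured before it
  # arrives at the dealership
  validate :entry_date_not_before_manufacture_date

  private

  def entry_date_not_before_manufacture_date
    return if manufacture_date.nil? || entry_date.nil?
    if entry_date < manufacture_date
      errors.add(:entry_date, "can't be earlier than the manufacture date")
    end
  end
end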

Smite these inconsistencies before they enter your system and grow into something worse! Think runtime errors!

Always Remember

Don't give any more freedom to users than they absolutely need. Given the slightest opportunity, users will make mistakes and (mostly unwittingly) introduce bad data into your systems!

Bad data will cause you headaches in the worst possible moments.

So validate data as much as you can before it enters your system!

Afterthought

Think of data validation in web applications like application-level asserts in your code.

I confess I used to be prejudiced against using asserts in my code (they're not very elegant) but I catch myself using them every now and then, in particular when I'm coding something mathy or otherwise related to complex calculations.

As someone whose opinion I respect told me, they're useful for documentation purposes, as they make clear what expectations you have wherever they appear, helping anyone who may need to study or verify/revise your code later on (or even yourself in a couple of months).

That's aside from the obvious advantage of having code break earlier rather than later (you spot inconsistencies before they cause a runtime error and blow up in the face of your users) - which is an all-around good practice and hard to argue against.
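
To make this concrete: Ruby has no built-in assert outside of test frameworks, so an application-level assert is usually just a hand-rolled guard. A minimal sketch (the monthly_payment method is a made-up example):

# a tiny assert helper: fail fast when an expectation is violated
def assert(condition, message = "assertion failed")
  raise message unless condition
end

def monthly_payment(principal, rate, months)
  # these asserts double as documentation of the method's expectations
  assert principal > 0, "principal must be positive"
  assert rate.between?(0.0, 1.0), "rate must be between 0 and 1"
  assert months > 0, "months must be a positive number"
  # ... the actual calculation ...
end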


When defining methods (and their parameters) in Ruby, there are at least two ways of doing it.

You can:

  • Define all arguments (and possible default values) one by one. For example:

    def my_method(foo,bar,baz)
        # method body
    end
    
  • Define all arguments as a parameter hash:

    def my_method(hsh={})
        # use hsh[:foo]
        # use hsh[:bar]
        # use hsh[:baz]
    end
    

    Using one or the other will not automatically make your project better or worse, but it may have implications for how coupled client code will be to your code.
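
One detail not shown above: inside a method that takes a parameter hash, you'd typically merge the caller's hash over a hash of defaults, so that omitted keys fall back to sensible values. A minimal sketch (the keys are just placeholders):

def my_method(hsh = {})
  # caller-supplied values override the defaults
  opts = { :foo => 1, :bar => 'default' }.merge(hsh)
  # use opts[:foo] and opts[:bar] from here on
end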

Coupling

Coupling can be thought of as the level of explicit dependence one piece of code (function, class, module, etc) has on another; in general, we consider dependency upon external code (which you have no control over) to be an especially important kind of coupling.

Code you've written that uses external code (external libraries, third-party code, etc) is said to be coupled with it if changes to it would cause your code to break.

In the same vein, we can talk about code that is tightly coupled (i.e. very dependent upon external code) and also about code that is loosely coupled (not very dependent upon external code).

It is generally well understood that instantiating external classes in your code and calling their methods will lead to your code being dependent upon (i.e. coupled with) those classes:

require 'external-project'
def my_method(foo,bar)
    var = ExternalClass.new
    result = var.do_task_1
    another_result = result.process_further
end

If you look at the previous snippet, you can see that your code is coupled because of the following:

  • it knows the external-project module by name.
  • it knows class ExternalClass by name.
  • it knows that the constructor for class ExternalClass takes no parameters.
  • it knows that class ExternalClass has an instance method called do_task_1 and that it takes no parameters.
  • it knows that instance method do_task_1 returns a value.
  • it knows that that value responds to a method called process_further, which returns another value.

Each of the bullet points reflects an expectation you have about the external library. If any of those expectations stops being met (due to updates in the external code, changes in the API, the project maintainer being hit by a bus, etc), your code will break.

Each dependency reflects an expectation your code has about external code.

This is not necessarily bad. Your code needs to interact with the external world; otherwise it would be useless. You just need to be aware of this.
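
One common way to act on that awareness (a suggestion, not something the snippet above requires) is to funnel all of those expectations through a single wrapper you control, so that a breaking change in the external code means editing one class rather than every caller:

require 'external-project'

# all knowledge about ExternalClass lives here and nowhere else
class TaskRunner
  def run
    ExternalClass.new.do_task_1.process_further
  end
end

# the rest of your code depends only on TaskRunner#run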

Parameter hashes

Using parameter hashes can reduce the coupling that client code (code that uses your code) has with yours.

At the end of the day, it will give you more freedom, because you are not required to maintain the order in which your arguments are defined to keep client code from breaking.

In addition, you will be able to add as many parameters (in the hash) as you want to your methods without breaking clients that use older versions of your API.

def my_method
    var1 = SomeExternalClass.new({
        :foo => 10,
        :bar => 'a string'
    })
    var2 = SomeExternalClass.new({
        :bar => 'a string',
        :foo => 10
    }) 
    # when using parameter hashes, the order doesn't matter
    var3 = SomeExternalClass.new({
        :foo => 'bar'
    })
    # nor does it matter if you only define some parameters
end

Possible Drawbacks

This technique has its pros and cons. Some potential pitfalls are as follows:

  • A greater tendency to create classes/methods that do too much - when using parameter hashes, you may be tempted to provide a single method/class with many possible uses (because you can hide any number of parameters in a single hsh variable, for instance) and thus create a few overly burdened objects instead of many "lighter" ones.

  • Greater need for documentation - code should be clear enough to show what you are doing and how, while comments should generally be concerned with why you are doing something. Explicit parameter lists help others understand what a method does; if you use parameter hashes, these will be missing and you'll probably need to add some documentation defining the accepted parameters (see the sketch after this list).

  • Less support in IDEs - most IDEs use method definitions and explicitly-defined parameters to supply you with suggestions when you use IntelliSense-style autocompletion and other similar features.
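
Regarding the documentation point above, a comment block listing the accepted keys is usually enough. A minimal sketch (the method and its keys are hypothetical):

# Builds a report.
#
# Accepted keys for +opts+:
#   :title  - String, the report title (required)
#   :format - Symbol, either :pdf or :html (defaults to :pdf)
#   :pages  - Integer, maximum number of pages (optional)
def build_report(opts = {})
  # ...
end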


Software developers program machines, but we must remember that the actual end users of the systems we build are always humans.

With that in mind, and taking into account what we know about how humans interact with Computer Applications, we can derive a few ways in which we can enhance our applications using insights from neuroscience and psychology.

Cognitive Load

Broadly speaking, the term cognitive load refers to the amount of information you can keep in your head (short-term memory, actually) at the same time.

One of the things that demands the most attention when working with information systems, for instance, is dealing with irreversible actions.

Deleting or otherwise making permanent changes to data requires that users think very carefully over all possible outcomes such actions could lead to.

Irreversible actions put a strain on users' minds. All possible scenarios have to be considered before action is taken.

Cognitive load can also contribute to stress in the workplace, and it affects the impression users will have of your systems. Systems that don't overburden users' cognitive load (with irreversible actions, or by otherwise requiring them to keep large amounts of data in their minds rather than on the screen) will seem easier and more pleasant to work with.

An example of How 2-step Removals can Lessen your Users' Cognitive Load

A simple way you can help users feel more at ease when using an information system you have designed (and help reduce errors as well) is to implement some form of 2-step removal of domain objects.

Deferring actual removal until the day after (like a rubbish bin) can help reduce users' tension and cognitive load.

Most information systems deal with objects - in fact, the core of information systems is concerned with managing these objects (creating, reading, updating and deleting them - the old CRUD acronym). So you can bet that a significant percentage of your users' time will be spent on state-changing actions (not necessarily destroying objects, but creating, updating and deleting them).

A significant percentage of the actions carried out by users in software systems will cause some sort of state change.

If you have a simple safeguard in place to "let objects sit in the rubbish bin until they are actually removed", for instance, it could help users interact with your system in a more relaxed manner; they don't have to think very hard before deleting an object, because changes won't actually be put into effect until a day later - plenty of time to change one's mind should there be any need.

Implementation

As to the actual implementation of such safeguards, one relatively simple way to do it is to use a flag (a BOOLEAN or INT column, for example) on the database table where your objects are actually persisted.

Rather than actually removing a record from the database when a user clicks Delete, just set this flag to true to signal that this item is marked for deletion.

In addition to that, you need a script that runs regularly (maybe at the end of each day) and searches the database for records which have been marked for deletion - only then are such records actually deleted.
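
Here is a minimal sketch of both pieces, assuming a Rails/ActiveRecord model with hypothetical marked_for_deletion and marked_at columns:

class Item < ActiveRecord::Base
  # called when the user clicks Delete: the record is only flagged
  def soft_delete!
    update!(:marked_for_deletion => true, :marked_at => Time.now)
  end
end

# cleanup script, run regularly (via cron, for example): records
# flagged more than a day ago are only now actually deleted
Item.where(:marked_for_deletion => true)
    .where("marked_at < ?", 1.day.ago)
    .delete_all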




Below are some of the relevant factors that have contributed to the rise of new concepts like the Internet of Things (IoT for short) and Big Data - terms that have since left the realm of academia and entered the mainstream.

IPv4 to IPv6 Transition

We are right in the middle of a large transition from old-fashioned IPv4 to IPv6, but what does it mean for us? In comparison with IPv4, IPv6 supports roughly 10²⁸ times as many endpoints!

This means that it will be possible for every single device (even, say, each of thousands of heat sensors in a forest) to have its own network interface, able to send data to and receive data from potentially any other Internet-ready device in the world. Any tiny piece of hardware could, theoretically, be uniquely identifiable via an IP address.

IPv6 supports roughly 10²⁸ times as many addresses as IPv4
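
That figure comes from simple arithmetic: IPv4 addresses are 32 bits long while IPv6 addresses are 128 bits long, so the ratio between the two address spaces is 2⁹⁶:

ipv4_addresses = 2**32    # roughly 4.3 billion
ipv6_addresses = 2**128
ipv6_addresses / ipv4_addresses # => 2**96, roughly 7.9 * 10**28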

Explosion of Devices and Data

In addition to the vastly larger number of addresses available through IPv6, the cost of hardware has gone down over the last few years, while newer and faster CPUs and hard disks have been developed.

This has contributed to what is being called the commoditization of processing power and storage space.

Last year (2013), there were over 10 billion connected devices, and this number is expected to climb as high as 50 billion by 2020, according to an estimate by networking equipment maker Cisco (source).

[figure: the hockey stick effect]

Key areas

  • GI Systems

    GIS, short for Geographical Information Systems, is the umbrella term for systems whose objective is to store large quantities of coordinates and/or extra information related to them.

    With the increase in the number of mobile and handheld devices, as well as the aforementioned explosion in the overall number of devices (including static ones), it has become ever more convenient to store event locations and/or user actions as defined points in time and space in GISs.

  • Sensors

    Sensors are becoming economically viable for many industry sectors such as manufacturing, agriculture, energy generation and so on.

    Each sensor typically emits data at a predefined rate or when some threshold conditions are met. This means lots of data gets sent to a database and needs to be acted upon, sometimes even in real time.

  • Social Media

    User-generated content is rising to heights never before seen, now that large populations (which until very recently didn't have access to the Internet) are becoming regular Internet users all around the world.

    It is hard to say whether social media has been more a consequence of this phenomenon than one of its causes, but social networks are among the organizations storing the most data nowadays - many popular open-source tools for big data manipulation originated at places like Facebook, Google and Yahoo.

  • Logs

    Disk space has become so cheap that most devices and applications are configured to log everything that can be logged, on the off chance it might some day, somehow, be useful to someone.




The sheer scale of the data involved in monitoring IT infrastructures has been prompting changes to all but the most naive of the traditional SIEM (Security Information and Event Management) solutions, and most of these changes involve dealing with and analysing large data sets - hence the connection with the whole Big Data movement.

Big Data is changing the landscape for SIEM providers; in most cases it's not just a difference of scale: throwing bigger and faster hardware at the problem just won't do.

Some of the issues that arise in the day-to-day operation of such systems are as follows:

Long Time Horizons

Data (in the form of logs, mostly) needs to be stored for increasingly long periods of time because sometimes the context is what separates a real threat from false positives.

One small incident is perhaps not relevant if it happens only once, but the same issue happening every day for six months might be indicative of something lurking around the corner.

This long-horizon pattern is exactly what characterizes APTs (Advanced Persistent Threats), so an effective SIEM system needs elements to detect and act upon them.

Inadequate Technologies

Most SIEM solutions are based on traditional, relational Database Management Systems, which are not meant for this type of large, unstructured and relatively static data.

Inconsistent Data Formats

The sheer variety of log types and formats presents, in and of itself, a challenge for traditional SIEMs, which are generally based upon database systems that need some sort of regularity in the data. Companies are trying to move away from having to define each new log format in terms the underlying persistence layer can understand.

Store Once, Read Multiple Times

Logs and other types of monitoring information (both real-time and otherwise) aren't meant to be edited or changed in any way. They are mostly timestamped and automatically generated by devices and/or applications.

Many companies therefore find themselves using technologies meant for other types of data, which further aggravates the problem.

Not Knowing what to Look For

Users don't always know what to look for when trying to establish correlations between different events (present and/or past); for example, after an incident has taken place they may want to carry out a forensic examination.

SIEM solutions must allow for ad hoc reporting and visualization so that end-users can use the system in ways the original designer didn't think about.

Stretching this notion a little, many users treat their SIEM as a kind of log search engine that provides unopinionated visualization, giving users the tools to see correlations and connections between data sources themselves rather than drawing those conclusions for them.

Similar Data that Doesn't Look So

Different devices sometimes describe data in device-specific ways, which makes it extra difficult for systems to determine what's similar and what's not.

For example, you might have two firewalls in your network, where one logs drops as DROP: <IP> <TIMESTAMP> and the other as DENY <TIMESTAMP> <IP>, or something like that. Systems need to be able to infer similarities like these, treat them as a single entity (a Firewall Drop) and smooth out small variations like this.
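
A minimal sketch of that kind of normalization, using the two made-up log formats above:

# each firewall's format, mapped onto the same named captures
DROP_PATTERNS = [
  /\ADROP: (?<ip>\S+) (?<timestamp>\S+)\z/, # first firewall
  /\ADENY (?<timestamp>\S+) (?<ip>\S+)\z/   # second firewall
]

# returns one normalized event regardless of the source format
def parse_firewall_drop(line)
  DROP_PATTERNS.each do |pattern|
    if (m = pattern.match(line))
      return { :type => :firewall_drop, :ip => m[:ip], :timestamp => m[:timestamp] }
    end
  end
  nil
end

parse_firewall_drop("DROP: 10.0.0.5 2014-07-21T00:16:00")
parse_firewall_drop("DENY 2014-07-21T00:16:00 10.0.0.5")
# => both return {:type=>:firewall_drop, :ip=>"10.0.0.5", :timestamp=>"2014-07-21T00:16:00"}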




  • R Heads-up and Tips for Beginners (reminders, 21 Jul 2014 00:16)
  • Grep usage examples (reminders, 19 Jul 2014 01:31)
  • Linux find Examples (reminders, 16 Jul 2014 01:14)
  • Querying an Elasticsearch Server using Scala (reminders, 15 Jul 2014 21:01)