Satya's blog
http://www.thesatya.com/blog/
Algorithms and data structures
http://www.thesatya.com//blog/2023/03/algorithms.html
<p>
In day-to-day programming, I rarely think about algorithms and data structures.
Binary trees. Sorting.
</p><p>
That's because algorithms are already encapsulated in
the standard libraries of the programming languages I use.
Languages like Python, Ruby, and Golang have sort functions in their standard libraries.
</p><p>
For most data structure purposes, lists and associative arrays (dictionaries, aka key-value) are enough.
</p><p>
Faster algorithms? Processors are relatively cheap, parallelize it.
</p><p>
Memory? Memory is cheap and if that's not enough, spool it out to disk. If that's slow, parallelize it.
</p><p>
No, what's expensive today is network calls. And that's only because everything's fast enough that we can make API calls -- that's <i>Application Programming Interface</i> -- over the Internet.
</p><p>
Of course there are still situations where algorithms and data structures
matter. The preceding just talks about 90% of general programming these days:
websites and backends. Someone still has to make the standard and
not-so-standard libraries of all
these languages work, and if you're in embedded-devices land then all bets are
off.
</p>
<br/><hr/>Comments:<br/>2023-03-28T21:32:46+00:00
<br/><hr/>
Git bisect
http://www.thesatya.com//blog/2022/07/git_bisect.html
<p>
For those that don't know (and those that do): git bisect is a very cool tool for
answering "which commit broke everything?"
I used it recently. I have a page which suddenly started
showing blanks instead of the numbers that should be there. So I used git
bisect. First I did `git bisect start`.
</p><p>
I knew the breakage was on my "test" branch, so I checked that master was good (it
was) and that the tip of test was bad (it was), then marked them with
`git bisect good` and `git bisect bad` respectively.
</p><p>
Then I just kept reloading the page, marking each commit as good or bad
depending on whether the numbers showed or not.
</p><p>
`git bisect` does the work of checking out the "next" commit once you've named
two commits good and bad. It does a binary search: it checks out a commit
halfway between the last-marked good and bad commits, and when you mark that one
good or bad, it moves HEAD halfway into the remaining range. It's cool to
watch in action.
</p><p>
<a href="https://git-scm.com/docs/git-bisect">https://git-scm.com/docs/git-bisect</a>
</p><p>
You can also use this to find the "commit that introduced the change", it
doesn't have to be a breaking change. Just so long as you have a quick way to
test whether a particular commit falls on the "good" or "bad" side.
In my case, that was just reloading the page to see if numbers showed up.
Other ways are to run a unit test, run a script, or whatever.
</p>
<br/><hr/>Comments:<br/>2022-07-13T07:59:58+00:00
<br/><hr/>
Smart car
http://www.thesatya.com//blog/2004/10/smartcar.html
I want a Smart car!
<br/><br/>
http://www.zapworld.com/cars/smartCar.asp
(dead link, new <a
href="https://boxercycles.com/zap-electric-vehicles/">https://boxercycles.com/zap-electric-vehicles/</a>
)
<br/><br/>
I wish it was cheaper, though :-(
<br/><hr/>Comments:<br/>2020-02-22T09:08:42+00:00
<br/><hr/>
Agile best practices
http://www.thesatya.com//blog/2018/09/agile_practices.html
There are many things that people call "Agile software development". Here's
how I do it. It's similar to how Pivotal Labs did it (and for all I know,
still do).
<h3> Have a force-ranked issue system </h3>
<p>
Issues (tickets, bugs, stories, whatever you call them) should be in a single
list.
The list should be ordered by the priority in which your project manager wants
things done. That way, the people working can always just pull the
top-most item off the list. Yes, please self-assign items as you start work on
them so others know they are being worked on.
</p>
<p>
There should be one list per team or whatever unit of
organization you have. Any member of the team can pick up the next item.
There is no confusion about which team owns an item.
</p>
<p>Please include tech-debt items! And please allow ICs to add items.</p>
<h3>What counts as a ticket/issue/story?</h3>
<p>
Preferably include smaller self-contained features or bugs. Try not to put
different things in the same issue, even if related or "in the same area".
This is of course subjective, but one rule-of-thumb is "can it be tested by the
PM or QA independently?"
</p>
<h3>Sprint</h3>
<p>
Sprint length of two weeks seems ideal. One week is too short, four is too
long. Three is just odd.
</p>
<p>
Sprint lifecycle should be something like: backlog grooming before the sprint,
planning meeting on the first day (or before), check-ins/standups during, and
a retro on the last day.
</p>
<p>
Backlog grooming: PM, EM, tech leads, and maybe some senior individual
contributors (depends on size of team) can go through the list of issues,
maybe triage if needed, re-order them if needed. Basically, prepare for the
actual planning meeting. Ideally backlog grooming is ongoing, but sometimes it
doesn't happen. Should be a 30-minute or less meeting.
</p>
<p>
Please ensure a
healthy, non-zero number of tech-debt/IC-contributed items are addressed.
This is a good time to ask for more info from stakeholders or to close stale
issues.
</p>
<p>
Planning: Whole team, issues should be quickly discussed so the whole team has
a general idea of what's involved. Issues can be assigned now, but try to let
ICs self-assign rather than handing them down. Should be an hour or less.
</p>
<p>
Check-ins/standups: Daily fast standup or every-other-day check-ins to ask and
answer "are we on target?" (and adjust stakeholder expectations).
</p>
<p>If it's a
daily fast standup, each person should quickly (within 30 seconds -- do NOT
time people) tell what they got done
yesterday, what they intend to work on today, and importantly, any blockers or
requests for help. No interruptions, jokes, side-tracking, or requests for
clarification. Break out after the stand-up if any of this is desired, and no
one has any obligation to stay for this followup. Note that this is often done
asynchronously in the team chat channel.
</p>
<p>
The check-ins or standups shouldn't be more than 15 minutes; if you're always
going over, you have a big team or a talkative team. Try to find a time that
works for everyone. Just before lunch is good motivation!
</p>
<p>
Retro: Various teams use various forms of retro, but the core seems to be
"what went well" and "what we could do better". Let everyone contribute by
writing their lists all together on a board or paper or something, and then
going around one by one to expand verbally. Shouldn't take more than 2 minutes
each, adjust for team size. Try to get action items for improvements. Try to
always have some positive items.
</p>
<p>
The retro can segue into backlog grooming or even planning for the next
sprint.
</p>
<br/><hr/>Comments:<br/>2018-09-19T17:32:05+00:00
<br/><hr/>
Effective issue tracker usage
http://www.thesatya.com//blog/2017/11/issue_tracker_practices.html
<p>
Suppose you use an issue tracker like JIRA or Pivotal Tracker (PT). Here's how
I use it effectively.
</p>
<p>
Stories, short for "user stories", issues, and tickets are used
interchangeably here. There are subtle differences (or not so subtle ones)
that don't matter for this discussion.
</p>
<p>
In JIRA you have multiple projects. Each project has a list
of tickets. This should be called the "Backlog". Additionally, you can create
a "Sprint".
</p>
<p>
In PT, in the default view, there are three columns: the "current
sprint", the "backlog", and an "icebox". I'd advise finding the setting to
enable stacking of current on top of backlog, because that gives a "truer"
view.
</p>
<p>
The (current) sprint contains stories that are being worked on in this
sprint. These stories will be in various states, such as Done, In Review, In
Progress, To Do. Those are the 4 major states for stories. PT enforces the
states while JIRA is more configurable. PT places stories in the sprint
automatically based on project velocity (which is based on story points). In
JIRA, the current sprint is manually determined during your sprint planning
meeting (aka iteration planning (IPM)).
</p>
<p>
Every story in the current sprint must have a story owner and (if your team
uses estimates) an estimate. Some shops do not assign a story owner until
someone starts the story, and then that person is the owner. This works well
in smaller teams consisting of mostly senior devs. PT enables this by
auto-assigning owners.
</p>
<p>
Additionally, stories may belong to epics. That's a good way in JIRA to group
stories. PT also has epics but I prefer to use Labels in PT. JIRA has labels
but they're less visible to me. I'd advise using Components in JIRA for the
same purpose.
</p>
<p>
The Backlog contains stories for upcoming sprints. They may be un-assigned,
un-estimated, not even fully fleshed-out. All of those things can happen
during the sprint planning meeting.
</p>
<p>
PT has an additional space, the Icebox. This is sort-of the ideas bin: stories
that haven't been thought through, or that are a reminder to actually add a
"real" story, or stuff that's deferred until some later date.
</p>
<p>
If team members run out of stories for the current sprint, they may be able to
pick up stories from the backlog. After checking with the project
manager, perhaps. Beware that stories in the backlog may not be quite ready to
be worked on.
</p>
<p>
PT forces all the stories to be in force-ranked order, which means someone
(project manager, in consultation with senior developers) needs to order the
stories. JIRA can also be configured to do force-ranking. Stories should
usually be placed in order of business priority and according to dependencies.
Feature Y has priority but also depends on Feature X? Place X first (higher) in the
sprint.
</p>
<p>
(Why the PM? That's who's running the project. Why the senior devs?
Because that's who can point out hidden dependencies. In a small enough team,
it'd be all the devs... which means you're having a sprint planning meeting.
Or a backlog-grooming meeting, which is a pre-planning meeting held because of
time constraints. These meetings need to be extremely focused, and I will
write a separate article about them.)
</p>
<br/><hr/>Comments:<br/>2017-11-23T15:40:08+00:00
<br/><hr/>
Straightforward git branching workflow
http://www.thesatya.com//blog/2017/11/git_branching_workflow.html
<p>
Here's a simple git branching workflow that I like and use.
</p>
<p>
All "production" work is on the master branch.
</p>
<p>
Start a feature, first create a new branch:
</p>
<pre>
git checkout -b feature-name
</pre>
<p>
Perhaps prefixed or suffixed with a ticket/issue identifier, e.g.
axp-1090-add-phone-field
</p>
<p>
Usually useful to push the branch to your git origin (often, GitHub) right
away, and have git track the remote. This is done in one command with:
</p>
<pre>
git push -u origin HEAD
</pre>
<p>
HEAD is a special ref pointing to the commit we're "on" right now.
</p>
<p>
Do the work. Maybe multiple commits, maybe some work-in-progress commits.
</p>
<pre>
git add path/to/file
git add -p
git commit -m "AXP-1090 Add phone field"
</pre>
<p>
`git add -p` makes git prompt for each change, hunk by hunk. Useful for reviewing
the changes and being selective. Maybe some changes should be in a different commit?
</p>
<p>
The git commit command can open a text editor for the commit message, or you
can specify the message as shown here.
</p>
<p>
To re-order commits, use `git rebase -i master`, an interactive rebase against
master. (Plain `git rebase -i` rebases against the branch's upstream, which
after the `push -u` above is origin's copy of this branch, not master.)
</p>
<p>
Move the commits as presented in the editor in the order you want. Note that
work-in-progress commits can be squashed by changing "pick" to "squash". Each
squashed commit will become part of the commit above it. And yes you can squash
multiple commits in a row.
</p>
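<p>
What the squash does can be sketched in a throwaway repo, with `sed` standing
in for hand-editing the todo list (repo contents and messages are made up):
</p>

```shell
# Throwaway repo: one real commit plus a work-in-progress commit.
set -e
cd "$(mktemp -d)"
git init -q .
git config user.email you@example.com
git config user.name you
echo phone > field.txt; git add field.txt; git commit -q -m "AXP-1090 Add phone field"
echo fix >> field.txt; git add field.txt; git commit -q -m "WIP"
# Change "pick" to "squash" on line 2 of the todo list (the WIP commit),
# non-interactively; GIT_EDITOR=true accepts the combined commit message.
GIT_SEQUENCE_EDITOR='sed -i 2s/^pick/squash/' GIT_EDITOR=true \
  git rebase -i --root
git log --oneline
```

The WIP commit is folded into the commit above it, exactly as if you had
changed "pick" to "squash" in your editor.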
<p>
Solve any conflicts. In the default configuration git should show you lots of help on how.
</p>
<p>
To integrate any changes that have happened in master, every now and then (with
a clean branch! see `git status`) do a fetch and rebase (and do one of these
when you're ready to merge your changes back to master):
</p>
<pre>
master:       M1 -> M2 -> M3
                \
your branch:     B1 -> B2
</pre>
<pre>
git fetch
git rebase origin/master
</pre>
<pre>
master:       M1 -> M2 -> M3
                            \
your branch:                 B1' -> B2'
</pre>
<p>
Beware that if you switch to master (git checkout master) you will still be
behind origin/master, and need to do `git rebase` *on master*.
</p>
<p>
When the branch is in a sufficiently clean state, push your work to the remote:
</p>
<pre>
git push
</pre>
<p>
If you've rebased, you'll need to use `git push -f`, i.e. "force-push". That
is why we use branches, and why everyone should be on their own branch:
force-pushing a shared branch would overwrite other people's work on it.
That is why we never force-push master (except when we do).
</p>
<p>
Use `git status` often, which is why I have a shell alias `gs` for `git
status`. And I have `git` itself aliased as `g`, with various git commands
shortened in my <a href="https://github.com/satyap/dotfiles/blob/master/gitconfig">~/.gitconfig</a>
</p>
<p>
To merge your changes to master, open a Pull Request on GitHub or, if you're
doing it yourself manually, you can merge.
First, rebase against master, then switch to master and then merge your branch.
</p>
<pre>
git rebase origin/master
git checkout master
git merge axp-1090-add-phone-field
git branch -d axp-1090-add-phone-field # optional, delete your branch
</pre>
<p>
At this point you should have a clean master that you can push (not force-push) to origin.
</p>
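<p>
The whole cycle can be exercised in a throwaway local repo (no remote here, so
the push steps are skipped; file and branch names are made up):
</p>

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git config user.email you@example.com
git config user.name you
trunk=$(git symbolic-ref --short HEAD)   # "master" or "main", per local git config
echo base > app.txt; git add app.txt; git commit -q -m "Initial commit"
# Start the feature on its own branch:
git checkout -q -b axp-1090-add-phone-field
echo phone >> app.txt; git add app.txt; git commit -q -m "AXP-1090 Add phone field"
# Merge it back; nothing moved on the trunk, so this fast-forwards:
git checkout -q "$trunk"
git merge -q axp-1090-add-phone-field
git branch -d axp-1090-add-phone-field   # optional, delete the branch
git log --oneline
```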
<ul>
<li>Learn git branching: <a
href="https://learngitbranching.js.org/?demo">https://learngitbranching.js.org/?demo</a></li>
<li>How to Write a Git Commit Message: <a
href="https://chris.beams.io/posts/git-commit/">https://chris.beams.io/posts/git-commit/</a></li>
</ul>
<br/><hr/>Comments:<br/>2017-11-18T13:41:19+00:00
<br/><hr/>
Spark with Databricks and AWS EMR
http://www.thesatya.com//blog/2017/11/databricks.html
<p>
<a href="https://databricks.com/">Databricks</a> is a web-based Spark notebook
system, from the company that wrote several Spark libraries. The
notebooks support Python and Scala.
</p>
<p>
Databricks runs your jobs in your AWS account, on EC2 instances that Databricks
manages. The initial setup is a little tricky.
</p>
<p>
Once everything is set up, it's a pretty good system for developing and
debugging Spark-based jobs. It can run jobs on a cron-like schedule. Databricks
has support for retrying failed jobs, and can notify about success/failure by
email. It gives good access to the spark worker logs for debugging.
</p>
<p>
However, this is expensive to run. I came up with a workflow: develop and
debug on Databricks, then export the notebook as a script
to be run on EMR (Elastic MapReduce, an AWS product).
I use Python and pyspark, so this works pretty well.
</p>
<p>
I do need to import the libraries that Databricks imports automatically. I
forget the exact import statements I used, but they are easy enough to figure
out.
</p>
<p>
I used <a href="https://pypi.python.org/pypi/luigi">Luigi</a>, which is a
Python library for setting up task workflows (see also: Airflow). I set up a
few abstract classes to encapsulate "this is an EMR job". Then I
extended one of those classes to actually run a spark job, something like:
</p>
<pre>
import luigi

class ActualJob(AbstractSparkJob):
    # I usually pass a timestamp to every task
    luigi_param = luigi.Parameter()

    def spark_cmd(self):
        return "this_command.py --runs --spark-script --located-in s3://example.com/foo.py"
</pre>
<p>
spark_cmd returns a command line string.
</p>
<p>
My AbstractSparkJob takes the given command and does one of two things: either
submit an EMR step to the EMR cluster, using command-runner.jar to run the
command, or ssh into the cluster and run spark-submit with the given command as
its parameters.
</p>
<p>
(I don't have all the code for that available to post right now, sorry.)
</p>
<p>
The abstract classes encapsulate all the work of spinning up an EMR cluster
(and shutting it down), and making the AWS API calls via the boto library, and
the work to ssh in and run spark-submit.
</p>
<p>
The abstract classes make it easy for any other data-infrastructure developer
to add more jobs.
</p>
<p>
The actual script that was exported from Databricks lives in S3 and is
referenced by the command shown above. EMR does the work of fetching it from S3 and running
it. Part of my data pipeline startup is to copy all the EMR scripts, which I
keep in a convenient subdirectory, up to a specific prefix in S3. That way my
actual Spark scripts all live in the same code repository as the rest of the
pipeline.
</p>
<br/><hr/>Comments:<br/>2017-11-17T12:18:59+00:00
<br/><hr/>
Power consumption
http://www.thesatya.com//blog/2017/02/powercomp.html
<table border="1">
<tr>
<td>Component</td><td>Power draw</td>
</tr><tr>
<td>CPU+RAM+fans</td><td>23W 210mA</td>
</tr><tr>
<td>+ HDD
</td><td>37W 240mA</td>
</tr><tr>
<td>+ GTX 1050</td><td>57W 500mA</td>
</tr>
</table>
<br/>
<br/><hr/>Comments:<br/>2017-02-20T14:17:06+00:00
<br/><hr/>
How to back up Samsung Galaxy S4 using adb
http://www.thesatya.com//blog/2017/01/s4_backup.html
<p>
Connect the phone via USB. Go into Settings, About Phone, and tap Build
Number about 7 times. Then Developer Options becomes available; use that to
turn on USB Debugging.
</p>
<p>
Use `lsusb` to get the phone's code. Example output:
</p>
<pre>
Bus 001 Device 059: ID 04e8:6866 Samsung Electronics Co., Ltd GT-I9...
</pre>
<p>
Create a file `~/.android/adb_usb.ini` containing the line
"0x04e8", based on the lsusb output above.
</p>
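<p>
The vendor id can also be pulled out of the lsusb line with sed instead of by
eye. A small sketch, using the hypothetical lsusb line from above:
</p>

```shell
# Extract the vendor id (the part before the colon in "ID 04e8:6866").
line='Bus 001 Device 059: ID 04e8:6866 Samsung Electronics Co., Ltd GT-I9...'
vendor=$(printf '%s\n' "$line" | sed 's/.*ID \([0-9a-f]*\):.*/\1/')
echo "0x$vendor"   # 0x04e8 -- the line that goes into the ini file
```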
<p>
As root, drop a file called /etc/udev/rules.d/s4-android.rules containing the
following:
</p>
<pre>
SUBSYSTEM=="usb", SYSFS{idVendor}=="04e8", MODE="0666"
</pre>
<p>
Note that the vendor id is again from the lsusb above.
</p>
<p>
Run this too (as root):
</p>
<pre>chmod 644 /etc/udev/rules.d/s4-android.rules</pre>
<p>Chef recipe for that file:</p>
<pre>
file 'udev_rule_s4_android' do
  path '/etc/udev/rules.d/s4-android.rules'
  mode '0644'
  owner 'root'
  group 'root'
end

execute 'udev-restart' do
  command '/etc/init.d/udev restart'
  subscribes :run, 'file[udev_rule_s4_android]', :immediately
  action :nothing
end
</pre>
<p>
Connect the phone in 'PTP' mode:
<br/>
Connect the phone via USB. Drag down the notifications, click on the
'Connected as...' notification, and set it to PTP or Camera mode.
</p>
<p>
`adb shell` should now get you a shell on the phone.
</p>
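<p>
Getting a shell is the last step above; the backup itself is done with
`adb backup` (a real adb command on phones of this era, though deprecated on
newer Android). The filename is made up, and the command is only printed here
since no device is attached:
</p>

```shell
# Full backup of apps (including APKs) and shared storage to a local file.
backup_cmd='adb backup -apk -shared -all -f s4-backup.ab'
echo "$backup_cmd"
# Restore onto a phone later with: adb restore s4-backup.ab
```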
<p>
References:
</p>
<ul>
<li><a
href="http://stackoverflow.com/questions/5510284/adb-devices-command-not-working">http://stackoverflow.com/questions/5510284/adb-devices-command-not-working</a></li>
<li><a
href="https://ubuntuforums.org/showthread.php?t=2298370">https://ubuntuforums.org/showthread.php?t=2298370</a></li>
<li><a
href="http://askubuntu.com/questions/213874/how-to-configure-adb-access-for-android-devices">http://askubuntu.com/questions/213874/how-to-configure-adb-access-for-android-devices</a></li>
<li><a
href="http://android.stackexchange.com/questions/21112/new-phone-how-to-transfer-game-progress">http://android.stackexchange.com/questions/21112/new-phone-how-to-transfer-game-progress</a></li>
</ul>
<br/><hr/>Comments:<br/>2017-01-22T17:11:54+00:00
<br/><hr/>
Star Wars: Rogue One
http://www.thesatya.com//blog/2016/12/rogue_one.html
<p>
Watched Rogue One on Dec 15th at the <a href="https://hired.com">hired.com</a> event.
</p>
<p>
Spoilers ahead!
</p>
<p>
Call-backs to the original movies:
</p>
<ol>
<li> Blue milk (is that an even older call-back to "Blue Harvest"?)</li>
<li> The Governator</li>
<li> The Princess</li>
<li> The reprise of the blockade runner boarding scene (at the airlock *to* the blockade runner)</li>
<li> Choking people</li>
<li> Imperial battle station design elements</li>
<li> "Commence primary ignition" (Which I always heard as the nonsensical "Commence spider ignition")</li>
<li> The guy who has the death sentence in twelve systems</li>
<li> Nice reveal at the end there. Saw it coming at the exact right moment (not too early, not too late)</li>
</ol>
<p>
Random observations:
</p>
<p>
Somewhere, some time, I remember reading a thing that was supposed to be an early draft of the original movie. It had a subtitle "Journal of the Whills", and Biggs Darklighter and his brother Luke were searching for (the? singular?) Kyber crystal. I can't find this thing anywhere, though Gizmodo seems to remember the same thing: <a href="http://io9.gizmodo.com/all-the-major-star-wars-cameos-and-connections-you-may-1790195147">http://io9.gizmodo.com/all-the-major-star-wars-cameos-and-connections-you-may-1790195147</a>
</p>
<p>
Who cleans all the imperial installations and mirror-polishes everything?
</p>
<p>
At least that platform at the data archive had a hand-rail. Who builds platforms over sheer drops, and why? Was that control panel there so you can see the dish while aiming it? How would that even work, you don't aim a dish like that by eye!
</p>
<p>
Things I didn't like:
</p>
<p>
The punning: "Choke on your aspirations"? (That's a double pun btw). It doesn't fit the character, I'm sorry, I can't take that. It's as bad as Han shot first.
</p>
<p>
While the quasi-Jedi guy was awesome, I felt that the character was simply shoe-horned in there for comic relief. One droid (played by Alan Tudyk! +1!) is enough for comic relief, thanks.
</p>
<p>
Inexplicable things:
</p>
<p>
They have a shield around an entire planet, where is it generated from? The DS-2 shield in ROTJ was generated from a base on (the moon of) Endor.
</p>
<p>
That's a short list.
</p>
<p>
Observations about the Governor:
</p>
<p>
So we know the actor has been dead for two decades. Apparently they got someone else to be the body double (see wikipedia article for the movie). I assume they texture-mapped the original actor's face on. To me it seemed as if they'd got a character out of a video game.
</p>
<p>
The effect was *just* a *little* bit short of real. Maybe because I was looking for it. I'm trying to read it as the Governor has cold dead eyes because he's that much of a stone-cold whatever, not because it's animated.
</p>
<p>
Sadly I spent so much time staring at the effect that I missed the dialogue.
</p>
<br/><hr/>Comments:<br/>2016-12-18T07:37:46+00:00
<br/><hr/>
Moving things in AWS S3
http://www.thesatya.com//blog/2016/02/move_aws.html
<p>
Recently I had several hundred small files in an AWS S3 bucket, in folders by date.
</p>
<p>
Something like s3://bucket/2016-02-08/ (and a couple layers deeper), with a few "directories" under the dated "directory".
(The sub-directories were the first letter of the ... things, so I had about 62
sub-directories (A-Z, a-z, 0-9). This is a naive hash function.)
</p>
<p>
I wanted to move them into a year-based "directory", so s3://bucket/2016/2016-02-08/
</p>
<p>
(Why am I quoting the word "directory"? Because they're not really directories,
they're "prefixes". This is relevant if you use the AWS S3 SDK libraries.)
</p>
<p>
Moving them via the S3 web "console"'s cut/paste interface is slow. REALLLLLY slow. Like, multiple-days slow.
</p>
<p>
So I (after trying a few other things) pulled out the aws command-line tool (AWS CLI).
</p>
<p>
Since the sub-directories were letters and numbers, I could do this:
</p>
<pre>
for x in a b c d; do
  echo aws s3 mv --recursive s3://bucket/2016-02-08/$x \
    s3://bucket/2016/2016-02-08/$x \&
done > scr.sh
</pre>
<p>
The `for` loop runs through a, b, c, d (different from A, B, C, D), and sets up
a recursive move operation. This move is much faster using the AWS CLI.
Additionally, I background the process of moving the 'a's (using the `\&`) so
the 'b's can start right away, and so forth.
</p>
<p>
But I don't run the commands right away. Notice that they're being `echo`ed.
Capture the output in a file scr.sh, and run the scr.sh. Why?
</p>
<p>
Because I can now
set up a second file with e f g h, to go right after the first, or even in
parallel. So now I have up to 4 or 8 move operations going at once.
I watch the whole thing with `watch "ps axww|grep scr"` in a separate terminal,
of course.
</p>
<p>
But mainly because the `&` backgrounding interacts weirdly with the for loop.
</p>
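<p>
Minus the actual AWS work, the pattern looks like this; the bucket name is
hypothetical, and nothing moves because the aws commands are only echoed into
the script:
</p>

```shell
# Generate scr.sh: one backgrounded "aws s3 mv" per sub-directory.
for x in a b c d; do
  echo "aws s3 mv --recursive s3://bucket/2016-02-08/$x s3://bucket/2016/2016-02-08/$x &"
done > scr.sh
cat scr.sh
# Review it, then run it with: sh scr.sh
```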
<p>
With this, I was done in, well, a couple of hours. A lot of that was waiting for
the last copy-paste I had run in the web console to finish.
</p>
<br/><hr/>Comments:<br/>2016-02-15T22:03:08+00:00